feat(ui-server): add opt-in Prometheus /metrics endpoint (closes #851) (#1912)
* feat(ui-server): add opt-in Prometheus /metrics endpoint (closes #851)
Expose simulator state on GET /metrics when uiServer.metrics.enabled=true.
The endpoint is grafted into UIHttpServer.requestListener after the
existing runRequestPrologue (access policy + rate limit) and authenticate
calls, so it inherits per-client rate limiting, host/origin/proxy
validation and basic-auth credentials with no additional surface.
Prometheus is HTTP-only by spec; the flag is honored only when
uiServer.type=http, and AbstractUIServer.warnIfMisconfigured logs a
warning otherwise.
Metric families (exhaustive within the data already exposed via
ChargingStationData and Bootstrap.getState()):
* Global: simulator_info{version}, simulator_started,
simulator_charging_station_templates_total,
simulator_charging_stations_{configured,provisioned,added,started}_total,
simulator_ui_server_known_stations_total,
simulator_template_{added,configured,provisioned,started}{template}.
* Per charging station (label hash_id): simulator_station_info{vendor,
model, firmware_version, ocpp_version, current_out_type},
simulator_station_started, simulator_station_ws_state (numeric 0-3),
simulator_station_connectors_total, simulator_station_evses_total,
simulator_station_max_power_watts, simulator_station_max_amperage_amperes,
simulator_station_voltage_out_volts,
simulator_station_data_timestamp_seconds,
simulator_station_boot_status_info{status},
simulator_station_boot_heartbeat_interval_seconds,
simulator_station_atg_enabled,
simulator_station_diagnostics_status_info{status},
simulator_station_firmware_status_info{status},
simulator_station_ocpp_config_keys_total.
* Per connector (labels hash_id, connector_id):
simulator_connector_status_info{status} (one-hot over
ConnectorStatusEnum), simulator_connector_boot_status_info{status},
simulator_connector_availability_info{availability},
simulator_connector_error_code_info{error_code},
simulator_connector_type_info{connector_type},
simulator_connector_locked,
simulator_connector_transaction_{started,pending,remote_started},
simulator_connector_transaction_seq_no,
simulator_connector_transaction_event_queue_size,
simulator_connector_transaction_id (numeric value only, never label),
simulator_connector_transaction_start_seconds,
simulator_connector_transaction_energy_active_import_register_wh,
simulator_connector_energy_active_import_register_wh,
simulator_connector_max_power_watts,
simulator_connector_charging_profiles_total,
simulator_connector_evse_id, simulator_connector_reservation_active.
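To illustrate the one-hot encoding noted for simulator_connector_status_info,
the sketch below renders one sample per status value and sets only the current
one to 1. The helper name oneHotStatusLines and the three-value status list are
assumptions made for the example; the commit describes the real encoding as
one-hot over the full ConnectorStatusEnum.

```typescript
// Illustrative subset of statuses (assumption); the commit one-hots over
// the full ConnectorStatusEnum.
const STATUSES = ['Available', 'Charging', 'Faulted'] as const
type Status = (typeof STATUSES)[number]

// Hypothetical helper: one exposition line per status, 1 for the current
// status and 0 for every other one.
function oneHotStatusLines (hashId: string, connectorId: number, current: Status): string[] {
  return STATUSES.map(
    status =>
      `simulator_connector_status_info{hash_id="${hashId}",connector_id="${connectorId}",status="${status}"} ${status === current ? 1 : 0}`
  )
}

const statusLines = oneHotStatusLines('station-1', 1, 'Charging')
```

Exactly one of the returned lines carries the value 1, so a PromQL filter on
the status label can select the connector's current state.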
PII reject-list (HARD): supervisionUrl, chargeBoxSerialNumber,
chargePointSerialNumber, meterSerialNumber, iccid, imsi,
authorizeIdTag/localAuthorizeIdTag/transactionIdTag/transactionGroupIdToken
and OCPP customData. transactionId and remoteStartId are never used as labels
(unbounded cardinality). Adversarial label values are escaped per
prom-client's exposition-format rules so synthesized # HELP / # TYPE
lines cannot be injected.
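The escaping that blocks this kind of injection can be sketched as below. This
is a standalone illustration of the text exposition format's label-value rule
(backslash, double quote and line feed are escaped), not prom-client's own
code; escapeLabelValue is a hypothetical name.

```typescript
// Hypothetical helper illustrating label-value escaping in the Prometheus
// text exposition format. Backslashes are handled first so the escapes
// added for quotes and line feeds are not escaped a second time.
function escapeLabelValue (value: string): string {
  return value
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/\n/g, '\\n')
}

// A hostile vendor string that tries to close the label and start a new
// "# HELP" line...
const hostile = 'ACME"\n# HELP injected fake'
// ...stays on a single physical line once escaped.
const escapedLine = `simulator_station_info{vendor="${escapeLabelValue(hostile)}"} 1`
```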
Cardinality soft cap: one warning is logged per scrape when the total
sample count exceeds METRICS_SOFT_SAMPLE_CAP=5000. The response is still
served in full so operators retain observability while being alerted
to fleet growth.
Adds prom-client@^15.1.3 and 17 new tests under
tests/charging-station/ui-server/UIMetricsEndpoint.test.ts covering
end-to-end exposition, security inheritance (access policy / rate
limit / basic-auth), the PII reject-list, exposition-format escaping
and the soft-cap warning. README.md updated with a "Metrics endpoint
(Prometheus)" subsection (config example, sample output, basic-auth
compatibility note, privacy/cardinality note, scrape_config snippet).
Refs: #851
* fix(ui-server): rebuild metrics registry on start to recover from stop/start cycle (M1)
The metrics registry was built only in the constructor and cleared on
stop(); after Bootstrap.stopUIServer() then startUIServer() (live-reload
path on uiServer.enabled toggle), metricsRegistry was undefined and
/metrics silently fell through to the JSON-RPC parser as a 400
Malformed URL with no log to indicate misconfiguration.
Move the registry build into start() guarded by an idempotency check
so the endpoint recovers across restart cycles. The constructor no
longer touches metricsRegistry.
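A minimal sketch of the start()/stop() lifecycle described above, under stated
assumptions: the class name MetricsLifecycleSketch, the buildCount counter and
the Map standing in for the prom-client registry are all hypothetical, chosen
so the example runs without dependencies.

```typescript
// Hypothetical stand-in for the server: a Map plays the role of the
// metrics registry.
class MetricsLifecycleSketch {
  public buildCount = 0
  private registry: Map<string, number> | undefined

  public isServing (): boolean {
    return this.registry !== undefined
  }

  public start (): void {
    // Idempotency guard: build only when no registry is live.
    if (this.registry === undefined) {
      this.registry = this.buildRegistry()
    }
  }

  public stop (): void {
    this.registry?.clear()
    this.registry = undefined
  }

  private buildRegistry (): Map<string, number> {
    this.buildCount += 1
    return new Map([['simulator_started', 1]])
  }
}

const lifecycle = new MetricsLifecycleSketch()
lifecycle.start()
lifecycle.start() // second start is a no-op thanks to the guard
lifecycle.stop()
lifecycle.start() // a restart rebuilds the registry
```

Because the constructor never builds the registry, a stop/start cycle always
ends with a live registry instead of an undefined one.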
Refs: #851 (Phase 4 review M1)
* feat(ui-server): expose metrics.softSampleCap in canonical defaults (M3)
The soft cardinality cap was documented as something operators may
'scale' but was not in the canonical defaults map. Add an optional
softSampleCap to UIServerMetricsConfigurationSchema (default 5000 via
METRICS_SOFT_SAMPLE_CAP) so operators can tune the threshold without
patching code, per AGENTS.md options-and-configuration rule.
Refs: #851 (Phase 4 review M3)
* fix(ui-server): serialize concurrent /metrics scrapes (M2 + m6 + m5)
Two concurrent GET /metrics requests interleaved on the shared
metricsSampleCount field and shared Gauge state, making the soft-cap
warning unreliable and the body content racy. Serialize scrapes via a
per-server promise chain; stop() now awaits the in-flight scrape
before clearing the registry, so a request mid-await no longer sees
undefined gauges.
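The promise-chain serialization can be sketched as follows. This is an
illustration of the technique rather than the code in UIHttpServer: the class
ScrapeChainSketch, its run and whenIdle methods and the fakeScrape helper are
hypothetical names.

```typescript
// Hypothetical serializer: each task starts only after the previous one
// has settled, whether it resolved or rejected.
class ScrapeChainSketch {
  private chain: Promise<unknown> = Promise.resolve()

  public run<T> (task: () => Promise<T>): Promise<T> {
    const result = this.chain.then(() => task())
    // Store a handled promise so a failed scrape neither blocks later
    // scrapes nor surfaces as an unhandled rejection.
    this.chain = result.catch(() => undefined)
    return result
  }

  // What a stop() could await before clearing shared state.
  public whenIdle (): Promise<unknown> {
    return this.chain
  }
}

const scrapeOrder: string[] = []
const scrapeChain = new ScrapeChainSketch()
const fakeScrape = (name: string) => async (): Promise<string> => {
  scrapeOrder.push(`${name}:start`)
  await Promise.resolve() // yield, as a real scrape would while collecting
  scrapeOrder.push(`${name}:end`)
  return name
}
const bothScrapes = Promise.all([
  scrapeChain.run(fakeScrape('A')),
  scrapeChain.run(fakeScrape('B'))
])
```

Without the chain, the two fake scrapes would interleave at the await point;
with it, the second one starts only after the first has finished.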
Also adds the headersSent / writableEnded guard on the resolve path
so a client that closes mid-await does not trigger an emit-after-end.
The cap value is now read from uiServerConfiguration.metrics?.softSampleCap
(falls back to METRICS_SOFT_SAMPLE_CAP), wiring up the M3 schema field.
Refs: #851 (Phase 4 review M2 + m6 + m5)
* fix(ui-server): align iterateConnectors with simulator_station_connectors_total either-or (m2)
The per-connector iteration helper unconditionally yielded entries from
both data.connectors and data.evses, while simulator_station_connectors_total
uses an either-or rule. In current production these arrays are mutually
exclusive (buildConnectorEntries returns [] when hasEvses=true), so the
collision was unreachable, but the inconsistency made the helper fragile
to future invariant changes. Align it with the gauge's logic.
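A sketch of the aligned either-or logic, assuming trimmed-down data shapes.
The interfaces and the listConnectors name are hypothetical, and so is the
choice to let EVSEs take precedence when both arrays are populated; the real
helper works on ChargingStationData.

```typescript
// Hypothetical, reduced shapes; the real ChargingStationData is richer.
interface ConnectorSketch {
  connectorId: number
}
interface EvseSketch {
  connectors: ConnectorSketch[]
  evseId: number
}
interface StationDataSketch {
  connectors: ConnectorSketch[]
  evses: EvseSketch[]
}
interface ConnectorRef {
  connectorId: number
  evseId?: number
}

// Either-or: read connectors from the EVSEs when any exist, otherwise
// from the flat connectors array. Never from both.
function listConnectors (data: StationDataSketch): ConnectorRef[] {
  const refs: ConnectorRef[] = []
  if (data.evses.length > 0) {
    for (const evse of data.evses) {
      for (const connector of evse.connectors) {
        refs.push({ connectorId: connector.connectorId, evseId: evse.evseId })
      }
    }
    return refs
  }
  for (const connector of data.connectors) {
    refs.push({ connectorId: connector.connectorId })
  }
  return refs
}

const flatOnly = listConnectors({ connectors: [{ connectorId: 1 }, { connectorId: 2 }], evses: [] })
const bothPopulated = listConnectors({
  connectors: [{ connectorId: 9 }],
  evses: [{ connectors: [{ connectorId: 1 }], evseId: 1 }]
})
```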
Refs: #851 (Phase 4 review m2)
* fix(ui-server): accept HEAD on /metrics for liveness-probe scrapers (n3)
Some Prometheus tooling HEAD-probes the endpoint before scraping.
Treat HEAD identically to GET in isMetricsRequest; Node drops the body
automatically on HEAD responses.
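The method check can be sketched as below. HttpMethodSketch and
isMetricsRequestSketch are hypothetical stand-ins for the real enum and
method, and matching on the path while ignoring the query string is an
assumption made for the example.

```typescript
// Hypothetical stand-in for the repository's HttpMethod enum.
enum HttpMethodSketch {
  GET = 'GET',
  HEAD = 'HEAD',
  POST = 'POST'
}

function isMetricsRequestSketch (method: string | undefined, url: string | undefined): boolean {
  // HEAD is accepted exactly like GET.
  if (method !== HttpMethodSketch.GET && method !== HttpMethodSketch.HEAD) {
    return false
  }
  // Assumption: compare the path only and ignore any query string.
  return url?.split('?')[0] === '/metrics'
}
```

Under these assumptions, GET /metrics and HEAD /metrics both match, while
POST /metrics and GET /metricsx do not.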
Refs: #851 (Phase 4 review n3)
* style: replace 'honoured' with 'honored' for spelling consistency (n2)
Aligns the new metrics docstring/log with American spelling used
elsewhere in the README and source.
Refs: #851 (Phase 4 review n2)
* docs(readme): fix metrics sample output to match actual exposition (m4)
The previous sample showed simulator_charging_stations_configured_total
with a 'template' label, but that metric is global with no labels.
The per-template counter is the separate simulator_template_configured
gauge. Replace the sample with the actual exposition shape.
Refs: #851 (Phase 4 review m4)
* test(ui-server): add wsState=undefined and EVSE-mode coverage (M4)
The new tests close two regression-detection gaps identified in Phase 4
review: (a) removing the !== undefined guard on data.wsState would
silently emit NaN — now caught by an absence assertion; (b) the OCPP
2.0.x code path (evses populated, connectors empty, evseStatus.connectors
Map iteration) was entirely untested — now exercised end-to-end.
Refs: #851 (Phase 4 review M4)
* test(ui-server): add soft-cap boundary tests for off-by-one detection (m3)
Probe-then-verify pattern: first scrape with a very high cap to count
actual samples produced; then assert no warn fires when cap equals the
sample count (strict-greater-than semantics) and one warn fires when
cap is one below. This regression test would fail if '>' became '>='
on the soft-cap comparison.
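The boundary these tests pin down can be sketched as below. The function name
exceedsSoftCap and the MetricsConfigSketch shape are hypothetical; the default
of 5000 and the fallback from metrics.softSampleCap to METRICS_SOFT_SAMPLE_CAP
come from the commits above.

```typescript
const METRICS_SOFT_SAMPLE_CAP = 5000

// Hypothetical, reduced view of the uiServer.metrics configuration.
interface MetricsConfigSketch {
  enabled?: boolean
  softSampleCap?: number
}

// Strict greater-than: a scrape that produces exactly `cap` samples does
// not warn; one sample more does.
function exceedsSoftCap (sampleCount: number, metrics?: MetricsConfigSketch): boolean {
  const cap = metrics?.softSampleCap ?? METRICS_SOFT_SAMPLE_CAP
  return sampleCount > cap
}
```

Changing the comparison to '>=' would make the equal-to-cap case warn, which
is the off-by-one the probe-then-verify tests are meant to detect.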
Refs: #851 (Phase 4 review m3)
* test(ui-server): retarget rate-limit burst at allowed loopback path (m1)
T9 previously hit a non-loopback denied path, exercising the rate
limiter on a 403 response rather than on /metrics. Switch to a
loopback request that is allowed through the access policy and assert
that the rate limit fires on /metrics directly.
Refs: #851 (Phase 4 review m1)
* style(test): rename tests per TEST_STYLE_GUIDE and drop redundant assertion (n1, n4)
Renamed all tests from 'T<N>: <verb>...' to 'should <verb>...' format
per tests/TEST_STYLE_GUIDE.md. Also dropped the redundant pre-stop
assertion in 'should clear registry on stop()' (the post-stop
assertion already proves the stop() effect).
Refs: #851 (Phase 4 review n1, n4)
* refactor(ui-server): add HEAD to HttpMethod enum (n1)
Avoid the bare 'HEAD' string literal in UIHttpServer.isMetricsRequest by
extending HttpMethod with an explicit HEAD member, matching the AGENTS.md
guidance to prefer enumerations over string literals when one exists.
* refactor(ui-server): adopt prom-client canonical collect pattern (M1+m1+m2+m3+n2+n3+n4+n5+n6)
Phase 6 review feedback consolidated into one coherent surgery on the
metrics path:
- M1: defineGauge now returns Gauge<L> with auto-injected registers and a
string-literal label-name generic. Every collect() callback is non-arrow
with typed 'this: Gauge<L>', per the prom-client documentation. Eliminates
~20 unsafe 'as Gauge | undefined' casts and ~30 dead null-guards.
- m1: collapse the four global aggregate gauges into a single tuple-loop,
mirroring the per-template loop pattern.
- m2: append a terminal '.catch(() => undefined)' to the metricsScrapeChain
reassignment in stop() so the field always points to a handled promise.
- m3: document the metricsSampleCount/metricsScrapeChain concurrency
invariants; explicitly forbid async collect callbacks.
- n2: factor 'ChargingStationDataProvider' type alias at module scope.
- n3: drop the redundant outer 'async' on handleMetricsRequest.
- n4: replace box-drawing dividers with plain JSDoc section markers.
- n5: rename 'PII whitelist invariant' to 'PII allowlist invariant'.
- n6: document the OCPP either-or rule on iterateConnectors.
The HTTP /metrics contract is unchanged; existing tests pass without
modification.
* docs(readme): align metrics documentation with code (m4, m5)
- m4: replace the two stale '# HELP' lines in the sample exposition block
with the strings actually emitted by the registry (commit 1bac5c23d
fixed two of the four; the other two had drifted again).
- m5: add the 'metrics' sub-key to the uiServer row of the configuration
table: a bullet under the Description column documenting metrics.enabled
and metrics.softSampleCap, and the corresponding shape in the Value type
column. AGENTS.md 'Documentation conventions / Exhaustivity' treats the
table as the authoritative tunable list.
* test(ui-server): regression for concurrent /metrics scrapes (R1, R2)
Locks the metricsScrapeChain serialization invariant against future drift:
- R1: two concurrent scrapes against a registry whose sample count equals
the configured softSampleCap; correct serialized execution produces zero
warns. A regression that removes the chain (or makes any collect()
callback async) would either spuriously warn (counter shared between
scrapes) or corrupt the body, both detected by this test.
- R2: same test indirectly guards against async collect callbacks; the
invariant is also documented as a JSDoc on buildMetricsRegistry.
* refactor(ui-server): apply Phase 7 polish (M1+n2+n3+n4+n5+n6+n8+n9)
- M1: collapse 3 inline per-station gauges (ws_state, connectors_total,
boot_status_info) into existing helpers (~40 LOC less duplication).
- n2: simulator_station_connectors_total now counts via iterateConnectors,
making the iterateConnectors JSDoc 'shared with' claim literally true.
- n3: replace 'const self = this' with a typed 'provider:
IChargingStationDataProvider' built from .bind(this) method captures.
Drops the @typescript-eslint/no-this-alias suppression and narrows the
type surface accessed by collect() callbacks.
- n4: justify defineGauge<L extends string = never> default in the JSDoc
(stricter than prom-client's 'Gauge<T extends string = string>').
- n5: addPerStationInfoLabel hard-codes 'status' (all callers used it);
addConnectorOneHot narrows labelName to a literal union of valid keys —
both eliminate the theoretical 'hash_id'/'connector_id' clobber risk.
- n6: rename ChargingStationDataProvider to IChargingStationDataProvider
(interface form, repo I-prefix convention) and extend with
getChargingStationsCount.
- n8: tighten handleMetricsRequest and iterateConnectors @param
descriptions to remove low-signal restatements.
- n9: drop the explanatory '// Explicit return required by
promise/always-return lint rule' comment; the lint rule speaks for
itself when the line is removed.
* test(ui-server): tighten concurrent-scrape regression with sample-count assertion (n1)
Replace the weak 'bodyA === bodyB' check (which would pass even if the
metricsScrapeChain serialization were removed, since both scrapes
collect against the same Registry instance) with an exact sample-line
count assertion against the probed value. Locks two distinct invariants
that bodyA===bodyB did not: no truncation, and no double-count from
interleaved 'collect()' calls under prom-client's internal Promise.all.
* docs(readme): add trailing semicolon to metrics value type (n7)
Align the new 'metrics?: { ... }' member of the uiServer Value type
column with the surrounding style (every other nested-object member
in the same type literal terminates with ';').
* [autofix.ci] apply automated fixes
* refactor(ui-server): apply Phase 8 NITs (n2+n3+n4+n5+n7; n1 deferred)
- n2: rename addPerStationInfoLabel to addPerStationStatusInfo to match
the helper's actual responsibility (it hard-codes the 'status' label
per Phase 7 n5).
- n3: drop the I-prefix on ChargingStationDataProvider — most interfaces
in src/ are unprefixed; only IBootstrap uses I.
- n4: extend the ChargingStationDataProvider JSDoc to acknowledge that
the inline simulator_ui_server_known_stations_total gauge consumes the
getChargingStationsCount method (not only the helpers).
- n5: extract countConnectors as a single source of truth for the OCPP
either-or rule and have simulator_station_connectors_total consume it.
Also restructure the iterateConnectors JSDoc to cross-reference
countConnectors instead of the gauge.
- n7: tighten @param res on handleMetricsRequest to 'HTTP response to
end with the exposition body.' (restores the direction Phase 7 n8
shortened away).
- n1 (addConnectorOneHot generic threading) is deferred: TypeScript
cannot narrow the dynamic computed property '[labelName]: v' to
'Record<L, string>' without an 'as' cast, which AGENTS.md 'Type
safety' forbids. The runtime literal-union on labelName is kept as
the type-safety contract; a 3-line comment documents the trade-off.
* refactor(ui-server): apply Phase 9 NITs (n1+n2+n3)
- n1: thread Gauge<...> end-to-end on addConnectorOneHot via a
ConnectorOneHotLabel type + typed-init + property mutation pattern.
Phase 9 oracle B proved the cast-free pattern compiles cleanly under
strict TS 6.0.3 (probe with 8 alternatives, only this one passes).
Replaces the Phase 8 deferral comment, which mis-attributed an 'as'
ban to AGENTS.md (AGENTS.md only forbids '!' and 'any').
- n2: drop the self-{@link} on countConnectors JSDoc opening — the
block documents countConnectors itself, so the cross-reference back
to it reads awkwardly. Asymmetric pattern with iterateConnectors
preserved.
- n3: drop the duplicate '(see {@link countConnectors} for the
invariant source)' parenthetical on iterateConnectors JSDoc — the
leading 'same...rule as {@link countConnectors}' already directs the
reader, and 'invariant source' was jargon.
* docs(ui-server): harmonize JSDoc prose with repo cadence
Three Phase 9-pass-2 audit findings against the rest of the codebase:
- Drop the coined phrases 'either-or rule' and 'OCPP-version-driven'
from countConnectors and iterateConnectors JSDoc; the repo-wide
prose simply names 'OCPP 1.6' / 'OCPP 2.0.x' inline (see ChargingStation.ts
'$OCPP 2.0.1 §4.2.3', Helpers.ts 'OCPP 2.0 chargingSchedule',
TemplateSchema.ts 'OCPP 2.0.1 §7.2'). Use 'mutually exclusive' for the
source-split semantic.
- Drop the bold-stress '**Invariant**:' prose markers from the
metricsSampleCount, stop() and buildMetricsRegistry JSDoc;
`grep -rn '\*\*' src/` returns 0 such markers in JSDoc anywhere else. Inline the
invariant prose into the surrounding sentences without a section
marker.
* docs(test): harmonize 'soft cap' spelling and @description cadence
Two outliers found in the test file's prose during repo-wide audit:
- 'soft-cap' (3 occurrences: @description, 1 test name, 1 inline comment)
was hyphenated while the production warning string and ConfigurationSchema
JSDoc both spell it 'soft cap' / 'soft sample cap' (no hyphen). Tests'
string-includes('soft cap') matches were already correct; only the prose
drifted.
- @description was multi-line whereas every other test file in tests/
uses a single-line @description (verified across ChargingStation*.test.ts,
ConfigurationKeyUtils.test.ts, TemplateValidation.test.ts, etc.).
* docs(ui-server): harmonize misconfiguration warning style with repo cadence
The warnIfMisconfigured warning used an em-dash + uppercase 'NOT' to
separate the condition from the consequence:
metrics.enabled=true is only honored when uiServer.type='http';
current type='X' — the /metrics endpoint will NOT be served.
Sibling warnings in AbstractUIServer (host-not-allowed and tls-required
at lines 441-457) use semicolon-and-period separators with normal-case
prose and a final action sentence. Match that cadence:
metrics.enabled=true is honored only when uiServer.type='http'; current
type='X'. The /metrics endpoint will not be served. Set uiServer.type='http'
to expose metrics.
Test substring matches ('metrics' / 'http') are preserved.
* test(ui-server): rename M-prefix fixtures to T-prefix convention
The fixture station hash IDs 'station-M{evse,boundary,concurrent}' inherited
the 'M' prefix from the Phase 4 review's MAJOR finding IDs (M1..M4) and
were tokenized by cspell as unknown words 'Mevse', 'Mboundary',
'Mconcurrent' — emitting 7 lint warnings.
The rest of the metrics test file already follows the 'station-T<n>'
numeric convention ('station-T5', 'station-T12', 'station-T13',
'station-T16'), where <n> matches the test ordinal. Map the M-prefixed
fixtures to their actual test ordinals:
- 'Mevse' -> 'T18' (EVSE-mode test, the 18th 'await it()')
- 'Mboundary-*' -> 'T19-*' (soft-cap boundary, 19th)
- 'Mconcurrent-*' -> 'T20-*' (concurrent-scrape regression, 20th)
Quality gates now report 0 errors AND 0 warnings; previous baseline
was 0 errors / 7 warnings.
* docs(readme): drop redundant Metrics endpoint section, harmonize uiServer.metrics bullet
The dedicated '### Metrics endpoint (Prometheus)' section duplicated
information that is already authoritative elsewhere:
- The /metrics behavior, configuration shape and defaults are documented
in the uiServer row of the configuration table (single source of
truth, per AGENTS.md 'No duplication' rule).
- The PII reject-list and softSampleCap semantics are documented in
UIServerMetricsConfigurationSchema JSDoc.
- The HTTP-only constraint and the warning behavior are documented in
the AbstractUIServer.warnIfMisconfigured warning string itself.
Drop the section, the TOC entry, and the now-dangling cross-link from
the table bullet. Restructure the _metrics_ bullet to mirror the
_accessPolicy_ cadence in the same row (top-level intro + indented
sub-bullets per sub-key), so the table description is self-sufficient
and harmonized with its sibling.
* chore(gitignore): ignore .codegraph alongside .omo/
.codegraph is a symlink to .omo/codegraph/projects/<hash>/ used by the
oh-my-opencode tooling, on the same lifecycle as .omo/ and .sisyphus/.
Group it under the existing 'oh-my-opencode' section.
---------
Co-authored-by: autofix-ci[bot] <114827586+autofix-ci[bot]@users.noreply.github.com>