Multidimensional (Fleet) Metrics
Some metrics aren't a single line — they're a fleet. http.requests.per2m isn't one
series, it's one series per instance: mbl-web-01, mbl-web-02, … mbl-web-40. A
threshold that fits the whole fleet is wrong for every member, and maintaining forty
separate rules is impractical. Multidimensional metrics address this with two complementary fleet
analyses: peer deviation (is this member behaving unlike its peers right now?) and
baseline deviation (is this member behaving unlike its own history?). One host at 89%
CPU in a farm idling at ~42% is a peer outlier even if 89% is normal for that host on
other days — and a host quietly doubling its own usual level is a baseline outlier even
when the whole fleet drifts with it.
Single vs. multidimensional
A metric definition declares a metricMode:
- single: one series per metric key. The classic case.
- multidimensional: many series under one key, separated by label values. The distinguishing labels are the dimension keys (for example instance).
A multidimensional metric also declares its analysis contract:
- Analyses: which detectors run (peer_deviation, baseline_deviation, or both).
- Split dimension: the label key that identifies one member (for example host.id), defaulting to the first dimension key.
- Scope: an optional label filter that narrows the fleet (for example {"host.group": "mobil-web"}).
- A minimum peer count (don't bother comparing if there are too few members to form a meaningful peer group) and a max cardinality guard (protect against label explosions turning one metric into millions of series).
- Optional peer and baseline configuration documents tuning each detector's sensitivity.
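To make the contract concrete, here is a minimal Python sketch of how the two guards might gate the analyses. Every field name, default value, and the `runnable_analyses` helper is hypothetical; only the detector names `peer_deviation` and `baseline_deviation` come from the text above. It also assumes that exceeding the cardinality guard skips analysis entirely and that the minimum peer count only affects peer analysis, neither of which is stated here.

```python
from dataclasses import dataclass, field


@dataclass
class FleetContract:
    """Hypothetical shape of a multidimensional analysis contract."""
    analyses: tuple = ("peer_deviation", "baseline_deviation")
    split_dimension: str = "host.id"
    scope: dict = field(default_factory=dict)   # empty = whole fleet
    min_peer_count: int = 5                     # illustrative default
    max_cardinality: int = 10_000               # illustrative default


def runnable_analyses(contract: FleetContract, member_count: int) -> list:
    """Return the detectors that should run for a fleet of this size."""
    if member_count > contract.max_cardinality:
        # Assumed behaviour: a label explosion disables analysis.
        return []
    selected = []
    for analysis in contract.analyses:
        # Assumed behaviour: too few peers only silences peer analysis.
        if analysis == "peer_deviation" and member_count < contract.min_peer_count:
            continue
        selected.append(analysis)
    return selected
```

With these illustrative defaults, a three-member fleet would run only the baseline detector, and a fleet above the cardinality limit would run nothing.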
Peer analysis: the core idea
Instead of asking "is this instance above a threshold?", fleet analysis asks "is this instance behaving unlike its peers right now?"
For a given timestamp, Lumetry takes the current value of every dimension in the fleet and computes the peer median. Each member is then scored against that median using a robust z-score (a median-based measure that isn't dragged around by a few extreme outliers — exactly the situation you have when one instance is the problem). A member is flagged as deviating when its robust z-score crosses the configured bound.
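The scoring step can be sketched in a few lines of Python. This assumes the common median/MAD formulation of the robust z-score with the 0.6745 scale constant; the text does not say which exact formula or bound Lumetry uses, so the constant, the function names, and the `bound` parameter are illustrative.

```python
from statistics import median

# Assumption: the usual constant that makes the median absolute deviation
# (MAD) comparable to a standard deviation for normally distributed data.
MAD_SCALE = 0.6745


def robust_z_scores(values: dict) -> dict:
    """Score each fleet member against the peer median of one timestamp."""
    peer_median = median(values.values())
    mad = median(abs(v - peer_median) for v in values.values())
    if mad == 0:
        # No measurable spread; this sketch reports 0 instead of dividing by zero.
        return {member: 0.0 for member in values}
    return {
        member: MAD_SCALE * (v - peer_median) / mad
        for member, v in values.items()
    }


def flag_deviating(values: dict, bound: float) -> list:
    """Return members whose absolute robust z-score exceeds the bound."""
    scores = robust_z_scores(values)
    return sorted(m for m, z in scores.items() if abs(z) > bound)
```

Fed a fleet where four members sit between 41 and 43 and one sits at 89, the median stays at 42 and only the 89 member scores far from zero, which is the property that makes a median-based score suitable when a single instance is the problem.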
This catches the failure mode static thresholds miss: one node in a fleet quietly drifting away from the pack while still technically "within range."
Baseline analysis: judged against yourself
Peer analysis answers "unlike the others?"; baseline analysis answers "unlike yourself?". When enabled, Lumetry maintains a seasonal baseline per fleet member — the expected value and spread for that member at that minute of the week — and flags a member whose current value falls outside its own expected band. The two detectors cover each other's blind spots: a brand-new member has no history yet (peers catch it), and a fleet drifting together has no peer outlier (each member's own baseline catches it). Members with too little clean history simply stay silent until their baseline matures.
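A per-member seasonal baseline can be sketched as follows. This is an assumption-heavy illustration: it buckets history by minute of the week, uses mean and population standard deviation as "expected value and spread", and treats `MIN_SAMPLES` and `BAND_WIDTH` as hypothetical tuning values. The actual statistics and maturity rule are not specified in the text.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev
from typing import Optional

MIN_SAMPLES = 4    # hypothetical: history needed before the baseline speaks
BAND_WIDTH = 3.0   # hypothetical: band half-width in standard deviations


def minute_of_week(ts: datetime) -> int:
    """Map a timestamp to 0..10079, with Monday 00:00 as minute 0."""
    return ts.weekday() * 1440 + ts.hour * 60 + ts.minute


class MemberBaseline:
    """History of one fleet member, bucketed by minute of the week."""

    def __init__(self) -> None:
        self._samples = defaultdict(list)

    def observe(self, ts: datetime, value: float) -> None:
        self._samples[minute_of_week(ts)].append(value)

    def is_outlier(self, ts: datetime, value: float) -> Optional[bool]:
        history = self._samples.get(minute_of_week(ts), [])
        if len(history) < MIN_SAMPLES:
            return None  # immature baseline: stay silent
        expected = mean(history)
        spread = pstdev(history)
        return abs(value - expected) > BAND_WIDTH * spread
```

Returning `None` rather than `False` for an immature baseline keeps "no opinion yet" distinct from "within the expected band", which mirrors the silent-until-mature behaviour described above.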
Static thresholds per member
Ordinary static threshold rules also work on multidimensional metrics. The rule is written once, and Lumetry evaluates it against every fleet member independently: each host (or each disk mount, when the member identity combines host and mount) gets its own violation history, its own alert, and its own recovery. A level's sustain duration is honored per member. This is how the out-of-box host alerts — for example disk usage per directory mount — alert on exactly the member that crossed the line instead of blending a fleet into one signal.
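The per-member bookkeeping can be illustrated with a small Python sketch. It assumes one sample per member per minute and a simple "value above threshold" condition; the class name, the streak counter, and the reset-on-recovery behaviour are illustrative assumptions, not Lumetry's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PerMemberThreshold:
    """One rule definition, evaluated independently for every member."""
    threshold: float
    sustain_minutes: int
    # Consecutive violating minutes, keyed by member identity
    # (for example "srv-1" or a combined "srv-1:/var").
    _streak: dict = field(default_factory=dict)

    def evaluate(self, member: str, value: float) -> bool:
        """Feed one per-minute sample; return True while the member alerts."""
        if value > self.threshold:
            self._streak[member] = self._streak.get(member, 0) + 1
        else:
            # Recovery resets only this member's history.
            self._streak[member] = 0
        return self._streak[member] >= self.sustain_minutes
```

Because the streak is keyed by member, a disk mount that has been over the limit for three minutes alerts on its own, while a healthy mount on another host is unaffected.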
When you want a static rule to watch only part of the fleet, give it a scope — a label
filter on the rule itself. {"host.id": "srv-123"} targets a single server and
{"host.group": "prod-db"} a single group; an empty scope (the default) keeps watching every
member. The rule's scope is independent of the metric-level scope and only narrows which
members that one rule evaluates.
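As a rough illustration of how the two scopes compose, the Python sketch below treats a scope as a set of exact label matches that must all hold. The matching semantics (exact equality, all keys required) and both function names are assumptions; the text only states that a scope is a label filter and that an empty scope matches every member.

```python
def matches_scope(labels: dict, scope: dict) -> bool:
    """True when every key in the scope has the same value in the labels.

    An empty scope matches every member.
    """
    return all(labels.get(key) == value for key, value in scope.items())


def rule_applies(labels: dict, metric_scope: dict, rule_scope: dict) -> bool:
    """A member is evaluated only if it is in the fleet AND in the rule's scope."""
    return matches_scope(labels, metric_scope) and matches_scope(labels, rule_scope)
```

In this sketch the rule scope can only narrow the set the metric-level scope already admits; it never brings a member back in.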
The rule can also override its grouping with Split By. Selecting labels keeps one series per selected label combination. Leaving Split By empty merges the scoped values into one minute-averaged fleet series. Existing rules that do not carry an override keep the metric definition's default member identity. Host series are shown by hostname in the Rule Builder, while Lumetry keeps a stable host identifier internally so renames do not break rule history.
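The grouping behaviour of Split By can be sketched in Python as below. The sample shape `(minute, labels, value)` and the function name are hypothetical, and averaging values that share a label combination within a minute is an assumption for the non-empty case; the text only states minute-averaging for the merged fleet series.

```python
from collections import defaultdict
from statistics import mean


def split_series(samples: list, split_by: list) -> dict:
    """Group (minute, labels, value) samples into one series per label combination.

    An empty split_by yields a single series under the key ().
    """
    buckets = defaultdict(list)
    for minute, labels, value in samples:
        key = tuple(labels.get(name) for name in split_by)
        buckets[(key, minute)].append(value)

    series = defaultdict(dict)
    for (key, minute), values in buckets.items():
        series[key][minute] = mean(values)  # per-minute average
    return dict(series)
```

Calling it with `split_by=["host.id"]` keeps one series per host, while `split_by=[]` collapses the same samples into one averaged fleet series.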
The fleet view
For a multidimensional metric, Lumetry produces a fleet view summarizing the whole group at a point in time:
| Field | Meaning |
|---|---|
| metricKey, displayName, unit | Identity of the fleet metric. |
| dimensionCount | How many members (series) are in the fleet. |
| peerMedian | The median value across all members. |
| min, max | The spread across the fleet. |
| deviatingCount | How many members are currently flagged as deviating. |
| dimensions[] | Per-member detail (see below). |
Each entry in dimensions[] describes one member:
| Field | Meaning |
|---|---|
| dimensionKey | The member's identity within the fleet. |
| labels | The label set that distinguishes it (e.g. { "host.id": "mbl-web-07" }). |
| value | The member's current value. |
| peerMedian | The fleet median it is compared against. |
| deviationPercent | How far it sits from the median, in percent. |
| robustZScore | The robust deviation score. |
| isDeviating | Whether it is currently flagged. |
| severity, direction | Severity of the deviation and whether it is above/below peers. |
The fleet view is what turns "the mbl-web fleet looks fine on average" into "38 nodes are normal, 2 are running hot — here they are."
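A client consuming the fleet view might produce that summary with something like the Python sketch below. It assumes the view has already been fetched and parsed into a dictionary using the field names from the tables above; the function name and the output wording are illustrative, and no endpoint or transport is implied.

```python
def summarize_fleet(view: dict) -> str:
    """Condense a parsed fleet view into a one-line operational summary."""
    deviating = [d for d in view["dimensions"] if d["isDeviating"]]
    # Most extreme members first, regardless of direction.
    deviating.sort(key=lambda d: abs(d["robustZScore"]), reverse=True)
    normal = view["dimensionCount"] - len(deviating)
    names = ", ".join(d["dimensionKey"] for d in deviating)
    return f"{normal} normal, {len(deviating)} deviating: {names}"
```

Sorting by the absolute robust z-score puts the member furthest from its peers at the front of the list, whichever side of the median it is on.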
Why this matters operationally
- One metric, not forty rules. You define and reason about the fleet once.
- Outlier detection, not averages. Averages hide the one bad node; peer analysis surfaces it.
- Self-calibrating. As the fleet's normal level drifts, the peer median drifts with it — there is no fixed threshold to keep re-tuning.
- Two questions, one definition. "Unlike its peers" and "unlike its own history" are both asked of the same data, and both kinds of finding collapse into one operational incident instead of an alert storm.
Related API
- Metrics & Catalog API — the fleet view endpoint and the multidimensional fields on a metric definition.