Dynamic Thresholds & Baselines

Static thresholds are simple but they lie: a value that is perfectly healthy at 3pm can be an emergency at 3am, and vice versa. Dynamic thresholds solve this by comparing each point against what is normal for that time, learned from history. This page explains the baseline Lumetry learns and the three dynamic-threshold modes that turn it into limits.

The seasonal baseline

Most operational metrics are seasonal: they follow daily and weekly patterns — business-hours ramps, overnight quiet, weekday/weekend differences. A single average across all of history would be useless for such a metric.

Lumetry models this with a minute-of-week baseline. The week is divided into buckets keyed by minute of the week (so 09:00 on Monday is a different bucket from 09:00 on Sunday), and for each bucket Lumetry maintains running statistics:

  • sample count
  • sum and sum of squares (which, together with the sample count, yield the mean and standard deviation)
  • historical min and max

From these, each bucket produces an expected value (the running mean), a standard deviation, and a historical min/max envelope. These are the raw materials every dynamic mode draws on.
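
As an illustration, the bucket key and running statistics could be sketched as follows. This is a minimal sketch, not Lumetry's implementation: the one-minute bucket width, the Monday 00:00 week start, the use of population standard deviation, and the names `minute_of_week` and `BucketStats` are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from datetime import datetime


def minute_of_week(ts: datetime) -> int:
    """Bucket key in [0, 10080): minutes since Monday 00:00 (assumed week start)."""
    return ts.weekday() * 1440 + ts.hour * 60 + ts.minute


@dataclass
class BucketStats:
    """Running statistics for one bucket (hypothetical structure)."""
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value: float) -> None:
        """Fold one sample into the running statistics."""
        self.count += 1
        self.total += value
        self.total_sq += value * value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def stddev(self) -> float:
        # Population standard deviation from count, sum, and sum of squares.
        variance = self.total_sq / self.count - self.mean ** 2
        return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding error


monday = minute_of_week(datetime(2024, 1, 1, 9, 0))  # 540
sunday = minute_of_week(datetime(2024, 1, 7, 9, 0))  # 9180

stats = BucketStats()
for v in (100.0, 110.0, 90.0):
    stats.add(v)
```

Here `stats.mean` is the expected value for the bucket, and `stats.minimum` / `stats.maximum` form the envelope. Storing only the count and two sums keeps each bucket constant-size regardless of how much history it has absorbed.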

How the baseline is maintained

  • Baselines are computed incrementally. After an initial bootstrap from a lookback window, later cycles read only the new history since the last computation and fold it into the running statistics — so the cost is proportional to new data, not to all of history.
  • A baseline profile holds the reusable settings (lookback window, minimum points) and can be shared across rules.
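
The incremental step described above could look roughly like this. It is a sketch under assumptions: `fold_new_points`, the `[count, sum, sum_of_squares]` list layout, and the timestamp watermark are hypothetical, and a real system would most likely query its store for only the new range rather than filter in memory as done here.

```python
from datetime import datetime


def fold_new_points(buckets, points, last_computed):
    """Fold only points newer than `last_computed` into per-bucket running sums.

    buckets: dict mapping bucket key -> [count, sum, sum_of_squares]
    points:  iterable of (timestamp, value) pairs
    Returns the new watermark (latest timestamp folded in).
    """
    watermark = last_computed
    for ts, value in points:
        if ts <= last_computed:
            continue  # already folded in by an earlier cycle
        # Minute-of-week key, assuming Monday 00:00 is minute 0.
        key = ts.weekday() * 1440 + ts.hour * 60 + ts.minute
        acc = buckets.setdefault(key, [0, 0.0, 0.0])
        acc[0] += 1
        acc[1] += value
        acc[2] += value * value
        watermark = max(watermark, ts)
    return watermark


buckets = {}
history = [
    (datetime(2024, 1, 1, 9, 0), 100.0),  # a Monday, 09:00
    (datetime(2024, 1, 8, 9, 0), 120.0),  # the following Monday, 09:00
]
mark = fold_new_points(buckets, history[:1], datetime.min)  # bootstrap
mark = fold_new_points(buckets, history, mark)              # only the second point is new
```

After both cycles the Monday 09:00 bucket holds two samples, and the first point has been counted exactly once even though it was presented twice.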

Contamination protection

A baseline must learn normal, so it must not learn from abnormal. Lumetry deliberately excludes known-bad periods from baseline math:

  • Incident windows — operator-declared periods of known abnormality (maintenance, a past outage) are excluded.
  • Prior violations — points that already breached are not folded back into the expectation.

If, after exclusions, there are not enough clean samples to compute a trustworthy baseline, Lumetry keeps the previous baseline and marks it stale rather than overwriting it with thin or noisy data. A dynamic rule therefore degrades gracefully instead of suddenly mis-firing. Each violation also carries a baseline-confidence indicator so low-confidence detections are visible downstream.
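
The exclusion and stale-fallback behavior can be sketched like this. The `Baseline` class, the `recompute` function, and the integer timestamps are hypothetical names and simplifications chosen for the example; `min_points` stands in for the "minimum points" setting of a baseline profile. For brevity the sketch recomputes a mean from scratch instead of folding into running statistics.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    mean: float
    sample_count: int
    stale: bool = False


def recompute(previous, points, incident_windows, violation_times, min_points):
    """Recompute a baseline from clean samples only.

    points:           list of (t, value) pairs
    incident_windows: list of (start, end) pairs, inclusive
    violation_times:  set of timestamps that already breached
    """
    clean = [
        value
        for t, value in points
        if t not in violation_times
        and not any(start <= t <= end for start, end in incident_windows)
    ]
    if len(clean) < min_points:
        # Too little clean data: keep the old numbers and flag them as stale.
        return Baseline(previous.mean, previous.sample_count, stale=True)
    return Baseline(sum(clean) / len(clean), len(clean))


previous = Baseline(mean=98.0, sample_count=50)
points = [(1, 101.0), (2, 99.0), (3, 500.0), (4, 480.0), (5, 100.0)]

# t=3 falls in an incident window and t=4 was a prior violation, so both are dropped.
fresh = recompute(previous, points, [(3, 3)], {4}, min_points=3)
kept = recompute(previous, points, [(3, 3)], {4}, min_points=10)
```

With `min_points=3` the three clean samples produce a new mean of 100.0, untouched by the 500 and 480 spikes. With `min_points=10` the previous baseline is returned unchanged except for `stale=True`.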

The three dynamic-threshold modes

A dynamic rule selects one threshold mode. All three derive a band from the same baseline; they differ in how the band is shaped. The modes are mutually exclusive — only the parameter relevant to the chosen mode is used.

Percentage

The band is a percentage tolerance around the expected value.

upper = expected × (1 + tolerancePercent/100)
lower = expected × (1 − tolerancePercent/100)

Use it when you think in relative terms — "alert if traffic is more than 30% off its usual level for this time of week." Driven by tolerancePercent. Ignores stddevMultiplier.
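
A direct transcription of the two formulas, with the 30% example worked through. The function name `percentage_band` and the expected value of 2000 are illustrative assumptions.

```python
def percentage_band(expected: float, tolerance_percent: float) -> tuple[float, float]:
    """Return (lower, upper) using the Percentage mode formulas."""
    fraction = tolerance_percent / 100
    return expected * (1 - fraction), expected * (1 + fraction)


# If the bucket's expected value is 2000 requests/min and tolerancePercent is 30,
# the band runs from about 1400 to about 2600.
lower, upper = percentage_band(expected=2000.0, tolerance_percent=30.0)
```

Because the band scales with the expected value, the same 30% setting yields a narrow absolute band in quiet buckets and a wide one in busy buckets.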

Stddev

The band is a multiple of the bucket's standard deviation around the expected value.

upper = expected + (stddevMultiplier × stddev)
lower = expected − (stddevMultiplier × stddev)

Use it when you want the band to adapt to how volatile the metric naturally is — a noisy metric gets a wider band, a steady one a tighter band, automatically. Driven by stddevMultiplier. Ignores tolerancePercent.
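
The following sketch applies the formulas to two buckets that share an expected value but differ in volatility. The function name `stddev_band` and the sample numbers are assumptions made for illustration.

```python
def stddev_band(expected: float, stddev: float, stddev_multiplier: float) -> tuple[float, float]:
    """Return (lower, upper) using the Stddev mode formulas."""
    spread = stddev_multiplier * stddev
    return expected - spread, expected + spread


# Same expected value and multiplier, different historical volatility.
steady = stddev_band(expected=200.0, stddev=5.0, stddev_multiplier=3.0)   # (185.0, 215.0)
noisy = stddev_band(expected=200.0, stddev=40.0, stddev_multiplier=3.0)   # (80.0, 320.0)
```

The multiplier is identical in both calls; only the bucket's own standard deviation changes the width of the band.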

Envelope

The band is a tolerance around the historical min/max seen for that bucket.

upper = historicalMax × (1 + tolerancePercent/100)
lower = historicalMin × (1 − tolerancePercent/100)

Use it when you care about breaking out of the historically observed range rather than deviating from the mean. This suits metrics with a wide-but-bounded normal range. Driven by tolerancePercent. Ignores stddevMultiplier.
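
As a sketch, the envelope formulas translate to the function below. The name `envelope_band` and the sample values are illustrative assumptions.

```python
def envelope_band(historical_min: float, historical_max: float,
                  tolerance_percent: float) -> tuple[float, float]:
    """Return (lower, upper) using the Envelope mode formulas."""
    fraction = tolerance_percent / 100
    return historical_min * (1 - fraction), historical_max * (1 + fraction)


# A bucket that has historically ranged from 40 to 900, with tolerancePercent = 10,
# gets a band from about 36 to about 990.
lower, upper = envelope_band(historical_min=40.0, historical_max=900.0,
                             tolerance_percent=10.0)
```

Note that the mean plays no part here: a value of 850 sits well inside the band even if the bucket's expected value is 200. The arithmetic as written widens the band only for non-negative values; how negative-valued metrics are handled is not covered on this page.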

Choosing a mode

You want to detect…                                                   | Mode
----------------------------------------------------------------------|-----------
Relative deviation from the usual level                               | Percentage
Statistically unusual behavior, auto-scaled to the metric's own noise | Stddev
Breaking out of the historically observed range                       | Envelope

Direction

Every dynamic (and static) rule has a direction: Above, Below, or Both. A latency metric usually only cares about Above; a throughput or success-count metric often cares about Below (something stopped); Both catches deviation in either direction.
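
How direction filters a band check can be sketched as follows. The `Direction` enum, the `is_violation` function, and the choice of strict comparisons (a value exactly on a limit does not breach) are assumptions for the example, not documented Lumetry behavior.

```python
from enum import Enum


class Direction(Enum):
    ABOVE = "Above"
    BELOW = "Below"
    BOTH = "Both"


def is_violation(value: float, lower: float, upper: float, direction: Direction) -> bool:
    """Check a value against a band, honoring the rule's direction."""
    too_high = value > upper
    too_low = value < lower
    if direction is Direction.ABOVE:
        return too_high
    if direction is Direction.BELOW:
        return too_low
    return too_high or too_low


# A throughput reading of 50 against a band of 80 to 320:
ignored = is_violation(50.0, 80.0, 320.0, Direction.ABOVE)  # False
caught = is_violation(50.0, 80.0, 320.0, Direction.BELOW)   # True
either = is_violation(50.0, 80.0, 320.0, Direction.BOTH)    # True
```

The band itself is the same in all three calls; direction only decides which side of it counts as a breach.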

Static thresholds

Static rules skip the baseline entirely and compare against fixed staticUpper / staticLower limits. Alert levels can still define stricter per-level thresholds for Warning/Critical escalation. Static rules remain the right tool when a hard SLA or physical limit exists ("disk must never exceed 95%"), where "normal for the time of week" is irrelevant.

A healthy deployment typically mixes both: static rules for hard limits and SLAs, dynamic rules for everything whose normal range moves with time.

  • Rules API — including the preview endpoint for tuning a baseline band against real history.