Violations, Alerts & Incidents
Lumetry deliberately separates detection into three layers. Each layer answers a different question for a different audience, and the separation is what lets the platform turn a flood of raw breaches into a short, contextualized list an on-call engineer can actually act on. This is the core of the product.
    many                      fewer                         fewest
┌──────────┐   trigger    ┌───────────┐   correlation   ┌──────────┐
│VIOLATIONS│──condition──▶│  ALERTS   │──by service/CI─▶│INCIDENTS │
│ (points) │              │(lifecycle)│ + time window   │ (grouped)│
└──────────┘              └───────────┘                 └──────────┘
  forensic               on-call reacts               incident response
Layer 1 — Violations (raw signal)
A violation is a single point-level breach of a rule's threshold. It is forensic evidence, not a call to action.
A violation records: the rule and metric, the timestamp, the actual value, the expected value and the lower/upper thresholds in force at that moment, the direction of the breach, an optional severity and alert-level reference, a deviation percent, and a baseline-confidence indicator. For multidimensional metrics, it also records the affected dimension key, such as a host/mount member.
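To make the shape of that record concrete, here is a minimal sketch of a violation as a Python dataclass. Every field name is a hypothetical stand-in rather than Lumetry's actual schema, the baseline-confidence indicator is left out for brevity, and the deviation formula (signed distance from the expected value, in percent) is an assumption, since the source does not define it.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Violation:
    """One point-level breach. Field names are illustrative assumptions."""
    rule_id: str
    metric_id: str
    timestamp: datetime
    actual: float
    expected: float
    lower: float  # lower threshold in force at this timestamp
    upper: float  # upper threshold in force at this timestamp
    severity: Optional[str] = None
    dimension_key: Optional[str] = None  # e.g. a host/mount member

    @property
    def direction(self) -> str:
        # Which threshold the point crossed.
        if self.actual > self.upper:
            return "above"
        if self.actual < self.lower:
            return "below"
        return "none"

    @property
    def deviation_percent(self) -> Optional[float]:
        # Assumed definition: signed distance from the expected value, in percent.
        if self.expected == 0:
            return None  # undefined under this definition
        return (self.actual - self.expected) / abs(self.expected) * 100.0


example = Violation(
    rule_id="rule-1",
    metric_id="disk.used_pct",
    timestamp=datetime(2024, 1, 1, tzinfo=timezone.utc),
    actual=96.0,
    expected=80.0,
    lower=60.0,
    upper=90.0,
    dimension_key="host01:/var",
)
print(example.direction, example.deviation_percent)
```

Storing the thresholds and expected value on the record itself is what lets a later investigation see the expectation that applied at that moment, even if the baseline has since moved.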
You use violations to investigate: to see exactly which points breached, by how much, and against what expectation. There can be thousands of them. Nobody is paged for a violation.
Layer 2 — Alerts (operational lifecycle)
An alert is what an operator reacts to. Lumetry can open it when violations cross a rule's trigger condition, when a fleet detector finds a deviating member, or when an external monitoring source sends a firing transition. Metric alerts aggregate underlying violations; external alerts preserve the provider occurrence and topology mapping.
An alert has a lifecycle exposed as operationalStatus:
| Status | Meaning |
|---|---|
| Open | The alert is live; the condition is currently met. |
| Acknowledged | An operator has taken ownership but it is not yet resolved. |
| Closed | The metric recovered for the recovery window, or an operator closed it. |
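The lifecycle can be read as a small state machine. The sketch below is an assumption about which transitions are allowed: the source describes what each status means, not the full transition table, so treating Closed as terminal and Acknowledged as one-way is a guess made for illustration.

```python
from enum import Enum


class OperationalStatus(Enum):
    OPEN = "Open"
    ACKNOWLEDGED = "Acknowledged"
    CLOSED = "Closed"


# Assumed transition table (not confirmed by the product documentation).
ALLOWED_TRANSITIONS = {
    OperationalStatus.OPEN: {OperationalStatus.ACKNOWLEDGED, OperationalStatus.CLOSED},
    OperationalStatus.ACKNOWLEDGED: {OperationalStatus.CLOSED},
    OperationalStatus.CLOSED: set(),
}


def transition(current: OperationalStatus, target: OperationalStatus) -> OperationalStatus:
    """Return the new status, or raise if the move is not in the assumed table."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


status = transition(OperationalStatus.OPEN, OperationalStatus.ACKNOWLEDGED)
status = transition(status, OperationalStatus.CLOSED)
print(status.value)
```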
An alert carries the context an operator needs immediately: its name, severity, a trigger reason ("why did this open?"), the violation count it represents, its start/end time and duration, and — crucially — the related service and related CI resolved from topology. That last part is what turns "metric X is red" into "the Payments service, on host ANVMBWEBWX01, is degraded."
For single-metric rules, open metric alerts are guarded so that a given rule/metric/severity combination has at most one open alert at a time. For multidimensional metrics, each affected dimension can have its own open alert, so several mounts or interfaces can appear as separate alert rows while still being correlated into one incident. Recovery closes the specific alert once that metric or dimension has been clean for the recovery window. External alerts are closed by the source's resolved transition and cannot be closed manually in Lumetry. Open and close transitions are delivered to notification integrations: generic webhooks plus native Slack, Microsoft Teams, PagerDuty, and ServiceNow channels. See Topology & CMDB for context resolution and Alerting API for routing and channel details.
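One way to picture the open-alert guard is as a lookup keyed by rule, metric, severity, and (for multidimensional metrics) the dimension. The class and method names below are hypothetical, and the in-memory dictionary is only a stand-in for whatever storage Lumetry actually uses.

```python
from typing import Dict, Optional, Tuple

# (rule, metric, severity, dimension); dimension is None for single-metric rules.
GuardKey = Tuple[str, str, str, Optional[str]]


class OpenAlertGuard:
    """Hypothetical sketch: at most one open alert per guard key."""

    def __init__(self) -> None:
        self._open: Dict[GuardKey, str] = {}
        self._next_id = 1

    def open_alert(self, rule: str, metric: str, severity: str,
                   dimension: Optional[str] = None) -> Tuple[str, bool]:
        """Return (alert_id, created). An existing open alert is reused."""
        key = (rule, metric, severity, dimension)
        existing = self._open.get(key)
        if existing is not None:
            return existing, False
        alert_id = f"alert-{self._next_id}"
        self._next_id += 1
        self._open[key] = alert_id
        return alert_id, True

    def close_alert(self, rule: str, metric: str, severity: str,
                    dimension: Optional[str] = None) -> Optional[str]:
        """Close only the alert for this specific metric or dimension."""
        return self._open.pop((rule, metric, severity, dimension), None)


guard = OpenAlertGuard()
a1 = guard.open_alert("disk-rule", "disk.used_pct", "Critical", "host01:/var")
a2 = guard.open_alert("disk-rule", "disk.used_pct", "Critical", "host01:/var")
a3 = guard.open_alert("disk-rule", "disk.used_pct", "Critical", "host01:/home")
print(a1, a2, a3)
```

In this sketch the second call reuses the first alert, while the `/home` mount gets its own row, mirroring the per-dimension behavior described above.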
Alerts are not incident aggregates. An alert is still about a single rule/metric. To see the bigger picture across many alerts, you look at incidents.
Layer 3 — Incidents (correlated picture)
An incident is a correlated group of alerts that represent one operational problem. When several alerts fire close together on the same service, you don't want five pages and five war rooms — you want one incident.
When a new alert opens, Lumetry correlates it into an incident by:
- same workspace, and
- same service / CI (falling back to the same metric when topology context is absent), and
- same severity, and
- within a configured correlation time window.
If a matching open incident exists, the alert joins it; otherwise a new incident is opened. An incident therefore groups related alerts rather than restating them.
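The correlation rules above can be sketched as a single matching function. This is an illustration under stated assumptions, not Lumetry's implementation: the class and field names are invented, "same service / CI" is read as both values matching, any non-resolved incident counts as open, and the time window is measured from the most recent alert that joined the incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


@dataclass
class Alert:
    workspace: str
    severity: str
    metric_id: str
    opened_at: datetime
    service: Optional[str] = None
    ci: Optional[str] = None

    def scope(self) -> Tuple[str, ...]:
        # Fall back to the metric when topology context is absent.
        if self.service is None and self.ci is None:
            return ("metric", self.metric_id)
        return ("topology", self.service or "", self.ci or "")


@dataclass
class Incident:
    key: str
    workspace: str
    severity: str
    scope: Tuple[str, ...]
    last_alert_at: datetime
    status: str = "Open"
    alerts: List[Alert] = field(default_factory=list)


def correlate(alert: Alert, incidents: List[Incident], window: timedelta) -> Incident:
    """Join a matching open incident, or open a new one."""
    for incident in incidents:
        if (
            incident.status != "Resolved"
            and incident.workspace == alert.workspace
            and incident.severity == alert.severity
            and incident.scope == alert.scope()
            and abs(alert.opened_at - incident.last_alert_at) <= window
        ):
            incident.alerts.append(alert)
            incident.last_alert_at = max(incident.last_alert_at, alert.opened_at)
            return incident
    created = Incident(
        key=f"INC-{len(incidents) + 1:06d}",
        workspace=alert.workspace,
        severity=alert.severity,
        scope=alert.scope(),
        last_alert_at=alert.opened_at,
        alerts=[alert],
    )
    incidents.append(created)
    return created


t0 = datetime(2024, 1, 1, 12, 0)
incidents: List[Incident] = []
window = timedelta(minutes=10)
first = correlate(Alert("ws1", "Critical", "cpu", t0, "Payments", "web01"), incidents, window)
second = correlate(Alert("ws1", "Critical", "mem", t0 + timedelta(minutes=5), "Payments", "web01"), incidents, window)
third = correlate(Alert("ws1", "Critical", "cpu", t0 + timedelta(hours=1), "Payments", "web01"), incidents, window)
print(first.key, second.key, third.key)
```

The first two alerts land in the same incident because they share workspace, service/CI, and severity and arrive five minutes apart; the third arrives outside the ten-minute window and opens a new incident.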
An incident exposes:
| Field | Meaning |
|---|---|
| incidentKey | Stable, human-readable identifier (e.g. INC-000001). |
| primaryServiceName / primaryCiName / primaryMetricId | What the incident is centered on. |
| severity, status | Open → Acknowledged → Resolved. |
| startTime, endTime, durationSeconds | When it began, ended, and how long it ran. |
| alertCount, activeAlertCount, violationCount | How many alerts it spans, how many linked alerts are still active, and how many underlying violations it spans. |
Resolving an incident records an operator note and a structured resolution classification such as Resolved, NoImpact, FalsePositive, Maintenance, or AcceptedRisk. If linked alerts are still active, Lumetry leaves those alerts active and records how many were active at the moment of resolution. Incident resolution closes the coordination record; it does not claim that every source alert has recovered.
Those still-active alerts are marked for post-resolution watch until they close, so they remain visible without immediately triggering a new incident. If the condition persists, policy can reopen the resolved incident within a configured horizon or create a follow-up incident after a longer persistence threshold.
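The resolution behavior described above can be sketched as follows. The class, field, and function names are hypothetical; only the classification values come from the text. The reopen and follow-up policy is deliberately not modeled, because the source does not specify its thresholds.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ResolutionClass(Enum):
    RESOLVED = "Resolved"
    NO_IMPACT = "NoImpact"
    FALSE_POSITIVE = "FalsePositive"
    MAINTENANCE = "Maintenance"
    ACCEPTED_RISK = "AcceptedRisk"


@dataclass
class LinkedAlert:
    name: str
    status: str = "Open"  # Open | Acknowledged | Closed
    post_resolution_watch: bool = False


@dataclass
class Incident:
    key: str
    alerts: List[LinkedAlert] = field(default_factory=list)
    status: str = "Open"
    resolution_note: Optional[str] = None
    resolution_class: Optional[ResolutionClass] = None
    active_alerts_at_resolution: int = 0


def resolve(incident: Incident, note: str, resolution_class: ResolutionClass) -> Incident:
    """Close the coordination record without touching alert status."""
    still_active = [a for a in incident.alerts if a.status != "Closed"]
    incident.status = "Resolved"
    incident.resolution_note = note
    incident.resolution_class = resolution_class
    incident.active_alerts_at_resolution = len(still_active)
    for alert in still_active:
        alert.post_resolution_watch = True  # alert status is deliberately left as-is
    return incident


incident = Incident(
    key="INC-000001",
    alerts=[LinkedAlert("cpu high"), LinkedAlert("mem high", status="Closed")],
)
resolve(incident, "Scaled out the web tier.", ResolutionClass.RESOLVED)
print(incident.status, incident.active_alerts_at_resolution)
```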
Maintenance windows are operational suppression controls, not closure controls. An alert covered by maintenance remains an alert, but incident generation and notifications are suppressed and the alert is marked as maintenance-suppressed. Maintenance does not automatically resolve incidents or close alerts.
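As a rough illustration of suppression versus closure, the sketch below decides what happens when an alert opens during a maintenance window. All names are hypothetical, and scoping a window to a single service is an assumption; the source does not say how windows are matched to alerts.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class MaintenanceWindow:
    service: str  # assumed scoping: one service per window
    start: datetime
    end: datetime

    def covers(self, service: str, at: datetime) -> bool:
        return self.service == service and self.start <= at < self.end


def plan_alert_handling(service: str, opened_at: datetime,
                        windows: List[MaintenanceWindow]) -> Dict[str, bool]:
    """The alert is always recorded; only downstream actions are suppressed."""
    suppressed = any(w.covers(service, opened_at) for w in windows)
    return {
        "alert_recorded": True,
        "maintenance_suppressed": suppressed,
        "generate_incident": not suppressed,
        "send_notifications": not suppressed,
    }


windows = [MaintenanceWindow("Payments", datetime(2024, 1, 1, 1, 0), datetime(2024, 1, 1, 3, 0))]
during = plan_alert_handling("Payments", datetime(2024, 1, 1, 2, 0), windows)
after = plan_alert_handling("Payments", datetime(2024, 1, 1, 4, 0), windows)
print(during)
print(after)
```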
Problem candidates are deterministic follow-up hints for recurring or persistent patterns such as repeated non-impact resolutions, reopened incidents, or long-running post-resolution watches. They support problem-management review without changing the incident lifecycle.
The Operations Control page lets operators create maintenance windows and review current problem candidates from the same operational workspace.
An incident also has a timeline of lifecycle and correlation events (alert joined, acknowledged, resolved, …), plus the ability to enumerate its constituent alerts, violations, and affected metrics. This is the object an incident-response process hangs off: one record, full context, full history.
How the three layers map to the questions you ask
| Question | Layer |
|---|---|
| "Exactly which points breached, and by how much?" | Violation |
| "Is something wrong right now that I should act on, and what does it affect?" | Alert |
| "What is the overall problem, across everything that's firing?" | Incident |
Operational time ranges
Lumetry's operational views support relative ranges such as the last 30 minutes, 1 hour, 2 hours, 6 hours, 24 hours, 3 days, and 7 days, plus custom absolute ranges. Relative ranges are evaluated from the current time each time the view is loaded or refreshed. Custom ranges stay fixed until you edit them.
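The difference between relative and custom ranges can be shown with a short sketch. The range labels and the `resolve_range` helper are hypothetical; the point is only that a relative selection is re-anchored to the current time on every load, while an absolute one is returned unchanged.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple, Union

# Hypothetical labels for the relative ranges listed above.
RELATIVE_RANGES = {
    "30m": timedelta(minutes=30),
    "1h": timedelta(hours=1),
    "2h": timedelta(hours=2),
    "6h": timedelta(hours=6),
    "24h": timedelta(hours=24),
    "3d": timedelta(days=3),
    "7d": timedelta(days=7),
}

AbsoluteRange = Tuple[datetime, datetime]


def resolve_range(selection: Union[str, AbsoluteRange], now: datetime) -> AbsoluteRange:
    """Relative selections slide with `now`; absolute ones stay fixed."""
    if isinstance(selection, str):
        return now - RELATIVE_RANGES[selection], now
    return selection


load_1 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
load_2 = load_1 + timedelta(minutes=5)  # the view is refreshed five minutes later
custom = (
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 1, 2, tzinfo=timezone.utc),
)
print(resolve_range("30m", load_1))
print(resolve_range("30m", load_2))
print(resolve_range(custom, load_2) == custom)
```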
The overview uses the selected range for activity and recent-history charts, while open alert and open incident counts represent currently open work rather than only records that started inside the selected range.
Incident windows ≠ incidents
A separate concept, the incident window, is an operator-declared period of known abnormality (for example, planned maintenance or a past outage) used to exclude that period from baseline learning so dynamic thresholds aren't poisoned by it. Despite the name, it is not part of the operational incident model above; it is a baseline-hygiene tool. See Dynamic Thresholds & Baselines.
This is distinct from an operational maintenance window, which suppresses incident generation and notifications for active alerts during planned work.
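To see why excluding an incident window matters for baseline learning, consider the sketch below. It uses a plain mean as a stand-in for whatever baseline model Lumetry actually fits, and the function and variable names are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import List, Tuple

Point = Tuple[datetime, float]
Window = Tuple[datetime, datetime]


def training_points(points: List[Point], incident_windows: List[Window]) -> List[Point]:
    """Drop points that fall inside any declared incident window."""
    return [
        (ts, value)
        for ts, value in points
        if not any(start <= ts < end for start, end in incident_windows)
    ]


t0 = datetime(2024, 1, 1)
points = [(t0 + timedelta(hours=h), 50.0) for h in range(6)]
points[3] = (points[3][0], 500.0)  # a spike during a known outage
outage = (t0 + timedelta(hours=3), t0 + timedelta(hours=4))

clean = training_points(points, [outage])
print(mean(v for _, v in points), mean(v for _, v in clean))
```

With the outage hour included, the single spike drags the mean far above the normal level; with the window excluded, the learned value reflects normal behavior only.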