Recurring failures (clusters)

Every CI failure that happens more than once gets clustered automatically. See it in the dashboard, query it via the API, and let cost-saver mode skip duplicate LLM calls.

Exlogare automatically groups failures that look the same into failure clusters. Each cluster represents one recurring issue across your team’s CI, with a count, first/last seen timestamps, and a status you can drive from the dashboard or the API.

Clusters are available on every plan. Cost-saver (which reuses the previous analysis instead of re-invoking the LLM on a duplicate) is on by default.

Why clustering matters

CI failures usually arrive in bursts: one flaky test fails on five runs in a row, one broken Docker image breaks every build, one bad migration blocks the team. Without clustering, the dashboard shows you five rows that all say the same thing. With clustering, you see them grouped into a single recurring issue with a “×5” badge.

Three direct benefits:

Less noise. The Analyses page shows a small ×N chip next to the root cause when a failure has happened more than once. Acknowledged clusters hide the chip so the team isn’t spammed during a known incident.
Regression detection. When a cluster you marked resolved recurs, we automatically flip it back to active — so a regression isn’t silently buried in your “fixed” list.
Cost-saver mode. If the same failure recurs within 6 hours, the next ingest reuses the previous analysis instead of paying for a fresh LLM call. The reused row appears on the Usage page as non-billable telemetry.

How clustering works

Every successful analysis writes a row to the cluster table keyed by two related hashes:

lookup_hash — computed before the LLM call. sha1(version | provider | service_tag | normalised_log_tail). Cost-saver uses this to ask “have I just seen this exact log shape on this CI provider for this underlying service?” — and if so, reuses the prior analysis.
fingerprint_hash — the cluster row’s identity, computed after the LLM. sha1(lookup_hash | severity). Each cluster is unique on (tenant, fingerprint_hash).

The hash is shape-aware: timestamps, UUIDs, build numbers, hex blobs, and temp paths are masked out before hashing, so “build #4321 timed out” and “build #4322 timed out” cluster together but “OOM in pytest” and “connection refused” do not.
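The two hashes can be sketched in a few lines of Python. This is a minimal illustration, not Exlogare's implementation: the `HASH_VERSION` value, the literal `|` join, and the UTF-8 encoding are assumptions, and the log tail is taken to be already normalised.

```python
import hashlib

HASH_VERSION = "v1"  # assumed value; the real version string is not documented here


def sha1_hex(*parts: str) -> str:
    """SHA-1 hex digest of the parts joined with '|' (join format is an assumption)."""
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


def lookup_hash(provider: str, service_tag: str, normalised_log_tail: str) -> str:
    # Computed before the LLM call, so it cannot depend on severity.
    return sha1_hex(HASH_VERSION, provider, service_tag, normalised_log_tail)


def fingerprint_hash(lookup: str, severity: str) -> str:
    # Computed after the LLM call; this is the cluster row's identity.
    return sha1_hex(lookup, severity)


# Same provider, service and log shape: identical lookup_hash.
same_a = lookup_hash("jenkins", "postgres", "build #<N> timed out")
same_b = lookup_hash("jenkins", "postgres", "build #<N> timed out")
# Same log shape from a different CI provider: different lookup_hash.
other = lookup_hash("circleci", "postgres", "build #<N> timed out")
```

Because severity only enters at the second step, one `lookup_hash` can map to several clusters, one per severity the LLM has assigned.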

Each ingest:

Cleans the log, derives the CI provider and a service_tag, computes lookup_hash.
Looks up the most recent cluster matching lookup_hash for this tenant.
If a cluster exists with count > 1 and last_seen_at within COST_SAVER_TTL_HOURS (default 6h), reuses the prior analysis (severity and the entire RCA come from it).
Otherwise runs the LLM, persists the analysis, computes fingerprint_hash = sha1(lookup_hash | severity), and UPSERTs the cluster on that key.
Either way, fires messenger notifications and outbound webhooks for this run.
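The ingest steps above can be sketched against an in-memory store. Everything here is hypothetical scaffolding (`ClusterStore`, `ingest`, `run_llm`, `can_reuse`); the reuse condition from the third step is passed in as a callable so the sketch does not commit to its details, and it assumes a reused run still bumps the cluster counter.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Dict, Optional, Tuple


def sha1_hex(*parts: str) -> str:
    # Joining with "|" mirrors the notation above; the exact encoding is assumed.
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


@dataclass
class Cluster:
    lookup_hash: str
    analysis: dict  # stand-in for the stored analysis, e.g. {"severity": "high"}
    count: int
    last_seen_at: datetime


class ClusterStore:
    """In-memory stand-in for the cluster table, unique on (tenant, fingerprint_hash)."""

    def __init__(self) -> None:
        self.rows: Dict[Tuple[str, str], Cluster] = {}

    def latest_by_lookup(self, tenant: str, lookup: str) -> Optional[Cluster]:
        matches = [c for (t, _fp), c in self.rows.items()
                   if t == tenant and c.lookup_hash == lookup]
        return max(matches, key=lambda c: c.last_seen_at, default=None)

    def upsert(self, tenant: str, lookup: str, analysis: dict, now: datetime) -> Cluster:
        key = (tenant, sha1_hex(lookup, analysis["severity"]))
        row = self.rows.get(key)
        if row is None:
            row = self.rows[key] = Cluster(lookup, analysis, 0, now)
        row.count += 1
        row.analysis, row.last_seen_at = analysis, now
        return row


def ingest(store: ClusterStore, tenant: str, lookup: str, now: datetime,
           run_llm: Callable[[], dict],
           can_reuse: Callable[[Cluster, datetime], bool]) -> Tuple[dict, bool]:
    """Handle one failed run; returns (analysis, reused)."""
    prior = store.latest_by_lookup(tenant, lookup)
    if prior is not None and can_reuse(prior, now):
        analysis, reused = prior.analysis, True   # reuse: no LLM call
    else:
        analysis, reused = run_llm(), False       # fresh analysis
    # Assumption: a reused run still bumps the cluster's count and last_seen_at.
    store.upsert(tenant, lookup, analysis, now)
    # Notifications and webhooks would fire here on both paths.
    return analysis, reused
```

Keying the store on the tenant as well as the fingerprint keeps lookups tenant-scoped, so a matching hash from another tenant is never considered for reuse.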

Cluster lifecycle

A cluster always has one of three statuses:

Status	Meaning
`active`	Open, recurring. Shows up on the dashboard and triggers the recurring badge.
`acknowledged`	“We know, working on it.” Recurring badges hide so you aren’t spammed; counters keep ticking.
`resolved`	Marked fixed. A new occurrence flips status back to `active` automatically (regression detection).

You can transition a cluster from the Recurring tab in the dashboard, or via the API.
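The lifecycle behaves like a small state machine. The sketch below is illustrative only: `ClusterState` and the function names are hypothetical, and it assumes any manual transition is allowed from any status, which the rules above do not spell out.

```python
from dataclasses import dataclass

ACTIVE, ACKNOWLEDGED, RESOLVED = "active", "acknowledged", "resolved"


@dataclass
class ClusterState:
    status: str = ACTIVE
    count: int = 1


def on_new_occurrence(cluster: ClusterState) -> ClusterState:
    """Apply one more occurrence of the same failure."""
    cluster.count += 1                  # counters tick in every status
    if cluster.status == RESOLVED:
        cluster.status = ACTIVE         # regression detection
    return cluster


def acknowledge(cluster: ClusterState) -> ClusterState:
    cluster.status = ACKNOWLEDGED
    return cluster


def resolve(cluster: ClusterState) -> ClusterState:
    cluster.status = RESOLVED
    return cluster


def reopen(cluster: ClusterState) -> ClusterState:
    cluster.status = ACTIVE
    return cluster
```

Only a resolved cluster changes status on its own; an acknowledged cluster stays acknowledged however many times the failure recurs.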

Dashboard

The Recurring tab in the sidebar shows three sub-tabs (active / acknowledged / resolved) with a count badge each. Each row links to the most recent analysis. Inline buttons let you acknowledge, resolve, or reopen.

The Analyses page renders the recurring chip (×N) next to the root cause for any failure whose cluster has more than one occurrence and is not acknowledged. Click the chip to jump to the Recurring tab.
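The chip rule reduces to a single predicate. A minimal sketch with hypothetical function names, taking the rule literally (so a resolved cluster with more than one occurrence would still show the chip):

```python
def show_recurring_chip(count: int, status: str) -> bool:
    """True when the Analyses page should render the ×N chip."""
    return count > 1 and status != "acknowledged"


def chip_label(count: int) -> str:
    return f"×{count}"
```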

API

All cluster endpoints live under /api/v1/clusters and require an API token with scope=read.

List clusters

curl "https://app.exlogare.com/api/v1/clusters?status=active&limit=50" \
  -H "Authorization: Bearer exl_…"

Query parameters:

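status: one of active (the dashboard’s default tab), acknowledged, or resolved. Omit it to list clusters in every status.
limit: 1..500 (default 100).
offset: number of clusters to skip before the first returned item, for paging through results.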

Response:

{
  "items": [
    {
      "id": "9f7c…",
      "fingerprint_hash": "5d0e71c2…",
      "last_root_cause": "Connection refused on 5432",
      "last_severity": "high",
      "count": 7,
      "first_seen_at": "2026-04-20T10:00:00+00:00",
      "last_seen_at":  "2026-04-25T16:42:00+00:00",
      "status": "active",
      "last_analysis_id": "abc1…",
      "acknowledged_at": null,
      "resolved_at": null
    }
  ],
  "total": 1,
  "limit": 50,
  "offset": 0
}
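To walk every page of results, a client can advance offset until it has seen total items. The following is a sketch using only the Python standard library; it assumes `total` is the number of matching clusters across all pages and that `offset` counts items, neither of which is stated explicitly above. `build_url`, `http_fetch` and `iter_clusters` are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Dict, Iterator, Optional

BASE_URL = "https://app.exlogare.com/api/v1/clusters"


def build_url(status: Optional[str] = None, limit: int = 100, offset: int = 0) -> str:
    if not 1 <= limit <= 500:
        raise ValueError("limit must be between 1 and 500")
    params = {"limit": limit, "offset": offset}
    if status is not None:              # omit status to list every cluster
        params["status"] = status
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def http_fetch(token: str) -> Callable[[str], Dict]:
    """Return a fetcher that sends the bearer token with each GET."""
    def fetch(url: str) -> Dict:
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)
    return fetch


def iter_clusters(fetch: Callable[[str], Dict], status: Optional[str] = None,
                  limit: int = 100) -> Iterator[Dict]:
    """Yield clusters page by page until `total` items have been seen."""
    offset = 0
    while True:
        page = fetch(build_url(status, limit, offset))
        yield from page["items"]
        offset += len(page["items"])
        if not page["items"] or offset >= page["total"]:
            break
```

Typical use would be `for cluster in iter_clusters(http_fetch(token), status="active"): ...`, where `token` holds a read-scoped API token. Passing the fetcher in as an argument also makes the paging logic easy to exercise without network access.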

Get a single cluster

curl "https://app.exlogare.com/api/v1/clusters/9f7c…" \
  -H "Authorization: Bearer exl_…"

Returns the same shape as one entry in the list.

Status transitions

Acknowledge / resolve / reopen are session-only (cookie or admin JWT), so they live under /api/clusters/... rather than the public /api/v1/... surface. We deliberately don’t expose them to API tokens: letting a credential comparable to an ingest token change cluster statuses would give it too large a blast radius.

Cost-saver mode

Cost-saver is on by default for every tenant. When a duplicate failure arrives within COST_SAVER_TTL_HOURS (default 6 hours), Exlogare:

Skips the LLM call.
Reuses the previous analysis (same root cause, same fix suggestion).
Persists the ingestion event so dashboards stay accurate.
Records a clustered_reuse usage event — visible on the Usage page as “saved by cost-saver” but not billed against your plan or prepaid pool.
Still fires messenger notifications and outbound webhooks (you want to know the failure recurred).
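The time-window part of this decision is a simple comparison. A minimal sketch, assuming timezone-aware timestamps and an inclusive boundary at exactly the TTL (the boundary behaviour is not specified above); `within_cost_saver_ttl` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

COST_SAVER_TTL_HOURS = 6  # documented default


def within_cost_saver_ttl(last_seen_at: datetime, now: datetime,
                          ttl_hours: float = COST_SAVER_TTL_HOURS) -> bool:
    """True when the previous occurrence is recent enough to reuse its analysis."""
    return timedelta(0) <= now - last_seen_at <= timedelta(hours=ttl_hours)


last_seen = datetime(2026, 4, 25, 16, 42, tzinfo=timezone.utc)
recent = within_cost_saver_ttl(last_seen, last_seen + timedelta(hours=5))
stale = within_cost_saver_ttl(last_seen, last_seen + timedelta(hours=7))
```

Here `recent` is True and `stale` is False: a recurrence five hours later falls inside the default window, one seven hours later does not.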

To turn cost-saver off for your tenant, contact support — there’s no UI toggle yet because the default is correct for every customer we’ve talked to.

What collides and what doesn’t

Two failures cluster together when, after normalization, the cleaned log shape matches and the CI provider plus service_tag match. Normalization masks:

Timestamps (2026-04-25T10:00:00Z → <TS>).
UUIDs (123e4567-… → <UUID>).
Hex blobs ≥ 8 characters (commit hashes, build IDs).
Temp paths (/tmp/…).
Long numbers (build IDs, ports, durations).
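The masking rules above can be approximated with a handful of regular expressions. This is a sketch, not the production normaliser: the `<HEX>` and `<PATH>` placeholders and the four-digit threshold for "long numbers" are assumptions, and real patterns are likely broader.

```python
import re

# Order matters: specific patterns run before the generic hex and number masks.
MASKS = [
    # ISO 8601 timestamps, with optional fraction and UTC offset.
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?(?:Z|[+-]\d{2}:?\d{2})?"), "<TS>"),
    # Canonical 8-4-4-4-12 UUIDs.
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<UUID>"),
    # Anything under /tmp/ up to the next whitespace.
    (re.compile(r"/tmp/\S+"), "<PATH>"),
    # Hex blobs of 8 or more characters.
    (re.compile(r"\b[0-9a-fA-F]{8,}\b"), "<HEX>"),
    # Long numbers (assumed: 4 or more digits).
    (re.compile(r"\b\d{4,}\b"), "<N>"),
]


def normalise(log: str) -> str:
    """Mask run-specific tokens so logs with the same shape compare equal."""
    for pattern, placeholder in MASKS:
        log = pattern.sub(placeholder, log)
    return log
```

With these rules, "build #4321 timed out" and "build #4322 timed out" both become "build #<N> timed out", while ordinary words such as postgres pass through unchanged.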

What is not masked, so it keeps clusters apart:

Service / database names: postgres, mssql, mysql, redis, kafka, etc. — words survive normalization, so a Postgres timeout and an MSSQL timeout produce different lookup_hash values.
The CI provider itself: an identical-looking log from Jenkins and from CircleCI lands in different clusters.
LLM-assigned severity: the same lookup_hash with a different severity (medium → high) yields a different cluster. This is intentional — when a flaky test escalates into a real incident, the operator should see a new cluster, not a silent severity bump on the old one.

Example: Postgres timeout vs MSSQL timeout

Two superficially similar errors:

build #4321: psycopg2.OperationalError: timeout expired connecting to db.prod.local:5432
build #4322: pyodbc.OperationalError: Login timeout expired (mssql at sqlserver.prod.local:1433)

Both go through normalization (4321/4322/5432/1433 are masked to <N>), but the service detector spots psycopg in the first log → service_tag=postgres, and pyodbc/mssql in the second → service_tag=mssql. Different lookup_hash → two distinct clusters, each with its own counter and history.

If both were Postgres timeouts but the LLM tagged the first medium (a flaky retry) and the second high (a confirmed outage after several retries), lookup_hash would match but fingerprint_hash would differ — again two separate clusters: an unresolved flake and an active incident.

Auto-detected `service_tag` keywords

The detector inspects the last 2000 characters of the log (where the actual error usually lives) and picks the first matching tag from this ordered list (most specific first):

Databases: postgres, mssql, oracle, mysql, mongodb, redis, elasticsearch.
Brokers: kafka, rabbitmq.
Containers / orchestration: kubernetes, docker.
Package managers: npm, yarn, pip, poetry, gradle, maven, cargo, go-modules.
Test frameworks: pytest, jest, junit, rspec, go-test.

If nothing matches, service_tag is empty and clustering falls back to log shape + provider only. The empty-tag bucket is its own bucket, so an unrecognized stack still clusters with itself across runs.
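A first-match detector over the log tail might look like the sketch below. The tag order follows the list above; the extra keywords (`psycopg`, `pyodbc`) are taken from the worked example, while the real alias lists and matching rules are not documented here. Plain substring matching is a simplification that can misfire on short keywords such as pip.

```python
TAIL_CHARS = 2000

# Tags in the documented priority order.
ORDERED_TAGS = [
    "postgres", "mssql", "oracle", "mysql", "mongodb", "redis", "elasticsearch",
    "kafka", "rabbitmq",
    "kubernetes", "docker",
    "npm", "yarn", "pip", "poetry", "gradle", "maven", "cargo", "go-modules",
    "pytest", "jest", "junit", "rspec", "go-test",
]

# Additional trigger words per tag (assumed; only these two appear in the example).
EXTRA_KEYWORDS = {"postgres": ["psycopg"], "mssql": ["pyodbc"]}


def detect_service_tag(log: str) -> str:
    """Return the first matching tag in the last TAIL_CHARS characters, or ''."""
    tail = log[-TAIL_CHARS:].lower()
    for tag in ORDERED_TAGS:
        keywords = [tag, *EXTRA_KEYWORDS.get(tag, [])]
        if any(keyword in tail for keyword in keywords):
            return tag
    return ""
```

Because only the tail is inspected, a service name that appears solely in an early setup section yields an empty tag.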

What the fingerprint can still miss

Very long logs (> 4000 normalised characters). Only the last 4000 characters of the normalised log are hashed. If the discriminating detail lives near the start of the log (a service name in the setup section) and the tail looks generic (a stack trace common to many bugs), the lookup_hash may collide. Rare in practice — the failing step is almost always near the end.
Unrecognized service. When nothing in the tail matches the keyword list, service_tag is empty. Two genuinely different failures with no recognizable infrastructure mention can then share a cluster if their normalised tails also match. The trade-off favours stable clustering; explicitly marking the misclustered row resolved rebuilds it on the next occurrence.

Two genuinely different errors keep their own clusters. If the analyzer ever returns the same root cause for two unrelated failures, mark one cluster resolved — the next recurrence rebuilds the row.

Privacy

Clusters live in the same tenant boundary as everything else: rows from other tenants are never returned, even if their fingerprints collide by coincidence. The fingerprint hash is computed from the cleaned log and stored alongside the analysis (which itself never persists raw log content — see data privacy for the full invariants).