Alarm Graph Fallback — 2026-07-01T19:45Z #78

Open

opened 2026-07-01 19:50:02 +00:00 by mhugo · 0 comments

Alarm Graph Fallback — 2026-07-01T19:45Z

Alert Batch: 37 Active Alerts

K8s Platform — REAL Incidents

1. DeploymentReplicasUnavailable (3 deployments) — Started 19:40Z

external-dns/external-dns → missing replicas (pod CrashLooping since 19:35Z)
flakecache/flakecache → missing replicas
operations-memory/operations-memory → missing replicas

2. CrashLooping pods (2) — 19:35Z

external-dns/external-dns-697d9488d7-2hb64 → CrashLoopBackOff
repowise/repowise-569c79c458-x6hhw → CrashLoopBackOff

3. CiliumCanaryProbeFailing (2) — 19:22Z

cc-se-sto-core-01 unreachable from cc-fr-lau-store-02 (lau1→sto1) AND cc-de-fsn-core-01 (fsn1→sto1)
Different source sites, same target → network partition to sto1 node
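The "different source sites, same target" reasoning can be sketched as a small grouping step. This is a minimal illustration, not the alerting pipeline's actual logic: the tuple layout, the `suspect_targets` name, and the `min_sites` threshold are assumptions; only the hosts and sites come from the alerts above.

```python
from collections import defaultdict

# Probe failures as (source_host, source_site, target_host).
# Hosts and sites are copied from the alert text; the tuple layout is hypothetical.
FAILURES = [
    ("cc-fr-lau-store-02", "lau1", "cc-se-sto-core-01"),
    ("cc-de-fsn-core-01", "fsn1", "cc-se-sto-core-01"),
]


def suspect_targets(failures, min_sites=2):
    """Return targets that are unreachable from at least `min_sites` distinct sites."""
    sites_by_target = defaultdict(set)
    for _source_host, source_site, target in failures:
        sites_by_target[target].add(source_site)
    return {
        target: sorted(sites)
        for target, sites in sites_by_target.items()
        if len(sites) >= min_sites
    }


print(suspect_targets(FAILURES))
```

A target that fails from only one site stays out of the result, since a single failing path could just as well be a problem on the source side.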

4. NixosHostDeployFailed (3) — 19:30Z batch

cc-fr-lau-store-01, cc-fi-hel-k3s-02, cc-se-sto-core-01 all failed deploy at exactly 19:30Z
cc-se-sto-core-01 matches the Cilium canary target — likely the same root cause

5. DNSMasterDown (1) — 18:42Z

ns-master down since 18:42Z (about an hour at report time, well past the 15m+ mark), zone updates blocked

Chronic Noise (No Intervention Needed)

6. LonghornMaintenanceJobFailed (3) — PITFALL-252 (snapshot-purge-watchdog BackoffLimitExceeded)
7. KubernetesAgentBackupControlJobFailed (~13) — backup-audit / backup-label-reconciler BackoffLimitExceeded
8. SmokepingInterSiteLatencyHigh (3) — cc-fr-lau-store-01 chronic latency targets (89.167.50.230, 204.168.217.156, 49.13.125.237)
9. HeadscaleOperatorMetricsDown (1) — likely persistent (namespace/service removed)

Cascade Timeline

19:22Z → Cilium canary fails (cc-se-sto-core-01 unreachable)
19:30Z → 3 NixosHostDeployFailed alerts (including cc-se-sto-core-01, the host the canary could not reach)
19:35Z → external-dns + repowise pods crashloop
19:40Z → 3 deployments report replicas unavailable
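To make the ordering in the timeline explicit, the events can be sorted and annotated with minutes elapsed since the first one. This is a rough sketch under stated assumptions: the `EVENTS` list and `with_offsets` helper are hypothetical, all timestamps are taken to fall on the same UTC day, and nothing here handles a window that crosses midnight.

```python
from datetime import datetime

# (HH:MM in UTC, event) pairs copied from the timeline, deliberately unsorted.
EVENTS = [
    ("19:35", "external-dns + repowise pods crashloop"),
    ("19:22", "Cilium canary fails (cc-se-sto-core-01 unreachable)"),
    ("19:40", "3 deployments report replicas unavailable"),
    ("19:30", "3 NixosHost deploy failures"),
]


def with_offsets(events):
    """Sort events by time and attach whole minutes elapsed since the earliest one."""
    parsed = sorted((datetime.strptime(t, "%H:%M"), label) for t, label in events)
    start = parsed[0][0]
    return [
        (int((ts - start).total_seconds() // 60), label)
        for ts, label in parsed
    ]


for minutes, label in with_offsets(EVENTS):
    print(f"T+{minutes:>2}m  {label}")
```

Read this way, the canary failure is T+0 and the deploy failures follow at T+8, which is worth keeping in mind when deciding which event triggered which.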

Classification

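Real cascade incident: cc-se-sto-core-01 (sto1) unreachable at 19:22Z → NixOS deploy failure on that host at 19:30Z → downstream pods crash → deployments lose replicas. The timeline puts the network failure before the deploy failure, so the deploy failure is probably a symptom rather than the trigger.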
Chronic noise: Longhorn PITFALL-252, BackupControlJob, Smokeping chronic targets, Headscale operator metrics
Action needed: Investigate cc-se-sto-core-01 host health and NixOS deploy operator logs; check why external-dns and repowise are crashlooping
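The real-versus-chronic split above amounts to a lookup against a known-noise list. The sketch below is only an illustration of that idea: the `CHRONIC` set, the `classify` helper, and the `BATCH` dictionary are hypothetical, and `BATCH` includes only alerts whose exact name and count appear in this report (the crashlooping-pod alert name and the approximate backup-job count are left out).

```python
# Alert names this report treats as chronic noise.
CHRONIC = {
    "LonghornMaintenanceJobFailed",
    "KubernetesAgentBackupControlJobFailed",
    "SmokepingInterSiteLatencyHigh",
    "HeadscaleOperatorMetricsDown",
}


def classify(alert_counts):
    """Split {alertname: count} into 'real' and 'chronic' buckets."""
    buckets = {"real": {}, "chronic": {}}
    for name, count in alert_counts.items():
        bucket = "chronic" if name in CHRONIC else "real"
        buckets[bucket][name] = count
    return buckets


# Subset of the batch with exact names and counts from this report.
BATCH = {
    "DeploymentReplicasUnavailable": 3,
    "CiliumCanaryProbeFailing": 2,
    "NixosHostDeployFailed": 3,
    "DNSMasterDown": 1,
    "LonghornMaintenanceJobFailed": 3,
    "SmokepingInterSiteLatencyHigh": 3,
    "HeadscaleOperatorMetricsDown": 1,
}

result = classify(BATCH)
print("real:", sorted(result["real"]))
print("chronic:", sorted(result["chronic"]))
```

Anything not on the noise list defaults to "real", so a new alert name is surfaced rather than silently suppressed.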

Notes

K8s MCP unreachable (68+ consecutive failures) — agent cannot query live API
/tmp directory missing from agent sandbox — execute_code blocked
Prometheus MCP unreachable (5+ failures)


mhugo referenced this issue

2026-07-01 20:16:00 +00:00

Platform incident 2026-07-01T19:30Z — NixosHostDeployFailed cascade + external-dns flag parsing crash #79

Reference

singularity/singularity-forge#78
