Alarm Graph Fallback — 2026-07-01T19:45Z #78

Open
opened 2026-07-01 19:50:02 +00:00 by mhugo · 0 comments
Owner

Alarm Graph Fallback — 2026-07-01T19:45Z

Alert Batch: 37 Active Alerts

K8s Platform — REAL Incidents

1. DeploymentReplicasUnavailable (3 deployments) — Started 19:40Z

  • external-dns/external-dns → missing replicas (pod CrashLooping since 19:35Z)
  • flakecache/flakecache → missing replicas
  • operations-memory/operations-memory → missing replicas

2. CrashLooping pods (2) — 19:35Z

  • external-dns/external-dns-697d9488d7-2hb64 → CrashLoopBackOff
  • repowise/repowise-569c79c458-x6hhw → CrashLoopBackOff

3. CiliumCanaryProbeFailing (2) — 19:22Z

  • cc-se-sto-core-01 unreachable from cc-fr-lau-store-02 (lau1→sto1) AND cc-de-fsn-core-01 (fsn1→sto1)
  • Different source sites, same target → points to a network partition isolating the sto1 node rather than a fault at either source

4. NixosHostDeployFailed (3) — 19:30Z batch

  • cc-fr-lau-store-01, cc-fi-hel-k3s-02, cc-se-sto-core-01 all failed deploy at exactly 19:30Z
  • cc-se-sto-core-01 matches the Cilium canary target — likely the same root cause
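The overlap between the canary target and the failed deploys can be checked mechanically. A minimal sketch, assuming the host names are copied by hand from this report; the variable names are illustrative and not part of any alerting tool:

```python
# Hosts named by each alert group, transcribed from the report above.
cilium_canary_targets = {"cc-se-sto-core-01"}
nixos_deploy_failures = {
    "cc-fr-lau-store-01",
    "cc-fi-hel-k3s-02",
    "cc-se-sto-core-01",
}

# Hosts that appear in both groups are candidates for a shared root cause.
shared_hosts = cilium_canary_targets & nixos_deploy_failures
print(sorted(shared_hosts))
```

The other two failed hosts (cc-fr-lau-store-01, cc-fi-hel-k3s-02) do not appear as canary targets, so the intersection only narrows the shared-cause hypothesis to sto1.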

5. DNSMasterDown (1) — 18:42Z

  • ns-master down 15m+ per the alert (firing since 18:42Z, about 63 minutes at report time), zone updates blocked

Chronic Noise (No Intervention Needed)

6. LonghornMaintenanceJobFailed (3) — PITFALL-252 (snapshot-purge-watchdog BackoffLimitExceeded)
7. KubernetesAgentBackupControlJobFailed (~13) — backup-audit / backup-label-reconciler BackoffLimitExceeded
8. SmokepingInterSiteLatencyHigh (3) — cc-fr-lau-store-01 chronic latency targets (89.167.50.230, 204.168.217.156, 49.13.125.237)
9. HeadscaleOperatorMetricsDown (1) — likely persistent (namespace/service removed)

Cascade Timeline

19:22Z → Cilium canary fails (cc-se-sto-core-01 unreachable)
19:30Z → 3 NixosHost deploy failures (including cc-se-sto-core-01, the host already failing the canary)
19:35Z → external-dns + repowise pods crashloop
19:40Z → 3 deployments report replicas unavailable
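The ordering can be reproduced by sorting the alert start times. A minimal sketch, assuming the times are transcribed from this report and the report time is the 19:45Z in the title; `starts`, `at`, and `ages_min` are hypothetical names. It also includes DNSMasterDown (18:42Z), which the timeline above does not list:

```python
from datetime import datetime, timezone

# Alert start times (UTC, 2026-07-01), transcribed from the report.
starts = {
    "DeploymentReplicasUnavailable": "19:40",
    "CrashLooping pods": "19:35",
    "CiliumCanaryProbeFailing": "19:22",
    "NixosHostDeployFailed": "19:30",
    "DNSMasterDown": "18:42",
}

def at(hhmm):
    """Parse an HH:MM string as a UTC datetime on the report date."""
    parsed = datetime.strptime("2026-07-01 " + hhmm, "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)

report_time = at("19:45")

# Earliest alert first.
timeline = sorted(starts.items(), key=lambda item: at(item[1]))

# Minutes each alert had been firing when the report was written.
ages_min = {
    name: int((report_time - at(hhmm)).total_seconds() // 60)
    for name, hhmm in starts.items()
}

for name, hhmm in timeline:
    print(f"{hhmm}Z  {name}  ({ages_min[name]} min before report)")
```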

Classification

  • Real cascade incident: sto1 host cc-se-sto-core-01 becomes unreachable (19:22Z) and fails its NixOS deploy (19:30Z) → downstream pods crash → deployments lose replicas. Which of the first two events caused the other is not yet established.
  • Chronic noise: Longhorn PITFALL-252, BackupControlJob, Smokeping chronic targets, Headscale operator metrics
  • Action needed: Investigate cc-se-sto-core-01 host health and the NixOS deploy operator logs; check why external-dns and repowise are crashlooping
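The split between real incidents and chronic noise amounts to a lookup against a known-noise list. A minimal sketch, assuming the chronic set is exactly the four alert names listed under Chronic Noise in this report; `CHRONIC` and `classify` are hypothetical names, not an existing tool:

```python
# Alert names this report treats as chronic noise.
CHRONIC = {
    "LonghornMaintenanceJobFailed",
    "KubernetesAgentBackupControlJobFailed",
    "SmokepingInterSiteLatencyHigh",
    "HeadscaleOperatorMetricsDown",
}

def classify(alert_name):
    """Return 'chronic' for known noise, 'investigate' for everything else."""
    return "chronic" if alert_name in CHRONIC else "investigate"

# Alert names from the batch above.
batch = [
    "DeploymentReplicasUnavailable",
    "CiliumCanaryProbeFailing",
    "NixosHostDeployFailed",
    "DNSMasterDown",
    "LonghornMaintenanceJobFailed",
    "SmokepingInterSiteLatencyHigh",
]

to_investigate = [name for name in batch if classify(name) == "investigate"]
print(to_investigate)
```

Defaulting unknown alert names to `investigate` is a deliberate choice in this sketch: a new alert type is surfaced rather than silently filed as noise.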

Notes

  • K8s MCP unreachable (68+ consecutive failures) — agent cannot query live API
  • /tmp directory missing from agent sandbox — execute_code blocked
  • Prometheus MCP unreachable (5+ failures)
Reference
singularity/singularity-forge#78