NixosBatchDeployCascade: 4 NixosHostDeployFailed → node failure → pod crashes, DNS out, backup failures (2026-07-01T19:22Z) #80

Open

opened 2026-07-01 20:16:12 +00:00 by mhugo · 0 comments

## Root Cause

A batch NixOS deploy failed across 4 hosts between roughly 19:22Z and 19:30Z. All 4 NixosHostDeployFailed alerts started at 19:30Z:

- **cc-se-sto-core-01** (sto1): k3s post-activation failure → NodeNotReady at 19:35Z. Cilium canary probes from 4 sites (nue1, hel1, fsn1, lau1) have all been failing to this target since 19:22-19:26Z.
- **cc-fi-hel-core-01** (hel1): deploy failed
- **cc-fi-hel-k3s-02** (hel1): deploy failed (k3s node)
- **cc-fr-lau-store-01** (lau1): deploy failed (Smokeping probe source)
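
The "same alert, same timestamp, several hosts" pattern above is what points to a batch deploy rather than four unrelated failures. A minimal sketch of that correlation follows. The record layout, the `find_batches` helper, the 10-minute window and the exact start times are assumptions made for illustration; they are not taken from Alertmanager output.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical alert records. Host names come from this issue; the field
# names and second-level timestamps are assumed for the sketch.
ALERTS = [
    {"alertname": "NixosHostDeployFailed", "host": "cc-se-sto-core-01", "starts_at": "2026-07-01T19:30:00+00:00"},
    {"alertname": "NixosHostDeployFailed", "host": "cc-fi-hel-core-01", "starts_at": "2026-07-01T19:30:00+00:00"},
    {"alertname": "NixosHostDeployFailed", "host": "cc-fi-hel-k3s-02", "starts_at": "2026-07-01T19:30:00+00:00"},
    {"alertname": "NixosHostDeployFailed", "host": "cc-fr-lau-store-01", "starts_at": "2026-07-01T19:30:00+00:00"},
    {"alertname": "NodeNotReady", "host": "cc-se-sto-core-01", "starts_at": "2026-07-01T19:35:00+00:00"},
]


def find_batches(alerts, window=timedelta(minutes=10), min_hosts=2):
    """Return {alertname: sorted hosts} for alerts that fired on at least
    `min_hosts` distinct hosts with all start times inside `window`."""
    by_name = defaultdict(list)
    for alert in alerts:
        started = datetime.fromisoformat(alert["starts_at"])
        by_name[alert["alertname"]].append((started, alert["host"]))

    batches = {}
    for name, items in by_name.items():
        items.sort()  # earliest start first
        span = items[-1][0] - items[0][0]
        hosts = sorted({host for _, host in items})
        if len(hosts) >= min_hosts and span <= window:
            batches[name] = hosts
    return batches


if __name__ == "__main__":
    for name, hosts in find_batches(ALERTS).items():
        print(f"{name}: {len(hosts)} hosts in one window: {', '.join(hosts)}")
```

With the sample records, only NixosHostDeployFailed is reported as a batch; the single NodeNotReady alert is left out because it fired on one host.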

## Cascade Effects (confirmed from Alertmanager + Loki)

1. **Node failure**: cc-se-sto-core-01 is NotReady; canary probes from 4 sites all fail to this one target, which indicates the node is completely unreachable.
2. **Pod crashes**: repowise (CrashLoopBackOff; Loki logs show apt-get permission denied, because a non-root container runs apt at startup) and external-dns (Loki logs show the flag parsing error "unexpected false"). Both pods run on node cc-fi-hel-k3s-04.
3. **Deployment replica loss**: external-dns, flakecache, operations-memory, and repowise have all been missing replicas since 19:40Z.
4. **DNS outage**: PowerDNS master (ns-master) down since 19:31Z; zone propagation is blocked.
5. **Headscale operator**: Service namespace gone (DNS "no such host" for headscale-operator-metrics.svc).
6. **Backup failures**: ~13 KubernetesAgentBackupControlJobFailed alerts (BackoffLimitExceeded) since 19:30Z.
7. **Longhorn maintenance**: 3x snapshot-purge-watchdog PITFALL-252 (chronic, benign).
8. **CNPG stale backup**: flakecache-postgres has had no base backup for >50h (critical).
9. **Smokeping re-fire**: cc-fr-lau-store-01 → 89.167.50.230, chronic (tick re-fire, benign).
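
The node-failure item relies on canary convergence: when every probing site fails to reach the same target, the fault is most likely at the target and not on one network path. The sketch below illustrates that reasoning under stated assumptions. The tuple layout, the `unreachable_targets` helper, the `min_sources` threshold and the `example-target` rows are hypothetical; only the four sites and the failing target come from this issue.

```python
from collections import defaultdict

# Hypothetical probe results as (source_site, target, ok).
# The first four rows mirror the failures described in this issue;
# "example-target" is a made-up contrast case with a single failing path.
PROBES = [
    ("nue1", "cc-se-sto-core-01", False),
    ("hel1", "cc-se-sto-core-01", False),
    ("fsn1", "cc-se-sto-core-01", False),
    ("lau1", "cc-se-sto-core-01", False),
    ("nue1", "example-target", True),
    ("hel1", "example-target", True),
    ("lau1", "example-target", False),
]


def unreachable_targets(probes, min_sources=3):
    """Return targets where every probing site failed and at least
    `min_sources` distinct sites were probing."""
    seen = defaultdict(set)    # target -> sites that probed it
    failed = defaultdict(set)  # target -> sites whose probe failed
    for source, target, ok in probes:
        seen[target].add(source)
        if not ok:
            failed[target].add(source)
    return sorted(
        target
        for target, sources in seen.items()
        if len(failed[target]) >= min_sources and failed[target] == sources
    )


if __name__ == "__main__":
    print(unreachable_targets(PROBES))
```

Requiring all sources to fail, not just a majority, keeps a single bad path (as in the contrast case) from being reported as a dead node.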

## Confidence Levels

- **HIGH**: NixosHostDeployFailed (4 hosts, same timestamp), Cilium canary convergence (4 sites → same target), NodeNotReady
- **HIGH**: external-dns crash loop (flag parsing error from Loki), repowise crash loop (apt-get permission denied from Loki)
- **HIGH**: Headscale operator metrics: service namespace removed (DNS "no such host")
- **MEDIUM**: CNPG stale backup: attributed to the node-failure cascade (the >50h gap itself began before this deploy)
- **LOW**: MailHeartbeatMissed: possibly caused by the DNS master outage breaking the ticket pipeline
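
The MEDIUM rating for the CNPG item can be sanity-checked with simple date arithmetic: a gap of more than 50 hours puts the last base backup well before the deploy window. The sketch assumes the ">50h" figure was observed around the time this issue was opened (20:16:12Z); that observation time is an assumption, not something the alert data states.

```python
from datetime import datetime, timedelta, timezone


def latest_possible_backup(observed_at, min_gap_hours):
    """Given a 'no backup for more than N hours' observation, return the
    latest moment at which the last backup could have completed."""
    return observed_at - timedelta(hours=min_gap_hours)


# Assumed observation time: when this issue was opened.
observed = datetime(2026, 7, 1, 20, 16, 12, tzinfo=timezone.utc)
# Earliest failure time reported for the deploy.
deploy_start = datetime(2026, 7, 1, 19, 22, tzinfo=timezone.utc)

latest_backup = latest_possible_backup(observed, 50)

if __name__ == "__main__":
    print("last base backup no later than:", latest_backup.isoformat())
    print("predates the deploy:", latest_backup < deploy_start)
```

Under that assumption the last base backup finished no later than 2026-06-29T18:16:12Z, about two days before the deploy, so the node failure can explain why no new backup succeeded during the incident but not the whole gap.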

## Resolution Priority

1. **NixosHostDeployFailed**: operator SSH to each host to diagnose its failure mode (the modes may differ between hosts)
2. **Node cc-se-sto-core-01**: k3s post-activation failure (Mode C); needs a config rollback
3. **external-dns**: flag parsing error in the deployment spec (K8s API access needed to inspect it)
4. **repowise**: non-root container runs apt-get at startup (Ref: container-startup-apt-get-permission-denied skill)
5. **DNS master**: PowerDNS master down; needs investigation

## Environment Note

/tmp is missing on the agent, so all file tools (write_file, read_file, patch) are blocked. /tmp must be created externally before cron jobs can execute.
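
A preflight check run before each cron job would surface this condition early instead of failing inside the file tools. The following is a minimal sketch; the `temp_dir_usable` helper is hypothetical and not part of the agent, and it only detects the problem, since the note above says /tmp has to be created externally.

```python
import os
import sys
import tempfile


def temp_dir_usable(path="/tmp"):
    """Return True if `path` is an existing directory in which a
    temporary file can actually be created."""
    if not os.path.isdir(path):
        return False
    try:
        # TemporaryFile removes the file again when the block exits.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    if not temp_dir_usable("/tmp"):
        print("preflight failed: /tmp is missing or not writable", file=sys.stderr)
        sys.exit(1)
    print("preflight ok: /tmp is usable")
```

Creating a real temporary file, and not only checking that the directory exists, also catches a /tmp that is present but read-only or has the wrong permissions.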


Reference

singularity/singularity-forge#80
