singularity/singularity-forge

Platform incident 2026-07-01T19:30Z — NixosHostDeployFailed cascade + external-dns flag parsing crash #79

opened 2026-07-01 20:15:59 +00:00 by mhugo · 0 comments

Executive Summary

A multi-site infrastructure incident began at 19:22Z with a network partition affecting cc-se-sto-core-01 at the sto1 (Stockholm) site. It was followed at 19:30Z by a synchronized batch of four NixosHostDeployFailed events across hosts at three sites (cc-fr-lau-store-01, cc-fi-hel-k3s-02, cc-fi-hel-core-01, cc-se-sto-core-01) and by a separate external-dns HelmRelease crashloop triggered by a flag parsing regression. By 19:40Z, at least five Kubernetes deployments were reporting unavailable replicas. Total active alert count: 37.

Status: Still active at 20:15Z. NixOS host deploys are unresolved, the external-dns HelmRelease is in a failed state ("cannot remediate failed release"), node cc-se-sto-core-01 is still NotReady, and the Grafana and repowise deployments are down.

Cascade Timeline

| Time (UTC) | Event |
|------------|-------|
| 19:22Z | Cilium canary probes begin failing for cc-se-sto-core-01 from 4 source sites: nue1, lau1, hel1, fsn1 |
| 19:26Z | Cilium canary tcp_connect alert fires (target=cc-se-sto-core-01, sources=cc-fr-lau-store-02, cc-de-fsn-core-01) |
| 19:26Z | HeadscaleOperatorMetricsDown fires (metrics endpoint unreachable) |
| 19:30Z | NixosHostDeployFailed batch: cc-fr-lau-store-01, cc-fi-hel-k3s-02, cc-se-sto-core-01, cc-fi-hel-core-01 (all phase=Failed) |
| 19:31Z | DNSMasterDown fires (ns-master down 15m+, zone updates blocked) |
| 19:32Z | SmokepingInterSiteLatencyHigh fires (cc-fr-lau-store-01 → 89.167.50.230 p95 >80ms) |
| 19:35Z | KubernetesPodCrashLooping: external-dns and repowise pods enter CrashLoopBackOff |
| 19:35Z | KubernetesNodeNotReady: cc-se-sto-core-01 node status changes to NotReady |
| 19:40Z | KubernetesDeploymentReplicasUnavailable: external-dns, flakecache, operations-memory, grafana, repowise (5 deployments missing replicas) |
| 19:41Z+ | LonghornMaintenanceJobFailed continues (chronic, PITFALL-252) |
| 19:49Z | CNPGBaseBackupVeryStale fires for flakecache-postgres (>50h since last backup) |
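A quick way to spot a synchronized batch in a timeline like this one is to count identical alerts firing in the same minute. The sketch below is illustrative only: the `EVENTS` list is transcribed by hand from the rows above (one entry per affected host for the 19:30Z batch), and both `find_bursts` and its `min_count` threshold are hypothetical names chosen for this example, not part of any existing tooling.

```python
from collections import Counter

# (UTC minute, alert name) pairs transcribed from the timeline.
# Each NixosHostDeployFailed entry stands for one affected host.
EVENTS = [
    ("19:26", "CiliumCanaryProbeFailing"),
    ("19:26", "HeadscaleOperatorMetricsDown"),
    ("19:30", "NixosHostDeployFailed"),  # cc-fr-lau-store-01
    ("19:30", "NixosHostDeployFailed"),  # cc-fi-hel-k3s-02
    ("19:30", "NixosHostDeployFailed"),  # cc-se-sto-core-01
    ("19:30", "NixosHostDeployFailed"),  # cc-fi-hel-core-01
    ("19:31", "DNSMasterDown"),
    ("19:35", "KubernetesPodCrashLooping"),
    ("19:35", "KubernetesNodeNotReady"),
]


def find_bursts(events, min_count=3):
    """Return {(minute, alert): count} for alerts repeated within one minute.

    min_count is an assumed threshold for what counts as a burst.
    """
    counts = Counter(events)
    return {key: n for key, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    for (minute, alert), n in find_bursts(EVENTS).items():
        print(f"{minute}Z {alert} x{n}")
```

With this input only the 19:30Z NixosHostDeployFailed group crosses the threshold, which matches the batch called out in the timeline.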

Real Incidents (Action Required)

1. Node Partition / Host Failure: cc-se-sto-core-01 (sto1)

Severity: Critical

Evidence:

4x CiliumCanaryProbeFailing alerts (tcp_connect failures from cc-de-nue-k3s-01, cc-fr-lau-store-02, cc-de-fsn-core-01, cc-fi-hel-core-01)
NixosHostDeployFailed with phase=Failed
KubernetesNodeNotReady alert

Assessment: cc-se-sto-core-01 is unreachable from all four probing sites, which points to a fault at the host or at sto1 itself: hardware failure, network equipment failure, or a kernel panic or oops.
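The reasoning behind this assessment can be sketched as a small classifier: if every independent source fails to reach the target, the fault is most likely on the target side, while a partial failure suggests a problem on specific paths. This is a minimal sketch under that assumption. `localize_fault` and its return labels are hypothetical, and the probe results are filled in by hand from the evidence list above.

```python
def localize_fault(probe_ok):
    """Classify a reachability fault from per-source probe results.

    probe_ok maps each source host to True if its probe reached the target.
    """
    if not probe_ok:
        return "no-data"
    failed = [src for src, ok in probe_ok.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(probe_ok):
        # Every vantage point fails: suspect the target host or its site.
        return "target-side"
    # Only some vantage points fail: suspect individual network paths.
    return "path-specific"


# tcp_connect results toward cc-se-sto-core-01, per the alerts listed above.
PROBES = {
    "cc-de-nue-k3s-01": False,
    "cc-fr-lau-store-02": False,
    "cc-de-fsn-core-01": False,
    "cc-fi-hel-core-01": False,
}

if __name__ == "__main__":
    print(localize_fault(PROBES))
```

The classification narrows where to look but cannot distinguish hardware, network equipment, and kernel failures from one another. That still needs out-of-band checks on the host.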

2. NixosHostDeployFailed — Synchronized Batch at 19:30Z

Severity: High

Hosts affected:

cc-fr-lau-store-01 (lau1)
cc-fi-hel-k3s-02 (hel1)
cc-fi-hel-core-01 (hel1)
cc-se-sto-core-01 (sto1)

Assessment: Simultaneous failure across three sites suggests a systemic issue: a deploy-rs operator bug, a problematic NixOS configuration change rolled out to all hosts at once, or a shared dependency failure.
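One way to test the "systemic" hypothesis is to set aside hosts whose failure is already explained by the sto1 partition and see how many sites remain affected. The sketch below assumes the host-to-site mapping given in the list above. `unexplained_by_site` and the two-site threshold are hypothetical choices for illustration.

```python
from collections import defaultdict

# Host-to-site mapping as stated in the affected-hosts list.
HOST_SITE = {
    "cc-fr-lau-store-01": "lau1",
    "cc-fi-hel-k3s-02": "hel1",
    "cc-fi-hel-core-01": "hel1",
    "cc-se-sto-core-01": "sto1",
}


def unexplained_by_site(failed_hosts, unreachable_hosts):
    """Group failed hosts by site, skipping hosts known to be unreachable."""
    by_site = defaultdict(list)
    for host in failed_hosts:
        if host in unreachable_hosts:
            continue  # failure already explained by the partition
        by_site[HOST_SITE[host]].append(host)
    return dict(by_site)


remaining = unexplained_by_site(list(HOST_SITE), {"cc-se-sto-core-01"})
# Assumed heuristic: failures at two or more sites with no local
# explanation are treated as a sign of a shared cause.
looks_systemic = len(remaining) >= 2

if __name__ == "__main__":
    print(remaining, looks_systemic)
```

Here three hosts at two sites remain unexplained after excluding cc-se-sto-core-01, which is why the partition alone does not account for the batch.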

3. External-DNS Crashloop — Flag Parsing Crash

Severity: High

Evidence:

external-dns pods enter CrashLoopBackOff at 19:35Z
Both external-dns and external-dns-pdns HelmReleases stuck: "terminal error: exceeded maximum retries: cannot remediate failed release"
Source-controller logs: "stored artifact for commit 'Fix ExternalDNS Traefik source flags'" at 20:08Z

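Assessment: Broken flag configuration in the recent commit "Fix ExternalDNS Traefik source flags". The Helm chart upgrade fails because the container crashes on startup with an unrecognized command-line flag. Note that the artifact for this commit was stored at 20:08Z, after the crashloop began at 19:35Z, so confirm which revision was live at 19:35Z before rolling back.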
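Before rolling anything back, it can help to check that a candidate change actually reached the cluster before the symptom started. The sketch below compares the two timestamps cited in the evidence above. The helper names `utc` and `precedes_symptom` are hypothetical, and the date is taken from the incident title.

```python
from datetime import datetime, timezone


def utc(hhmm, day="2026-07-01"):
    """Build a timezone-aware UTC datetime from an HH:MM string."""
    parsed = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def precedes_symptom(change_time, symptom_time):
    """A change can only explain a symptom that starts at or after it."""
    return change_time <= symptom_time


crashloop_start = utc("19:35")   # external-dns enters CrashLoopBackOff
artifact_stored = utc("20:08")   # source-controller stores the commit artifact

lag_minutes = int((artifact_stored - crashloop_start).total_seconds() // 60)

if __name__ == "__main__":
    print(precedes_symptom(artifact_stored, crashloop_start), lag_minutes)
```

A negative result does not by itself clear the commit, since an earlier revision of the same change may have been applied before 19:35Z. It does mean the revision history should be checked rather than assumed.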

4. Grafana Deployment Down

Severity: Medium
Evidence: KubernetesDeploymentReplicasUnavailable for monitoring/grafana

Chronic Noise (No Intervention Needed)

LonghornMaintenanceJobFailed (5+ active): PITFALL-252. Root cause: the `: > "$work"` redirect in start.sh is blocked by readOnlyRootFilesystem. Auto-resolves in ~40m.
KubernetesAgentBackupControlJobFailed (13+ active): backup-audit and backup-label-reconciler jobs hit BackoffLimitExceeded. Chronic noise.
SmokepingInterSiteLatencyHigh (3 targets): persistently high latency to known slow targets.
CNPGBaseBackupStale: flakecache/flakecache-postgres (>28h stale). Note that a healthy WAL archive alone does not make the cluster restorable.
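The split between real incidents and chronic noise can be expressed as a simple name-based filter. This is a minimal sketch: the `CHRONIC` set contains only the four alert names listed above, while `triage` and the sample input are hypothetical and do not reflect how Alertmanager itself is configured.

```python
# Alert names classified as chronic noise in this report.
CHRONIC = frozenset({
    "LonghornMaintenanceJobFailed",
    "KubernetesAgentBackupControlJobFailed",
    "SmokepingInterSiteLatencyHigh",
    "CNPGBaseBackupStale",
})


def triage(active_alerts):
    """Split alert names into (actionable, chronic), preserving order."""
    actionable, chronic = [], []
    for name in active_alerts:
        (chronic if name in CHRONIC else actionable).append(name)
    return actionable, chronic


# Hand-picked sample of alert names that appear in this report.
SAMPLE = [
    "KubernetesNodeNotReady",
    "LonghornMaintenanceJobFailed",
    "NixosHostDeployFailed",
    "CNPGBaseBackupStale",
    "KubernetesPodCrashLooping",
]

if __name__ == "__main__":
    actionable, chronic = triage(SAMPLE)
    print("actionable:", actionable)
    print("chronic:", chronic)
```

Matching on alert name alone is coarse. A chronic alert that changes in count or target during an incident may still deserve a look.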

Root Cause Analysis

Primary: Network partition or hardware failure at cc-se-sto-core-01 (sto1) starting ~19:22Z. The host's unreachability is confirmed by canary probes from 4 independent source sites.

Secondary: Synchronized NixosHostDeployFailed across 4 hosts at 19:30Z. The sto1 partition explains the cc-se-sto-core-01 failure, but the other 3 hosts (one at lau1, two at hel1) point to a systemic issue (operator bug, bad NixOS change, or shared dependency outage).

Tertiary: External-dns flag parsing crash, a separate root cause attributed to the "Fix ExternalDNS Traefik source flags" GitOps commit.

Impact

Network: cc-se-sto-core-01 unreachable from all cluster nodes
DNS: External-dns down → new/updated DNS records not propagating (PowerDNS master also down)
Monitoring: Grafana down → reduced observability
NixOS Deployments: 4 host deployments failed
Data: CNPG base backup stale for flakecache-postgres

Recommended Actions

URGENT: Investigate cc-se-sto-core-01 — check power, network, IPMI, hardware status
URGENT: Roll back "Fix ExternalDNS Traefik source flags" commit; remediate external-dns HelmReleases
HIGH: Investigate synchronized NixosHostDeployFailed — check deploy-rs operator logs at 19:30Z
MEDIUM: Check flakecache-postgres backup status and barman-cloud plugin logs
MEDIUM: Restore Grafana — check if sto1-affinity scheduling issue or separate problem

Operational Notes

K8s MCP unreachable (76+ consecutive failures)
Prometheus MCP unreachable (7+ consecutive failures)
ClickHouse cluster-logs memory limit exceeded (3.60/3.60 GiB)
/tmp missing from agent sandbox — terminal/executable tools completely blocked
Alertmanager cluster: 3 peers, status ready (v0.33.0, rev 5d3ceb55)
37 total active alerts (all non-silenced) at analysis time
Pre-existing issue tracking initial triage: #78

