Platform incident 2026-07-01T19:30Z — NixosHostDeployFailed cascade + external-dns flag parsing crash #79
Platform Incident 2026-07-01T19:30Z — NixosHostDeployFailed Cascade + External-DNS Flag Parsing Crash
Executive Summary
A multi-site infrastructure incident began at 19:22Z with a network partition to the sto1 (Stockholm) site node cc-se-sto-core-01, followed by a synchronized batch of 3-4 NixosHostDeployFailed events at 19:30Z across geographically diverse hosts (cc-fr-lau-store-01, cc-fi-hel-k3s-02, cc-fi-hel-core-01, cc-se-sto-core-01), and a separate external-dns HelmRelease crashloop triggered by a flag parsing regression. By 19:40Z, 6+ Kubernetes deployments were reporting unavailable replicas. Total active alert count: 37.
Status: Still active at 20:15Z — Nixos hosts unresolved, external-dns HelmRelease in failed state ("cannot remediate failed release"), node cc-se-sto-core-01 still NotReady, Grafana and repowise deployments down.
Cascade Timeline
Real Incidents (Action Required)
1. Node Partition / Host Failure: cc-se-sto-core-01 (sto1)
Severity: Critical
Evidence:
Assessment: cc-se-sto-core-01 unreachable across all source sites — suggests hardware failure, network equipment failure at sto1, or kernel/oops event.
2. NixosHostDeployFailed — Synchronized Batch at 19:30Z
Severity: High
Hosts affected:
Assessment: Simultaneous failure across 3 geographic sites suggests systemic issue — deploy-rs operator bug, problematic NixOS configuration change deployed simultaneously, or shared dependency failure.
3. External-DNS Crashloop — Flag Parsing Crash
Severity: High
Evidence:
Assessment: Broken flag configuration in recent commit "Fix ExternalDNS Traefik source flags". Helm chart upgrade fails because container crashes on startup with unrecognized command-line flag.
4. Grafana Deployment Down
Severity: Medium
Evidence: KubernetesDeploymentReplicasUnavailable for monitoring/grafana
Chronic Noise (No Intervention Needed)
: > "$work" blocked by readOnlyRootFilesystem. Auto-resolves ~40m.
Root Cause Analysis
Primary: Network partition or hardware failure at cc-se-sto-core-01 (sto1) starting ~19:22Z, confirmed by 4 independent Cilium canary probes.
Secondary: Synchronized NixosHostDeployFailed across 4 hosts at 19:30Z — partially explained by sto1 partition (cc-se-sto-core-01 unreachable), but other 3 hosts (lau1, hel1×2) suggest systemic issue (operator bug, bad NixOS change, or shared dependency outage).
Tertiary: External-dns flag parsing crash — separate root cause from broken "Fix ExternalDNS Traefik source flags" GitOps commit.
Impact
Recommended Actions
Operational Notes