Tick ~425 — continuity-portal CrashLoopBackOff from DB service unreachability (low-priority) #81
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Triage notes — Tick ~425 (2026-07-02T06:25Z)
Alert:
KubernetesPodCrashLooping{continuity-system/continuity-portal-86d9b56d54-xzvmp}priority=low, severity=warning, startsAt 06:24:00Z.Evidence
xzvmp: 0 log rows in 30m (fail-fast). Kubelet exponential-backoff restart cycle.p9fd4(cc-de-fsn-core-01, fsn1): every-10s reconciler errorsfailover reconciler tick: dial tcp 10.43.47.69:5432: connect: no route to hostfor 3+ min (06:13:48–06:16:48) before termination.sql Scan error on column index 3, name "hostname": converting NULL to string is unsupported— DB was reachable at 06:13:08, then went away by 06:13:48.continuity-gateway-769c4c5bdd-rgxvc(hel1) alive, logs[HEALTH] Portal backend unreachable: dial tcp 10.43.143.213:80: connect: no route to hostevery 10s since 06:09:10Z.10.43.47.69:5432and10.43.143.213:80returnno route to host(notconnection refused) → Cilium datapath has no endpoint.namespace=continuity-systemreturns 3 alerts all starting 06:24:00Z:KubernetesPodCrashLooping{portal}(this alert)KubernetesDeploymentReplicasUnavailable{continuity-portal}(same time)KubernetesDeploymentReplicasUnavailable{continuity-gateway}(same time)Classification (HIGH confidence)
Fail-fast CrashLoopBackOff caused by downstream DB service unreachability (
10.43.47.69:5432).Likely cause
A sibling DB pod (CNPG cluster or standalone Postgres in
continuity-system) terminated around 06:13:30Z and has not been rescheduled. EndpointSlice for the DB Service is empty or stale, causing Cilium datapath to returnno route to hoston the ClusterIP. The portal code's startup-time connection check exits the process on failure (fail-fast). The previous podp9fd4was running for 3+ min with periodic reconciler errors (it survived because the reconciler logs but doesn't exit), then kubelet killed it (liveness probe fail) and replaced withxzvmpwhich fails fast at startup.Cascade
continuity-portaldeployment missing replicas (single replica, 0/N)continuity-gatewaydeployment missing replicas (at least one pod alive logging probes, but deployment below desired)Recommendation
No immediate operator action — alert is low priority, single replica outage, no data loss risk. Recommended durable fixes (in priority order):
kubectl -n continuity-system get pods,kubectl -n continuity-system get svc,kubectl -n continuity-system get endpointsfor the DB service.continuity-portalreplicas from 1 to 2+ for graceful failover.Infrastructure status during this tick