Tick ~425 — continuity-portal CrashLoopBackOff from DB service unreachability (low-priority) #81

New issue

Open

opened 2026-07-02 06:44:29 +00:00 by mhugo · 0 comments

mhugo commented

2026-07-02 06:44:29 +00:00

Owner

Triage notes — Tick ~425 (2026-07-02T06:25Z)

Alert: KubernetesPodCrashLooping{continuity-system/continuity-portal-86d9b56d54-xzvmp} priority=low, severity=warning, startsAt 06:24:00Z.

Evidence

Current alert pod xzvmp: 0 log rows in 30m (fail-fast). Kubelet exponential-backoff restart cycle.
Previous pod p9fd4 (cc-de-fsn-core-01, fsn1): every-10s reconciler errors failover reconciler tick: dial tcp 10.43.47.69:5432: connect: no route to host for 3+ min (06:13:48–06:16:48) before termination.
Single successful query at 06:13:08: sql Scan error on column index 3, name "hostname": converting NULL to string is unsupported — DB was reachable at 06:13:08, then went away by 06:13:48.
Sibling gateway pod continuity-gateway-769c4c5bdd-rgxvc (hel1) alive, logs [HEALTH] Portal backend unreachable: dial tcp 10.43.143.213:80: connect: no route to host every 10s since 06:09:10Z.
Both 10.43.47.69:5432 and 10.43.143.213:80 return no route to host (not connection refused) → Cilium datapath has no endpoint.
Alertmanager namespace=continuity-system returns 3 alerts all starting 06:24:00Z:
- KubernetesPodCrashLooping{portal} (this alert)
- KubernetesDeploymentReplicasUnavailable{continuity-portal} (same time)
- KubernetesDeploymentReplicasUnavailable{continuity-gateway} (same time)

Classification (HIGH confidence)

Fail-fast CrashLoopBackOff caused by downstream DB service unreachability (10.43.47.69:5432).

Likely cause

A sibling DB pod (CNPG cluster or standalone Postgres in continuity-system) terminated around 06:13:30Z and has not been rescheduled. EndpointSlice for the DB Service is empty or stale, causing Cilium datapath to return no route to host on the ClusterIP. The portal code's startup-time connection check exits the process on failure (fail-fast). The previous pod p9fd4 was running for 3+ min with periodic reconciler errors (it survived because the reconciler logs but doesn't exit), then kubelet killed it (liveness probe fail) and replaced with xzvmp which fails fast at startup.

Cascade

continuity-portal deployment missing replicas (single replica, 0/N)
continuity-gateway deployment missing replicas (at least one pod alive logging probes, but deployment below desired)
No impact on other namespaces / no data loss / no security risk

Recommendation

No immediate operator action — alert is low priority, single replica outage, no data loss risk. Recommended durable fixes (in priority order):

Investigate sibling DB pod termination — likely CNPG cluster scale-down residue (cf. PITFALL-127/131) or evicted Postgres pod. Once K8s MCP recovers: kubectl -n continuity-system get pods, kubectl -n continuity-system get svc, kubectl -n continuity-system get endpoints for the DB service.
Portal code durability: portal should not exit on startup DB connection failure; should retry with exponential backoff. Init container that waits for DB readiness would prevent the fail-fast pattern entirely.
HA: increase continuity-portal replicas from 1 to 2+ for graceful failover.

Infrastructure status during this tick

K8s MCP: 32+ consecutive failures, Session terminated (~22s cooldown)
Observability MCP: ClickHouse MEMORY_LIMIT_EXCEEDED on broad queries (known pitfall)
memory_store / memory_core: "No route to host" (memory backend unreachable)
/tmp: missing (all file tools blocked at framework layer)
write_file / terminal / execute_code: blocked (No usable temp directory)
This issue is the Tier-2 fallback record per the skill's "All tools blocked" pitfall.

## Triage notes — Tick ~425 (2026-07-02T06:25Z) **Alert**: `KubernetesPodCrashLooping{continuity-system/continuity-portal-86d9b56d54-xzvmp}` priority=low, severity=warning, startsAt 06:24:00Z. ### Evidence - **Current alert pod `xzvmp`**: 0 log rows in 30m (fail-fast). Kubelet exponential-backoff restart cycle. - **Previous pod `p9fd4`** (cc-de-fsn-core-01, fsn1): every-10s reconciler errors `failover reconciler tick: dial tcp 10.43.47.69:5432: connect: no route to host` for 3+ min (06:13:48–06:16:48) before termination. - **Single successful query** at 06:13:08: `sql Scan error on column index 3, name "hostname": converting NULL to string is unsupported` — DB was reachable at 06:13:08, then went away by 06:13:48. - **Sibling gateway pod `continuity-gateway-769c4c5bdd-rgxvc`** (hel1) alive, logs `[HEALTH] Portal backend unreachable: dial tcp 10.43.143.213:80: connect: no route to host` every 10s since 06:09:10Z. - Both `10.43.47.69:5432` and `10.43.143.213:80` return `no route to host` (not `connection refused`) → Cilium datapath has no endpoint. - Alertmanager `namespace=continuity-system` returns 3 alerts all starting 06:24:00Z: - `KubernetesPodCrashLooping{portal}` (this alert) - `KubernetesDeploymentReplicasUnavailable{continuity-portal}` (same time) - `KubernetesDeploymentReplicasUnavailable{continuity-gateway}` (same time) ### Classification (HIGH confidence) **Fail-fast CrashLoopBackOff caused by downstream DB service unreachability** (`10.43.47.69:5432`). ### Likely cause A sibling DB pod (CNPG cluster or standalone Postgres in `continuity-system`) terminated around 06:13:30Z and has not been rescheduled. EndpointSlice for the DB Service is empty or stale, causing Cilium datapath to return `no route to host` on the ClusterIP. The portal code's startup-time connection check exits the process on failure (fail-fast). The previous pod `p9fd4` was running for 3+ min with periodic reconciler errors (it survived because the reconciler logs but doesn't exit), then kubelet killed it (liveness probe fail) and replaced with `xzvmp` which fails fast at startup. ### Cascade - `continuity-portal` deployment missing replicas (single replica, 0/N) - `continuity-gateway` deployment missing replicas (at least one pod alive logging probes, but deployment below desired) - No impact on other namespaces / no data loss / no security risk ### Recommendation **No immediate operator action** — alert is low priority, single replica outage, no data loss risk. Recommended durable fixes (in priority order): 1. **Investigate sibling DB pod termination** — likely CNPG cluster scale-down residue (cf. PITFALL-127/131) or evicted Postgres pod. Once K8s MCP recovers: `kubectl -n continuity-system get pods`, `kubectl -n continuity-system get svc`, `kubectl -n continuity-system get endpoints` for the DB service. 2. **Portal code durability**: portal should not exit on startup DB connection failure; should retry with exponential backoff. Init container that waits for DB readiness would prevent the fail-fast pattern entirely. 3. **HA**: increase `continuity-portal` replicas from 1 to 2+ for graceful failover. ### Infrastructure status during this tick - K8s MCP: 32+ consecutive failures, Session terminated (~22s cooldown) - Observability MCP: ClickHouse MEMORY_LIMIT_EXCEEDED on broad queries (known pitfall) - memory_store / memory_core: "No route to host" (memory backend unreachable) - /tmp: missing (all file tools blocked at framework layer) - write_file / terminal / execute_code: blocked (No usable temp directory) - This issue is the Tier-2 fallback record per the skill's "All tools blocked" pitfall.

No labels

No milestone

No project

No assignees

1 participant

Notifications

Due date

The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set

Reference

singularity/singularity-forge#81

No description provided.

Rows
Columns