diff --git a/docs/incidents/2026-05-17-traefik-ip-collision.md b/docs/incidents/2026-05-17-traefik-ip-collision.md
new file mode 100644
index 0000000..f479daf
--- /dev/null
+++ b/docs/incidents/2026-05-17-traefik-ip-collision.md
@@ -0,0 +1,68 @@
+# Incident: Traefik IP Collision on dockerproxy Network
+
+**Date:** 2026-05-17 (root cause 04:02–04:39 UTC)
+**Severity:** P1, full reverse-proxy outage
+**Status:** Resolved
+**Affected:** all `*.xtrm-lab.org` services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)
+
+---
+
+## Symptoms
+
+- `traefik` container stuck in `Created` state, never reached `Running`.
+- `docker inspect traefik` reported:
+  - `State.Status: created`
+  - `State.Error: failed to set up container networking: Address already in use`
+  - `State.ExitCode: 128`
+- All `*.xtrm-lab.org` hostnames unreachable through the proxy.
+- `traefik-manager` sidecar remained healthy but isolated.
+
+---
+
+## Root Cause
+
+Static-IP collision on the `dockerproxy` Docker bridge (`172.18.0.0/16`).
+
+Timeline (UTC):
+- **04:02:40**: `ewa-apps` (Dockge stack at `/mnt/user/appdata/dockge/stacks/ewa-apps`) restarted. Its compose joined `dockerproxy` without `ipv4_address`, so Docker handed it the lowest free IP: **172.18.0.3** (Traefik's reserved IP, free only because Traefik was momentarily down).
+- **04:39:09**: `traefik` was recreated and requested its IPAMConfig-reserved `172.18.0.3`. The address was already taken, so the container was left in `Created`.
+
+`dockerproxy` is a user-defined bridge where critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.) but several stacks, including `ewa-apps`, had no reservation. Whichever container starts first wins the IP race.
+
+---
+
+## Resolution
+
+1. `docker stop ewa-apps` (released .3)
+2. `docker start traefik` (claimed .3 via IPAMConfig)
+3. `docker start ewa-apps` (got .4)
+4. Logs immediately showed Authentik/Traefik routing chains resuming.
+
+---
+
+## Preventive Fix
+
+Pinned static IP for `ewa-apps` in its Dockge compose:
+
+```yaml
+networks:
+  dockerproxy:
+    ipv4_address: 172.18.0.70
+```
+
+Backup of original at `/mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17`.
+
+Final state after `docker compose up -d`:
+
+| Container | IP          |
+|-----------|-------------|
+| traefik   | 172.18.0.3  |
+| ewa-apps  | 172.18.0.70 |
+
+---
+
+## Follow-up
+
+- Audit remaining Dockge stacks on `dockerproxy` for missing `ipv4_address` reservations.
+- Add an Uptime Kuma docker-state monitor for `traefik` (HTTP probes route *through* Traefik and fail uselessly when Traefik itself is the outage).
+- Two Traefik-IP incidents in 12 days (also 2026-05-05 vaultwarden misroute); consider a deliberate redesign of the dockerproxy IP map.
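
The first follow-up item (auditing `dockerproxy` for containers without a reservation) could also be checked against running containers rather than by reading each compose file. The sketch below is a minimal Python example and not part of this change. It assumes the input is the JSON array printed by `docker inspect`, and that a container attached without `ipv4_address` reports an empty or null `IPAMConfig` for that network, the same field the report cites for Traefik. The function name `unpinned_containers` and the sample payload are hypothetical, modeled on the IPs in the report.

```python
import json

NETWORK = "dockerproxy"  # network name taken from the incident report


def unpinned_containers(inspect_json: str, network: str = NETWORK) -> list[tuple[str, str]]:
    """Return (name, current_ip) for containers on `network` with no static IPv4 reservation.

    `inspect_json` is assumed to be the JSON array emitted by `docker inspect`.
    """
    findings = []
    for container in json.loads(inspect_json):
        networks = container.get("NetworkSettings", {}).get("Networks") or {}
        endpoint = networks.get(network)
        if endpoint is None:
            continue  # container is not attached to the audited network
        ipam = endpoint.get("IPAMConfig") or {}
        if not ipam.get("IPv4Address"):
            # Docker reports container names with a leading slash.
            name = container.get("Name", "").lstrip("/")
            findings.append((name, endpoint.get("IPAddress", "")))
    return sorted(findings)


# Illustrative payload only: two containers shaped like the post-incident state
# before the preventive fix (traefik pinned, ewa-apps dynamically assigned).
sample = json.dumps([
    {"Name": "/traefik", "NetworkSettings": {"Networks": {"dockerproxy": {
        "IPAMConfig": {"IPv4Address": "172.18.0.3"}, "IPAddress": "172.18.0.3"}}}},
    {"Name": "/ewa-apps", "NetworkSettings": {"Networks": {"dockerproxy": {
        "IPAMConfig": None, "IPAddress": "172.18.0.4"}}}},
])
print(unpinned_containers(sample))
```

One way to feed it real data would be to capture the output of `docker inspect $(docker ps -aq)` and pass that string to the function. Any container it lists is a candidate for the same `ipv4_address` pin applied to `ewa-apps`.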