incident: traefik IP collision on dockerproxy (2026-05-17)
# Incident: Traefik IP Collision on dockerproxy Network

**Date:** 2026-05-17 (root cause 04:02–04:39 UTC)
**Severity:** P1 (full reverse-proxy outage)
**Status:** Resolved
**Affected:** all `*.xtrm-lab.org` services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)

---

## Symptoms

- `traefik` container stuck in `Created` state, never reached `Running`.
- `docker inspect traefik` reported:
  - `State.Status: created`
  - `State.Error: failed to set up container networking: Address already in use`
  - `State.ExitCode: 128`
- All `*.xtrm-lab.org` hostnames unreachable through the proxy.
- `traefik-manager` sidecar remained healthy but isolated.
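The `docker inspect` fields listed above are enough to recognise this failure mode mechanically. What follows is a minimal sketch, assuming the JSON printed by `docker inspect traefik` has already been parsed into a Python dict. The function name `is_stuck_on_address_conflict` and the sample record are hypothetical, and matching on the error string is an illustrative heuristic, not a stable Docker interface.

```python
def is_stuck_on_address_conflict(record: dict) -> bool:
    """Return True if an inspect record matches the symptoms seen in this incident."""
    state = record.get("State", {})
    status = state.get("Status", "")
    error = state.get("Error", "")
    # Created but never started, with a networking error that mentions an
    # address conflict: the pattern observed on traefik.
    return status == "created" and "address already in use" in error.lower()


# Sample record built from the values reported above (all other fields omitted).
sample = {
    "State": {
        "Status": "created",
        "Error": "failed to set up container networking: Address already in use",
        "ExitCode": 128,
    }
}

print(is_stuck_on_address_conflict(sample))  # True
```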

---

## Root Cause

Static-IP collision on the `dockerproxy` Docker bridge (`172.18.0.0/16`).

Timeline (UTC):

- **04:02:40**: `ewa-apps` (Dockge stack at `/mnt/user/appdata/dockge/stacks/ewa-apps`) restarted. Its compose file joins `dockerproxy` without an `ipv4_address`, so Docker assigned it the lowest free IP, **172.18.0.3**. That address is reserved for Traefik, but Traefik was momentarily down and was not holding it.
- **04:39:09**: `traefik` was recreated and requested its IPAMConfig-reserved `172.18.0.3`. The address was already taken, so the container was left in `Created`.

`dockerproxy` is a user-defined bridge on which critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), but several stacks, such as `ewa-apps`, had no reservation. Whichever container starts first wins the IP race.
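The race can be reproduced with a toy model. The sketch below is a deliberate simplification of the behaviour described above (a dynamic attach takes the lowest free address, a static attach fails if the address is held); it is not Docker's actual allocator. The class `BridgeModel`, the exception `AddressInUse`, the gateway at `172.18.0.1`, and the unnamed container holding `172.18.0.2` are all assumptions made for illustration.

```python
import ipaddress
from typing import Optional


class AddressInUse(Exception):
    """Raised when a requested static IP is already held by another container."""


class BridgeModel:
    """Toy model of one bridge network's address pool (not Docker's real allocator)."""

    def __init__(self, subnet: str, gateway: str):
        self.subnet = ipaddress.ip_network(subnet)
        self.leases = {ipaddress.ip_address(gateway): "gateway"}

    def attach(self, name: str, static_ip: Optional[str] = None) -> str:
        if static_ip is not None:
            ip = ipaddress.ip_address(static_ip)
            if ip not in self.subnet:
                raise ValueError(f"{ip} is outside {self.subnet}")
            if ip in self.leases:
                raise AddressInUse(f"{ip} already held by {self.leases[ip]}")
        else:
            # Dynamic attach: take the lowest host address nobody holds.
            ip = next(h for h in self.subnet.hosts() if h not in self.leases)
        self.leases[ip] = name
        return str(ip)


net = BridgeModel("172.18.0.0/16", gateway="172.18.0.1")
net.attach("other-container", "172.18.0.2")  # hypothetical holder of .2

# 04:02:40: ewa-apps restarts with no reservation while traefik is down.
print(net.attach("ewa-apps"))  # 172.18.0.3

# 04:39:09: traefik asks for its reserved address and is refused.
try:
    net.attach("traefik", "172.18.0.3")
except AddressInUse as exc:
    print(exc)  # 172.18.0.3 already held by ewa-apps
```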

---

## Resolution

1. `docker stop ewa-apps` (released `172.18.0.3`)
2. `docker start traefik` (claimed `172.18.0.3` via IPAMConfig)
3. `docker start ewa-apps` (was assigned `172.18.0.4`)
4. Logs immediately showed Authentik/Traefik routing chains resuming.

---

## Preventive Fix

Pinned a static IP for `ewa-apps` in its Dockge compose file:

```yaml
# Service-level entry, nested under the service definition
networks:
  dockerproxy:
    ipv4_address: 172.18.0.70
```

Backup of the original at `/mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17`.

Final state after `docker compose up -d`:

| Container | IP |
|-----------|-------------|
| traefik | 172.18.0.3 |
| ewa-apps | 172.18.0.70 |

---

## Follow-up

- Audit remaining Dockge stacks on `dockerproxy` for missing `ipv4_address` reservations.
- Add an Uptime Kuma docker-state monitor for `traefik` (HTTP probes route *through* Traefik, so they all fail together and give no useful signal when Traefik itself is down).
- This is the second Traefik-IP incident in 12 days (the first was the 2026-05-05 vaultwarden misroute); consider a deliberate redesign of the `dockerproxy` IP map.
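The audit in the first item could start from running containers rather than from the compose files. The sketch below assumes the JSON array printed by `docker inspect $(docker ps -aq)` has been parsed into a list of dicts. It reports containers attached to `dockerproxy` with no `IPAMConfig.IPv4Address`, plus any address reserved by more than one container. The function name `audit_network` and the sample records are hypothetical.

```python
from collections import defaultdict


def audit_network(records, network="dockerproxy"):
    """Return (unpinned container names, {ip: [names]} for duplicated reservations)."""
    unpinned = []
    reserved = defaultdict(list)
    for rec in records:
        net = rec.get("NetworkSettings", {}).get("Networks", {}).get(network)
        if net is None:
            continue  # container is not attached to this network
        name = rec.get("Name", "").lstrip("/")  # inspect prefixes names with "/"
        ipam = net.get("IPAMConfig") or {}  # null when no static IP was requested
        static_ip = ipam.get("IPv4Address")
        if static_ip:
            reserved[static_ip].append(name)
        else:
            unpinned.append(name)
    duplicates = {ip: names for ip, names in reserved.items() if len(names) > 1}
    return unpinned, duplicates


# Sample records modelled on the pre-fix state (all other fields omitted).
records = [
    {"Name": "/traefik", "NetworkSettings": {"Networks": {
        "dockerproxy": {"IPAMConfig": {"IPv4Address": "172.18.0.3"}}}}},
    {"Name": "/ewa-apps", "NetworkSettings": {"Networks": {
        "dockerproxy": {"IPAMConfig": None}}}},
]

print(audit_network(records))  # (['ewa-apps'], {})
```

Any name in the first list is a candidate for the same race that took down Traefik. Stopped stacks that are not currently created as containers would not appear here, so the compose files still need a manual pass.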