# Incident: Traefik IP Collision on dockerproxy Network
**Date:** 2026-05-17 (root cause 04:02 to 04:39 UTC)
**Severity:** P1 — full reverse-proxy outage
**Status:** Resolved
**Affected:** all `*.xtrm-lab.org` services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)
---
## Symptoms
- `traefik` container stuck in `Created` state, never reached `Running`.
- `docker inspect traefik` reported:
  - `State.Status: created`
  - `State.Error: failed to set up container networking: Address already in use`
  - `State.ExitCode: 128`
- All `*.xtrm-lab.org` hostnames unreachable through the proxy.
- `traefik-manager` sidecar remained healthy but isolated.
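
The stuck state described above can be recognised mechanically from `docker inspect` JSON. The following is a minimal Python sketch, not the tooling used during the incident. The function name `stuck_containers` and the sample payload are hypothetical, and only the `Name` and `State` fields quoted in the symptoms are assumed.

```python
import json


def stuck_containers(inspect_json: str) -> list[tuple[str, str]]:
    """Return (name, error) for containers left in 'created' with a start error."""
    stuck = []
    for container in json.loads(inspect_json):
        state = container.get("State") or {}
        # A container that was created but failed to start keeps Status
        # 'created' and carries a non-empty State.Error.
        if state.get("Status") == "created" and state.get("Error"):
            name = container.get("Name", "").lstrip("/")
            stuck.append((name, state["Error"]))
    return stuck


# Hypothetical sample, shaped like `docker inspect traefik traefik-manager` output.
SAMPLE = json.dumps([
    {
        "Name": "/traefik",
        "State": {
            "Status": "created",
            "Error": "failed to set up container networking: Address already in use",
            "ExitCode": 128,
        },
    },
    {
        "Name": "/traefik-manager",
        "State": {"Status": "running", "Error": "", "ExitCode": 0},
    },
])

for name, error in stuck_containers(SAMPLE):
    print(f"{name}: {error}")
```
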
---
## Root Cause
Static-IP collision on the `dockerproxy` Docker bridge (`172.18.0.0/16`).
Timeline (UTC):
- **04:02:40**: `ewa-apps` (Dockge stack at `/mnt/user/appdata/dockge/stacks/ewa-apps`) restarted. Its compose file joined `dockerproxy` without an `ipv4_address`, so Docker handed it the lowest free IP: **172.18.0.3**. That address is reserved for Traefik, but Traefik was momentarily down and not holding it.
- **04:39:09**: `traefik` was recreated and requested its IPAMConfig-reserved `172.18.0.3`. The address was already taken, so the container was left in `Created`.

`dockerproxy` is a user-defined bridge where critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), but several stacks, such as `ewa-apps`, had no reservation. Whichever container starts first wins the IP race.

---
## Resolution
1. `docker stop ewa-apps` (released .3)
2. `docker start traefik` (claimed .3 via IPAMConfig)
3. `docker start ewa-apps` (got .4)
4. Logs immediately showed Authentik/Traefik routing chains resuming.
---
## Preventive Fix
Pinned static IP for `ewa-apps` in its Dockge compose:
```yaml
networks:
  dockerproxy:
    ipv4_address: 172.18.0.70
```
Backup of original at `/mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17`.
Final state after `docker compose up -d`:
| Container | IP |
|-----------|-----|
| traefik | 172.18.0.3 |
| ewa-apps | 172.18.0.70 |
---
## Follow-up
- Audit remaining Dockge stacks on `dockerproxy` for missing `ipv4_address` reservations.
- Add an Uptime Kuma docker-state monitor for `traefik` (HTTP probes route *through* Traefik and fail uselessly when Traefik itself is the outage).
- Two Traefik-IP incidents in 12 days (also 2026-05-05 vaultwarden misroute) — consider a deliberate redesign of the dockerproxy IP map.
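
The audit in the first follow-up item could start from `docker inspect` output. The sketch below is one possible approach, not an existing script: `audit_network` and the sample data are hypothetical. It relies on `NetworkSettings.Networks.<name>.IPAMConfig` being empty for containers that joined a network without a static address, and it also flags any address reserved by more than one container.

```python
import json


def audit_network(inspect_json: str, network: str) -> dict:
    """List containers on `network` with no static IP, plus duplicate reservations."""
    unpinned = []
    reserved: dict[str, list[str]] = {}  # static ip -> container names
    for container in json.loads(inspect_json):
        name = container.get("Name", "").lstrip("/")
        networks = (container.get("NetworkSettings") or {}).get("Networks") or {}
        endpoint = networks.get(network)
        if endpoint is None:
            continue  # container is not attached to this network
        static_ip = (endpoint.get("IPAMConfig") or {}).get("IPv4Address")
        if static_ip:
            reserved.setdefault(static_ip, []).append(name)
        else:
            unpinned.append(name)
    conflicts = {ip: names for ip, names in reserved.items() if len(names) > 1}
    return {"unpinned": sorted(unpinned), "conflicts": conflicts}


# Hypothetical sample reflecting the state before the preventive fix.
SAMPLE = json.dumps([
    {"Name": "/traefik", "NetworkSettings": {"Networks": {
        "dockerproxy": {"IPAMConfig": {"IPv4Address": "172.18.0.3"}}}}},
    {"Name": "/ewa-apps", "NetworkSettings": {"Networks": {
        "dockerproxy": {"IPAMConfig": None}}}},
    {"Name": "/unrelated", "NetworkSettings": {"Networks": {
        "bridge": {"IPAMConfig": None}}}},
])

print(audit_network(SAMPLE, "dockerproxy"))
```

On the host, the input could come from something like `docker inspect $(docker ps -aq)`; that wiring is left out here so the sketch stays runnable without Docker.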