infrastructure/docs/incidents/2026-05-17-traefik-ip-collision.md
jazzymc dd1c15cf6b dockerproxy: redesign IPAM with static block + dynamic /25 pool
Recreated dockerproxy network with --ip-range 172.18.0.128/25 so Docker
auto-allocations are isolated from the .2-.127 static reservation block.
Eliminates IP-collision class that caused the 2026-05-17 Traefik outage.

Adds 13-DOCKERPROXY-NETWORK.md as the canonical reference for the
network spec, recreate command, and current IP assignments.
2026-05-17 08:36:46 +03:00

# Incident: Traefik IP Collision on dockerproxy Network
**Date:** 2026-05-17 (root cause 04:02-04:39 UTC)
**Severity:** P1 — full reverse-proxy outage
**Status:** Resolved
**Affected:** all `*.xtrm-lab.org` services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)
---
## Symptoms
- `traefik` container stuck in `Created` state, never reached `Running`.
- `docker inspect traefik` reported:
  - `State.Status: created`
  - `State.Error: failed to set up container networking: Address already in use`
  - `State.ExitCode: 128`
- All `*.xtrm-lab.org` hostnames unreachable through the proxy.
- `traefik-manager` sidecar remained healthy but isolated.
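The `docker inspect` fields listed above are enough to detect this failure mode automatically. The following is a minimal Python sketch, not part of the original response: the function names `diagnose` and `inspect_container` are hypothetical, and it assumes the `State` object has the same shape as the values quoted in the symptoms.

```python
import json
import subprocess
from typing import Optional


def diagnose(entry: dict) -> Optional[str]:
    """Return a short description if a container never left 'created'
    because of a startup error, otherwise None.

    `entry` is one element of the JSON array printed by `docker inspect`.
    """
    state = entry.get("State", {})
    if state.get("Status") == "created" and state.get("Error"):
        name = entry.get("Name", "?").lstrip("/")  # inspect prefixes names with "/"
        return f"{name}: stuck in 'created' (exit {state.get('ExitCode')}): {state['Error']}"
    return None


def inspect_container(name: str) -> dict:
    """Run `docker inspect <name>` and return the parsed first entry."""
    out = subprocess.run(
        ["docker", "inspect", name],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)[0]

# Usage on the Docker host (hypothetical):
#   print(diagnose(inspect_container("traefik")))
```

A check like this looks only at container state, so it keeps working when the proxy itself is what failed.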
---
## Root Cause
Static-IP collision on the `dockerproxy` Docker bridge (`172.18.0.0/16`).
Timeline (UTC):
- **04:02:40** — `ewa-apps` (Dockge stack at `/mnt/user/appdata/dockge/stacks/ewa-apps`) restarted. Its compose joined `dockerproxy` without `ipv4_address`, so Docker handed it the lowest free IP: **172.18.0.3** (Traefik's reserved IP, but Traefik was momentarily down).
- **04:39:09** — `traefik` was recreated and requested its IPAMConfig-reserved `172.18.0.3`. Already taken — container left in `Created`.
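`dockerproxy` is a user-defined bridge where critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), but some stacks, such as `ewa-apps`, had no reservation. Auto-assigned and reserved addresses came from the same address space, so a reserved IP was only safe while its owner was running: whichever container started first took the address.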
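The race can be reproduced on paper with a small Python model. This is a deliberately simplified sketch of the "lowest free IP" behaviour described in the timeline, not Docker's actual allocator. The helper `lowest_free` is hypothetical, and the set of addresses in use is an assumption built from the assignments mentioned in this report.

```python
import ipaddress


def lowest_free(subnet: str, gateway: str, in_use: set) -> str:
    """Return the lowest host address in `subnet` that is neither the
    gateway nor already held by a running container (simplified model)."""
    taken = {ipaddress.ip_address(gateway)}
    taken |= {ipaddress.ip_address(a) for a in in_use}
    for host in ipaddress.ip_network(subnet).hosts():
        if host not in taken:
            return str(host)
    raise RuntimeError("address pool exhausted")


# 04:02:40: Traefik is down, so its reserved .3 is not held by anyone.
running = {"172.18.0.2", "172.18.0.10", "172.18.0.15"}  # traefik-manager, postgresql17, vaultwarden
ewa_apps_ip = lowest_free("172.18.0.0/16", "172.18.0.1", running)
running.add(ewa_apps_ip)

# 04:39:09: Traefik asks for its reserved address and finds it occupied.
traefik_can_start = "172.18.0.3" not in running
print(ewa_apps_ip, traefik_can_start)  # 172.18.0.3 False
```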
---
## Resolution
1. `docker stop ewa-apps` (released .3)
2. `docker start traefik` (claimed .3 via IPAMConfig)
3. `docker start ewa-apps` (got .4)
4. Logs immediately showed Authentik/Traefik routing chains resuming.
---
## Preventive Fix
Pinned static IP for `ewa-apps` in its Dockge compose:
```yaml
networks:
  dockerproxy:
    ipv4_address: 172.18.0.70
```
Backup of original at `/mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17`.
Final state after `docker compose up -d`:
| Container | IP |
|-----------|-----|
| traefik | 172.18.0.3 |
| ewa-apps | 172.18.0.70 |
---
## Structural Fix — IPAM Redesign (~05:15 UTC same day)
Compose-level pinning fixes one container at a time. To eliminate this whole class of failure, the `dockerproxy` network was recreated with a dedicated dynamic IP range. Because a Docker network's IPAM configuration cannot be changed after creation, the network had to be torn down and rebuilt with all 32 containers offline (~3 min outage).
New IPAM:
```
Subnet: 172.18.0.0/16
Gateway: 172.18.0.1
IPRange: 172.18.0.128/25 ← Docker only auto-assigns from .128-.255
```
- `.2`-`.127` is now a **static-only reservation block**
- `.128`-`.255` is the dynamic pool
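The boundaries of the two blocks follow directly from the CIDR arithmetic, which can be checked with Python's standard `ipaddress` module. This is a small verification sketch; the helper name `is_dynamic` is hypothetical.

```python
import ipaddress

subnet = ipaddress.ip_network("172.18.0.0/16")
dynamic = ipaddress.ip_network("172.18.0.128/25")

# The --ip-range must sit inside the network's subnet.
assert dynamic.subnet_of(subnet)

first, last = dynamic[0], dynamic[-1]
print(first, last, dynamic.num_addresses)  # 172.18.0.128 172.18.0.255 128


def is_dynamic(ip: str) -> bool:
    """True if `ip` falls inside the auto-allocation pool."""
    return ipaddress.ip_address(ip) in dynamic


print(is_dynamic("172.18.0.70"))   # False: static block
print(is_dynamic("172.18.0.128"))  # True: first dynamic slot
```

Because the subnet is a /16, `172.18.0.255` is an ordinary host address here rather than a broadcast address, so the pool spans 128 addresses.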
Procedure (scripted; full state snapshot at `/root/dockerproxy-recreate-2026-05-17/` on Unraid):
1. Capture container→static-IP map from `docker inspect`
2. `docker stop` all 32 in parallel
3. `docker network disconnect dockerproxy <each>`
4. `docker network rm dockerproxy`
5. `docker network create --driver bridge --subnet 172.18.0.0/16 --gateway 172.18.0.1 --ip-range 172.18.0.128/25 dockerproxy`
6. `docker network connect --ip <preserved-static> dockerproxy <each>` (no `--ip` for the one dynamic container, `traefik-manager`)
7. Start in dependency order: dockersocket → postgresql17 + Redis → traefik + traefik-manager → authentik + authentik-worker → rest
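Steps 1 and 6 are the ones most worth scripting, because a mistake there silently changes an address. The Python sketch below shows one way to do it and is not the script actually used during the incident. The function names are hypothetical, and it assumes each `docker inspect` entry carries the reservation under `NetworkSettings.Networks.<network>.IPAMConfig.IPv4Address`.

```python
import shlex


def static_ip_map(inspect_entries: list, network: str = "dockerproxy") -> dict:
    """Step 1: map container name -> reserved IPv4 address, or None when
    the container has no reservation on `network`."""
    result = {}
    for entry in inspect_entries:
        net = entry["NetworkSettings"]["Networks"].get(network)
        if net is None:
            continue  # container is not attached to this network
        ipam = net.get("IPAMConfig") or {}  # null when no static config exists
        result[entry["Name"].lstrip("/")] = ipam.get("IPv4Address") or None
    return result


def connect_commands(ip_map: dict, network: str = "dockerproxy") -> list:
    """Step 6: build one `docker network connect` command per container,
    passing --ip only where a reservation was captured."""
    commands = []
    for name, ip in sorted(ip_map.items()):
        args = ["docker", "network", "connect"]
        if ip:
            args += ["--ip", ip]
        args += [network, name]
        commands.append(shlex.join(args))
    return commands
```

Generating the commands as text first, rather than running them directly, lets the map be reviewed before the network is removed.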
Final state: all 32 containers up, all static IPs preserved (.3-.70), `traefik-manager` moved from `.2` to `.128` (first dynamic slot).
Reference: `docs/13-DOCKERPROXY-NETWORK.md` documents the network spec, recreate command, and current IP assignments.
---
## Follow-up
- IPAM redesign eliminates the auto-allocation collision class. Future stacks that omit `ipv4_address` land safely in `.128+`.
- The network is created imperatively (it is not defined in any compose file); the recreate command is documented in `docs/13-DOCKERPROXY-NETWORK.md` for disaster recovery.
- Add an Uptime Kuma docker-state monitor for `traefik`. HTTP probes route through Traefik, so they cannot tell a Traefik outage apart from a backend failure.
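The first follow-up item can be spot-checked periodically by auditing where each container actually landed. The following Python sketch is an illustration only: the `audit` function is hypothetical, it takes the parsed JSON from `docker inspect` for the containers of interest, and it assumes that a container with a reservation exposes it under `IPAMConfig.IPv4Address`.

```python
import ipaddress

DYNAMIC_POOL = ipaddress.ip_network("172.18.0.128/25")


def audit(inspect_entries: list, network: str = "dockerproxy") -> list:
    """Report containers whose address is on the wrong side of the
    static/dynamic split. Returns a list of human-readable findings."""
    problems = []
    for entry in inspect_entries:
        net = entry["NetworkSettings"]["Networks"].get(network)
        if not net or not net.get("IPAddress"):
            continue  # not attached to the network, or not running
        name = entry["Name"].lstrip("/")
        ip = ipaddress.ip_address(net["IPAddress"])
        reserved = bool((net.get("IPAMConfig") or {}).get("IPv4Address"))
        if reserved and ip in DYNAMIC_POOL:
            problems.append(f"{name}: static reservation {ip} is inside the dynamic pool")
        elif not reserved and ip not in DYNAMIC_POOL:
            problems.append(f"{name}: auto-assigned {ip} is outside the dynamic pool")
    return problems
```

An empty result means every reserved address sits outside the dynamic pool and every auto-assigned address sits inside it.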