dd1c15cf6b
Recreated dockerproxy network with --ip-range 172.18.0.128/25 so Docker auto-allocations are isolated from the .2-.127 static reservation block. Eliminates IP-collision class that caused the 2026-05-17 Traefik outage. Adds 13-DOCKERPROXY-NETWORK.md as the canonical reference for the network spec, recreate command, and current IP assignments.
# Incident: Traefik IP Collision on dockerproxy Network

**Date:** 2026-05-17 (root cause 04:02–04:39 UTC)
**Severity:** P1 (full reverse-proxy outage)
**Status:** Resolved
**Affected:** all `*.xtrm-lab.org` services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)

---

## Symptoms

- `traefik` container stuck in `Created` state; it never reached `Running`.
- `docker inspect traefik` reported:
  - `State.Status: created`
  - `State.Error: failed to set up container networking: Address already in use`
  - `State.ExitCode: 128`
- All `*.xtrm-lab.org` hostnames unreachable through the proxy.
- `traefik-manager` sidecar remained healthy but isolated.
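The state fields listed above can be read in a single call with a Go template. This is a minimal sketch, assuming a POSIX shell with the Docker CLI on the path; the function name `container_state` is illustrative and does not come from the incident notes.

```shell
# Print status, exit code, and error message of a container, one per line.
# Usage: container_state traefik
container_state() {
  docker inspect --format \
    '{{.State.Status}}{{"\n"}}{{.State.ExitCode}}{{"\n"}}{{.State.Error}}' "$1"
}
```

For a container in the condition described here, the first line would read `created` and the third would carry the networking error.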

---

## Root Cause

Static-IP collision on the `dockerproxy` Docker bridge (`172.18.0.0/16`).

Timeline (UTC):
- **04:02:40**: `ewa-apps` (Dockge stack at `/mnt/user/appdata/dockge/stacks/ewa-apps`) restarted. Its compose file joined `dockerproxy` without an `ipv4_address`, so Docker assigned it the lowest free IP: **172.18.0.3**. That address is reserved for Traefik, but Traefik was momentarily down, so the address was free.
- **04:39:09**: `traefik` was recreated and requested its IPAMConfig-reserved `172.18.0.3`. The address was already taken, so the container was left in `Created`.

`dockerproxy` is a user-defined bridge where critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), but some stacks, including `ewa-apps`, had no reservation. Whichever container starts first wins the IP.

---

## Resolution

1. `docker stop ewa-apps` (released .3)
2. `docker start traefik` (claimed .3 via IPAMConfig)
3. `docker start ewa-apps` (got .4)
4. Logs immediately showed Authentik/Traefik routing chains resuming.
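Step 1 presupposes knowing which container is holding `.3`. One way to find out is to list the addresses Docker reports for the network and match on the IP. The sketch below assumes a POSIX shell with `awk`; the helper names `network_ips` and `ip_holder` are hypothetical.

```shell
# Print "name ip/prefix" for every container currently attached to a network.
# Usage: network_ips dockerproxy
network_ips() {
  docker network inspect "$1" \
    --format '{{range .Containers}}{{.Name}} {{.IPv4Address}}{{"\n"}}{{end}}'
}

# Print the name of the container holding a given IPv4 address, if any.
# Usage: ip_holder dockerproxy 172.18.0.3
ip_holder() {
  network_ips "$1" |
    awk -v ip="$2" '{ split($2, a, "/"); if (a[1] == ip) print $1 }'
}
```

The address is compared as a whole string after the prefix length is stripped, so a query for `172.18.0.3` does not match `172.18.0.30`.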

---

## Preventive Fix

Pinned a static IP for `ewa-apps` in its Dockge compose file:

```yaml
networks:
  dockerproxy:
    ipv4_address: 172.18.0.70
```

Backup of the original at `/mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17`.

Final state after `docker compose up -d`:

| Container | IP |
|-----------|-----|
| traefik | 172.18.0.3 |
| ewa-apps | 172.18.0.70 |
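Pinning `ewa-apps` covers one stack, but other containers on the network may also lack a reservation. The following is a hedged sketch for listing them, assuming a reservation shows up in each container's `IPAMConfig` as it does for Traefik. The names `static_ip` and `unpinned_containers` are illustrative.

```shell
# Print the static IPv4 address reserved for a container on a network,
# or nothing if the container relies on auto-allocation.
# Usage: static_ip dockerproxy traefik
static_ip() {
  docker inspect --format \
    "{{with index .NetworkSettings.Networks \"$1\"}}{{with .IPAMConfig}}{{.IPv4Address}}{{end}}{{end}}" "$2"
}

# List running containers on a network that have no static reservation.
# Usage: unpinned_containers dockerproxy
unpinned_containers() {
  for name in $(docker ps --filter "network=$1" --format '{{.Names}}'); do
    [ -n "$(static_ip "$1" "$name")" ] || echo "$name"
  done
}
```

Only running containers are considered, so a stopped stack without a reservation would not appear in the output.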

---

## Structural Fix: IPAM Redesign (~05:15 UTC same day)

Compose-level pinning fixes one container at a time. To eliminate this whole class of failure, the `dockerproxy` network was recreated with a dedicated dynamic IP range. Because the IPAM configuration of an existing Docker network cannot be changed, the network had to be torn down and rebuilt with all 32 containers offline (~3 min outage).

New IPAM:

```
Subnet:  172.18.0.0/16
Gateway: 172.18.0.1
IPRange: 172.18.0.128/25   ← Docker only auto-assigns from .128–.255
```
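The `IPRange` value can be sanity-checked with plain shell arithmetic: a `/25` leaves 7 host bits, so the block that starts at `.128` spans 128 addresses, `.128` through `.255`. This is a minimal sketch, assuming a shell with 64-bit arithmetic (bash, or dash on a 64-bit host); all helper names are illustrative.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Convert an integer back to dotted-quad notation.
int_to_ip() {
  echo "$(( $1 >> 24 & 255 )).$(( $1 >> 16 & 255 )).$(( $1 >> 8 & 255 )).$(( $1 & 255 ))"
}

# Print the first and last address of a CIDR block.
# Usage: cidr_range 172.18.0.128/25
cidr_range() (
  bits=${1#*/}
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  first=$(( $(ip_to_int "${1%/*}") & mask ))
  last=$(( first | (mask ^ 0xFFFFFFFF) ))
  echo "$(int_to_ip "$first") $(int_to_ip "$last")"
)

# Succeed if an address lies inside a CIDR block.
# Usage: in_cidr 172.18.0.70 172.18.0.128/25
in_cidr() (
  addr=$(ip_to_int "$1")
  set -- $(cidr_range "$2")
  [ "$addr" -ge "$(ip_to_int "$1")" ] && [ "$addr" -le "$(ip_to_int "$2")" ]
)
```

With these helpers, `cidr_range 172.18.0.128/25` prints `172.18.0.128 172.18.0.255`, and `in_cidr 172.18.0.70 172.18.0.128/25` fails, which is the desired result for a statically assigned address.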

- `.2`–`.127` is now a **static-only reservation block**
- `.128`–`.255` is the dynamic pool

Procedure (scripted; full state snapshot at `/root/dockerproxy-recreate-2026-05-17/` on Unraid):

1. Capture the container→static-IP map from `docker inspect`
2. `docker stop` all 32 containers in parallel
3. `docker network disconnect dockerproxy <each>`
4. `docker network rm dockerproxy`
5. `docker network create --driver bridge --subnet 172.18.0.0/16 --gateway 172.18.0.1 --ip-range 172.18.0.128/25 dockerproxy`
6. `docker network connect --ip <preserved-static> dockerproxy <each>` (no `--ip` for the one dynamic container, `traefik-manager`)
7. Start in dependency order: dockersocket → postgresql17 + Redis → traefik + traefik-manager → authentik + authentik-worker → rest

Final state: all 32 containers up, all static IPs preserved (.3–.70), and `traefik-manager` moved from `.2` to `.128` (the first dynamic slot).

Reference: `docs/13-DOCKERPROXY-NETWORK.md` documents the network spec, recreate command, and current IP assignments.
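Step 6 of the procedure is the error-prone one, because each container must get back exactly the address it had. Below is a sketch of generating the `docker network connect` commands from the map captured in step 1. It assumes the map is stored as one `name ip` pair per line, with the IP column left empty for dynamic containers. That file format and the function name `reconnect_commands` are assumptions; the actual script is not reproduced in this document. The commands are printed rather than executed so they can be reviewed first.

```shell
# Read "name [ip]" lines on stdin and print one `docker network connect`
# command per container. Containers without an IP get no --ip flag and
# therefore draw from the network's dynamic range.
# Usage: reconnect_commands dockerproxy < ip-map.txt
reconnect_commands() {
  while read -r name ip; do
    [ -n "$name" ] || continue   # skip blank lines
    if [ -n "$ip" ]; then
      echo "docker network connect --ip $ip $1 $name"
    else
      echo "docker network connect $1 $name"
    fi
  done
}
```

Piping the reviewed output to `sh` would then perform the reconnects in the order the map lists them.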

## Follow-up

- The IPAM redesign eliminates the auto-allocation collision class. Future stacks that omit `ipv4_address` land safely in `.128+`.
- The network is created imperatively (it is not declared in any compose file); the recreate command is documented in `docs/13-DOCKERPROXY-NETWORK.md` for disaster recovery.
- Add an Uptime Kuma docker-state monitor for `traefik` (HTTP probes route through Traefik, so they all fail uninformatively when Traefik itself is down).
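A state check that does not depend on HTTP routing can ask the Docker daemon directly. The sketch below shows one hypothetical shape for such a probe; how it would be wired into Uptime Kuma is left open, and `container_running` is an illustrative name.

```shell
# Succeed only if the named container exists and is running.
# Usage: container_running traefik || echo "traefik is down"
container_running() {
  [ "$(docker inspect --format '{{.State.Running}}' "$1" 2>/dev/null)" = "true" ]
}
```

A container stuck in `Created`, as Traefik was during this incident, reports `false` and so fails the check.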