Recreated dockerproxy network with --ip-range 172.18.0.128/25 so Docker auto-allocations are isolated from the .2-.127 static reservation block. Eliminates IP-collision class that caused the 2026-05-17 Traefik outage. Adds 13-DOCKERPROXY-NETWORK.md as the canonical reference for the network spec, recreate command, and current IP assignments.
Incident: Traefik IP Collision on dockerproxy Network
Date: 2026-05-17 (root cause 04:02–04:39 UTC)
Severity: P1 — full reverse-proxy outage
Status: Resolved
Affected: all *.xtrm-lab.org services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)
Symptoms
- traefik container stuck in Created state, never reached Running. docker inspect traefik reported:
  - State.Status: created
  - State.Error: failed to set up container networking: Address already in use
  - State.ExitCode: 128
- All *.xtrm-lab.org hostnames unreachable through the proxy.
- traefik-manager sidecar remained healthy but isolated.
Root Cause
Static-IP collision on the dockerproxy Docker bridge (172.18.0.0/16).
Timeline (UTC):
- 04:02:40: ewa-apps (Dockge stack at /mnt/user/appdata/dockge/stacks/ewa-apps) restarted. Its compose joined dockerproxy without ipv4_address, so Docker handed it the lowest free IP: 172.18.0.3 (Traefik's reserved IP, but Traefik was momentarily down).
- 04:39:09: traefik was recreated and requested its IPAMConfig-reserved 172.18.0.3. Already taken, so the container was left in Created.
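dockerproxy is a user-defined bridge where critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), but some stacks, including ewa-apps, had no reservation. An unpinned container is auto-assigned the lowest free address, so if it starts while a pinned container is down it can take that container's reserved IP: whichever starts first wins the address.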
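The race can be illustrated with a simplified model of the allocator. The sketch below assumes, as the timeline describes, that Docker hands out the lowest free address in the subnet. `lowest_free_ip` is a hypothetical helper written for this report, not Docker's actual IPAM code.

```python
import ipaddress

def lowest_free_ip(subnet: str, gateway: str, in_use: set) -> str:
    """Return the lowest host address not held by the gateway or a container.

    Simplified model (assumption): real Docker IPAM keeps its own allocation state.
    """
    taken = {ipaddress.ip_address(gateway)}
    taken.update(ipaddress.ip_address(ip) for ip in in_use)
    for host in ipaddress.ip_network(subnet).hosts():
        if host not in taken:
            return str(host)
    raise RuntimeError("no free address in subnet")

# Addresses held while traefik is down:
# traefik-manager (.2), postgresql17 (.10), vaultwarden (.15).
running = {"172.18.0.2", "172.18.0.10", "172.18.0.15"}
ewa_apps_ip = lowest_free_ip("172.18.0.0/16", "172.18.0.1", running)
print(ewa_apps_ip)  # 172.18.0.3, the address traefik has reserved
```

With traefik back on .3, the same call returns 172.18.0.4, which matches the address ewa-apps received during the resolution.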
Resolution
- docker stop ewa-apps (released .3)
- docker start traefik (claimed .3 via IPAMConfig)
- docker start ewa-apps (got .4)
- Logs immediately showed Authentik/Traefik routing chains resuming.
Preventive Fix
Pinned static IP for ewa-apps in its Dockge compose:
networks:
  dockerproxy:
    ipv4_address: 172.18.0.70
Backup of original at /mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17.
Final state after docker compose up -d:
| Container | IP |
|---|---|
| traefik | 172.18.0.3 |
| ewa-apps | 172.18.0.70 |
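Other stacks can be audited for the same gap. The following is a minimal sketch, assuming the compose file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`). `unpinned_services` and the `some-app` service are hypothetical names used only for illustration.

```python
def unpinned_services(compose: dict, network: str = "dockerproxy") -> list:
    """List services that join `network` without a static ipv4_address."""
    unpinned = []
    for name, service in (compose.get("services") or {}).items():
        networks = service.get("networks") or {}
        if isinstance(networks, list):
            # Short syntax (a list of network names) cannot carry a static IP.
            if network in networks:
                unpinned.append(name)
        elif network in networks:
            settings = networks[network] or {}
            if "ipv4_address" not in settings:
                unpinned.append(name)
    return sorted(unpinned)

# Hypothetical parsed compose data: one pinned service, one unpinned.
example = {
    "services": {
        "ewa-apps": {"networks": {"dockerproxy": {"ipv4_address": "172.18.0.70"}}},
        "some-app": {"networks": ["dockerproxy"]},
    }
}
print(unpinned_services(example))  # ['some-app']
```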
Structural Fix — IPAM Redesign (~05:15 UTC same day)
Compose-level pinning fixes one container at a time. To eliminate this whole class of collision, the dockerproxy network was recreated with a dedicated dynamic IP range. Because a Docker network's IPAM settings cannot be changed after creation, the network had to be torn down and rebuilt with all 32 containers offline (~3 min outage).
New IPAM:
Subnet: 172.18.0.0/16
Gateway: 172.18.0.1
IPRange: 172.18.0.128/25 ← Docker only auto-assigns from .128–.255
- .2–.127 is now a static-only reservation block
- .128–.255 is the dynamic pool
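The split can be checked with Python's standard `ipaddress` module. This is a sketch only: `classify` is a hypothetical helper, and it treats every subnet address outside the pool as static, which is broader than the .2–.127 block described above.

```python
import ipaddress

SUBNET = ipaddress.ip_network("172.18.0.0/16")
DYNAMIC_POOL = ipaddress.ip_network("172.18.0.128/25")  # the --ip-range value

def classify(ip: str) -> str:
    """Return 'dynamic' for pool addresses, 'static' for the rest of the subnet."""
    addr = ipaddress.ip_address(ip)
    if addr not in SUBNET:
        raise ValueError(f"{ip} is outside {SUBNET}")
    return "dynamic" if addr in DYNAMIC_POOL else "static"

# The dynamic pool must sit inside the network's subnet.
assert DYNAMIC_POOL.subnet_of(SUBNET)
print(DYNAMIC_POOL[0], DYNAMIC_POOL[-1])  # 172.18.0.128 172.18.0.255
print(classify("172.18.0.3"))    # static  (traefik)
print(classify("172.18.0.128"))  # dynamic (traefik-manager)
```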
Procedure (scripted; full state snapshot at /root/dockerproxy-recreate-2026-05-17/ on Unraid):
- Capture container→static-IP map from docker inspect
- docker stop all 32 in parallel
- docker network disconnect dockerproxy <each>
- docker network rm dockerproxy
- docker network create --driver bridge --subnet 172.18.0.0/16 --gateway 172.18.0.1 --ip-range 172.18.0.128/25 dockerproxy
- docker network connect --ip <preserved-static> dockerproxy <each> (no --ip for the one dynamic container, traefik-manager)
- Start in dependency order: dockersocket → postgresql17 + Redis → traefik + traefik-manager → authentik + authentik-worker → rest
Final state: all 32 containers up, all static IPs preserved (.3–.70), traefik-manager moved from .2 to .128 (first dynamic slot).
Reference: docs/13-DOCKERPROXY-NETWORK.md documents the network spec, recreate command, and current IP assignments.
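The capture and reconnect steps of the procedure can be sketched as follows. This is not the script that was run during the incident. It assumes the usual `docker inspect` JSON layout (`Name`, `NetworkSettings.Networks.<network>.IPAMConfig.IPv4Address`), which should be checked against real output. `ip_plan` and `connect_commands` are hypothetical helpers that only build command strings and execute nothing.

```python
import json
import shlex

def ip_plan(inspect_output: str, network: str = "dockerproxy") -> dict:
    """Map container name -> pinned IPv4 address, or None if dynamically assigned."""
    plan = {}
    for container in json.loads(inspect_output):
        endpoint = container["NetworkSettings"]["Networks"].get(network)
        if endpoint is None:
            continue  # container is not attached to this network
        ipam = endpoint.get("IPAMConfig") or {}
        plan[container["Name"].lstrip("/")] = ipam.get("IPv4Address") or None
    return plan

def connect_commands(plan: dict, network: str = "dockerproxy") -> list:
    """Build one `docker network connect` command string per container."""
    commands = []
    for name, ip in sorted(plan.items()):
        args = ["docker", "network", "connect"]
        if ip:
            args += ["--ip", ip]  # preserve the static reservation
        args += [network, name]
        commands.append(shlex.join(args))
    return commands

# Trimmed, hand-written sample in the shape of `docker inspect` output.
sample = json.dumps([
    {"Name": "/traefik", "NetworkSettings": {"Networks": {
        "dockerproxy": {"IPAMConfig": {"IPv4Address": "172.18.0.3"}}}}},
    {"Name": "/traefik-manager", "NetworkSettings": {"Networks": {
        "dockerproxy": {"IPAMConfig": None}}}},
])
for command in connect_commands(ip_plan(sample)):
    print(command)
```

Capturing the plan before `docker network rm` matters because the reservation lives in each container's network endpoint, which is lost once the container is disconnected.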
Follow-up
- IPAM redesign eliminates the auto-allocation collision class. Future stacks that omit ipv4_address land safely in .128+.
- Network is imperative (not in any compose); recreate command is documented in docs/13-DOCKERPROXY-NETWORK.md for disaster recovery.
- Add an Uptime Kuma docker-state monitor for traefik (HTTP probes route through Traefik and fail uselessly when Traefik is the outage).
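A container-state check looks at the `State` object from `docker inspect` rather than at an HTTP response. The sketch below shows the idea only. `container_alert` is a hypothetical helper, not Uptime Kuma's implementation, and the sample values are the ones reported under Symptoms.

```python
from typing import Optional

def container_alert(state: dict) -> Optional[str]:
    """Return an alert message if the container is not running, else None."""
    status = state.get("Status")
    if status == "running":
        return None
    # Prefer Docker's own error text; fall back to the exit code.
    detail = state.get("Error") or f"exit code {state.get('ExitCode')}"
    return f"container is {status}: {detail}"

# State reported for traefik during the incident.
incident_state = {
    "Status": "created",
    "Error": "failed to set up container networking: Address already in use",
    "ExitCode": 128,
}
print(container_alert(incident_state))
```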