infrastructure/docs/incidents/2026-05-17-traefik-ip-collision.md
jazzymc dd1c15cf6b dockerproxy: redesign IPAM with static block + dynamic /25 pool
Recreated dockerproxy network with --ip-range 172.18.0.128/25 so Docker
auto-allocations are isolated from the .2-.127 static reservation block.
Eliminates IP-collision class that caused the 2026-05-17 Traefik outage.

Adds 13-DOCKERPROXY-NETWORK.md as the canonical reference for the
network spec, recreate command, and current IP assignments.
2026-05-17 08:36:46 +03:00


Incident: Traefik IP Collision on dockerproxy Network

Date: 2026-05-17 (root cause 04:02-04:39 UTC)
Severity: P1 (full reverse-proxy outage)
Status: Resolved
Affected: all *.xtrm-lab.org services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, ~70 others)


Symptoms

  • traefik container stuck in Created state, never reached Running.
  • docker inspect traefik reported:
    • State.Status: created
    • State.Error: failed to set up container networking: Address already in use
    • State.ExitCode: 128
  • All *.xtrm-lab.org hostnames unreachable through the proxy.
  • traefik-manager sidecar remained healthy but isolated.
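
The stuck state above can be detected mechanically. The following is a minimal sketch, assuming the JSON array printed by `docker inspect` is available as a string; the function name `stuck_in_created` and the sample data are illustrative, and only the `Name` and `State` fields quoted in the symptoms are read.

```python
import json

def stuck_in_created(inspect_output: str) -> list[dict]:
    """Return name, error and exit code for containers left in 'created' with an error."""
    stuck = []
    for container in json.loads(inspect_output):
        state = container.get("State", {})
        # A healthy, never-started container is also 'created'; the error string
        # is what distinguishes a failed start.
        if state.get("Status") == "created" and state.get("Error"):
            stuck.append({
                "name": container.get("Name", "").lstrip("/"),
                "error": state["Error"],
                "exit_code": state.get("ExitCode"),
            })
    return stuck

# Hypothetical sample shaped like the fields quoted above; real output has many more keys.
sample = json.dumps([
    {"Name": "/traefik", "State": {
        "Status": "created",
        "Error": "failed to set up container networking: Address already in use",
        "ExitCode": 128}},
    {"Name": "/traefik-manager", "State": {"Status": "running", "Error": "", "ExitCode": 0}},
])

print(stuck_in_created(sample))
```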

Root Cause

Static-IP collision on the dockerproxy Docker bridge (172.18.0.0/16).

Timeline (UTC):

  • 04:02:40 - ewa-apps (Dockge stack at /mnt/user/appdata/dockge/stacks/ewa-apps) restarted. Its compose file joined dockerproxy without ipv4_address, so Docker handed it the lowest free IP, 172.18.0.3. That address is reserved for Traefik, but Traefik was momentarily down.
  • 04:39:09 - traefik was recreated and requested its IPAMConfig-reserved 172.18.0.3. The address was already taken, so the container was left in Created.

dockerproxy is a user-defined bridge on which critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), while other stacks, such as ewa-apps, had no reservation. Whichever container starts first wins the contested address.
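
The race can be illustrated with a small model. This is a sketch of the "lowest free IP" behaviour as described in the timeline, not of Docker's actual allocator; the function `lowest_free_ip` is hypothetical, and the set of in-use addresses is assembled from the IPs mentioned in this report.

```python
import ipaddress

def lowest_free_ip(subnet: str, in_use: set[str]) -> str:
    """Return the lowest host address in `subnet` that is not in `in_use`."""
    taken = {ipaddress.ip_address(a) for a in in_use}
    for host in ipaddress.ip_network(subnet).hosts():
        if host not in taken:
            return str(host)
    raise RuntimeError("subnet exhausted")

# Addresses held while Traefik was down: gateway, traefik-manager,
# postgresql17, vaultwarden. Traefik's .3 is reserved in its config but not held.
in_use = {"172.18.0.1", "172.18.0.2", "172.18.0.10", "172.18.0.15"}

print(lowest_free_ip("172.18.0.0/16", in_use))  # 172.18.0.3
```

With Traefik running and holding .3, the same call returns 172.18.0.4, which matches the address ewa-apps received during the resolution.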


Resolution

  1. docker stop ewa-apps (released .3)
  2. docker start traefik (claimed .3 via IPAMConfig)
  3. docker start ewa-apps (got .4)
  4. Logs immediately showed Authentik/Traefik routing chains resuming.

Preventive Fix

Pinned static IP for ewa-apps in its Dockge compose:

networks:
  dockerproxy:
    ipv4_address: 172.18.0.70

Backup of original at /mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17.

Final state after docker compose up -d:

Container   IP
traefik     172.18.0.3
ewa-apps    172.18.0.70
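
Other containers without a pin can be found by reading `docker inspect` output. The following is a minimal sketch, assuming the output has already been parsed into a list of dicts; it relies on the `NetworkSettings.Networks.<name>.IPAMConfig` field, which is null when no static address was requested. The helper name `unpinned_containers` and the sample data are illustrative.

```python
def unpinned_containers(containers: list[dict], network: str = "dockerproxy") -> list[str]:
    """Return names of containers attached to `network` without a static IPv4 request."""
    names = []
    for container in containers:
        networks = container.get("NetworkSettings", {}).get("Networks", {})
        attachment = networks.get(network)
        if attachment is None:
            continue  # not attached to this network at all
        ipam = attachment.get("IPAMConfig") or {}
        if not ipam.get("IPv4Address"):
            names.append(container.get("Name", "").lstrip("/"))
    return names

# Hypothetical sample reflecting the state before ewa-apps was pinned.
sample_containers = [
    {"Name": "/traefik", "NetworkSettings": {"Networks": {"dockerproxy": {
        "IPAMConfig": {"IPv4Address": "172.18.0.3"}, "IPAddress": "172.18.0.3"}}}},
    {"Name": "/ewa-apps", "NetworkSettings": {"Networks": {"dockerproxy": {
        "IPAMConfig": None, "IPAddress": "172.18.0.4"}}}},
    {"Name": "/unrelated", "NetworkSettings": {"Networks": {"bridge": {
        "IPAMConfig": None, "IPAddress": "172.17.0.2"}}}},
]

print(unpinned_containers(sample_containers))
```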

Structural Fix — IPAM Redesign (~05:15 UTC same day)

Compose-level pinning fixes one container at a time. To eliminate this whole class of failure, the dockerproxy network was recreated with a dedicated dynamic IP range. Because a Docker network's IPAM settings cannot be changed after creation, the network had to be torn down and rebuilt with all 32 containers offline (~3 min outage).

New IPAM:

Subnet:   172.18.0.0/16
Gateway:  172.18.0.1
IPRange:  172.18.0.128/25   ← Docker only auto-assigns from .128-.255
  • .2-.127 is now a static-only reservation block
  • .128-.255 is the dynamic pool
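
The boundaries of the two blocks can be checked with Python's `ipaddress` module. This is a small sketch; the `classify` helper and its labels are illustrative, and the constants are taken from the IPAM settings above.

```python
import ipaddress

SUBNET = ipaddress.ip_network("172.18.0.0/16")
DYNAMIC_POOL = ipaddress.ip_network("172.18.0.128/25")
STATIC_FIRST = ipaddress.ip_address("172.18.0.2")
STATIC_LAST = ipaddress.ip_address("172.18.0.127")

def classify(addr: str) -> str:
    """Label an address as 'dynamic', 'static', 'other' (rest of the /16) or 'outside'."""
    ip = ipaddress.ip_address(addr)
    if ip in DYNAMIC_POOL:
        return "dynamic"
    if STATIC_FIRST <= ip <= STATIC_LAST:
        return "static"
    if ip in SUBNET:
        return "other"  # gateway, network address, or the unused remainder of the /16
    return "outside"

# First address, last address and size of the /25 pool.
print(DYNAMIC_POOL[0], DYNAMIC_POOL[-1], DYNAMIC_POOL.num_addresses)
# 172.18.0.128 172.18.0.255 128

for addr in ("172.18.0.3", "172.18.0.70", "172.18.0.128"):
    print(addr, classify(addr))
```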

Procedure (scripted; full state snapshot at /root/dockerproxy-recreate-2026-05-17/ on Unraid):

  1. Capture container→static-IP map from docker inspect
  2. docker stop on all 32 containers, in parallel
  3. docker network disconnect dockerproxy <each>
  4. docker network rm dockerproxy
  5. docker network create --driver bridge --subnet 172.18.0.0/16 --gateway 172.18.0.1 --ip-range 172.18.0.128/25 dockerproxy
  6. docker network connect --ip <preserved-static> dockerproxy <each> (no --ip for the one dynamic container, traefik-manager)
  7. Start in dependency order: dockersocket → postgresql17 + Redis → traefik + traefik-manager → authentik + authentik-worker → rest
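
The procedure above can be sketched as a dry-run plan generator. This is not the script that was actually run (that one lives in the snapshot directory); it only builds the command strings from a container-to-IP map so they can be reviewed before execution. The function `recreate_plan` and the three-container example are hypothetical, and running the stops in parallel is left to whatever executes the plan.

```python
NETWORK = "dockerproxy"

def recreate_plan(static_ips: dict[str, str], dynamic: list[str],
                  start_order: list[str]) -> list[str]:
    """Build the ordered list of docker commands for a network recreate (dry run)."""
    containers = list(static_ips) + dynamic
    plan = [f"docker stop {c}" for c in containers]
    plan += [f"docker network disconnect {NETWORK} {c}" for c in containers]
    plan.append(f"docker network rm {NETWORK}")
    plan.append(
        "docker network create --driver bridge --subnet 172.18.0.0/16 "
        f"--gateway 172.18.0.1 --ip-range 172.18.0.128/25 {NETWORK}"
    )
    # Static containers get their preserved address back; dynamic ones omit --ip
    # and receive an address from the /25 pool.
    plan += [f"docker network connect --ip {ip} {NETWORK} {c}"
             for c, ip in static_ips.items()]
    plan += [f"docker network connect {NETWORK} {c}" for c in dynamic]
    plan += [f"docker start {c}" for c in start_order]
    return plan

# Illustrative subset of the 32 containers.
plan = recreate_plan(
    static_ips={"traefik": "172.18.0.3", "postgresql17": "172.18.0.10"},
    dynamic=["traefik-manager"],
    start_order=["postgresql17", "traefik", "traefik-manager"],
)
print("\n".join(plan))
```

Keeping the plan as plain strings makes it easy to diff against the snapshot of the previous state before anything is torn down.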

Final state: all 32 containers up, all static IPs preserved (.3-.70), traefik-manager moved from .2 to .128 (first dynamic slot).

Reference: docs/13-DOCKERPROXY-NETWORK.md documents the network spec, recreate command, and current IP assignments.

Follow-up

  • IPAM redesign eliminates the auto-allocation collision class. Future stacks that omit ipv4_address land safely in .128+.
  • The network is created imperatively (it is not defined in any compose file); the recreate command is documented in docs/13-DOCKERPROXY-NETWORK.md for disaster recovery.
  • Add an Uptime Kuma docker-state monitor for traefik (HTTP probes route through Traefik and fail uselessly when Traefik is the outage).