Incident: Traefik IP Collision on dockerproxy Network

Date: 2026-05-17 (root cause 04:02 to 04:39 UTC)
Severity: P1 (full reverse-proxy outage)
Status: Resolved
Affected: all *.xtrm-lab.org services routed through Traefik (vaultwarden, authentik, gitea, uptime-kuma, transmission, urbackup, unimus, netalert, openbrain, and roughly 70 others)


Symptoms

  • traefik container stuck in Created state, never reached Running.
  • docker inspect traefik reported:
    • State.Status: created
    • State.Error: failed to set up container networking: Address already in use
    • State.ExitCode: 128
  • All *.xtrm-lab.org hostnames unreachable through the proxy.
  • traefik-manager sidecar remained healthy but isolated.
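
The state fields above can also be read programmatically, for example from a monitoring script. This is a minimal sketch, assuming the JSON array printed by `docker inspect traefik` has been captured as a string. The function names `container_state` and `is_stuck` are illustrative, and the sample payload is trimmed to the three values observed in this incident.

```python
import json


def container_state(inspect_output: str) -> dict:
    """Extract the State fields quoted in this report from `docker inspect` JSON."""
    # `docker inspect <name>` prints a JSON array with one object per container.
    state = json.loads(inspect_output)[0]["State"]
    return {
        "status": state.get("Status"),
        "error": state.get("Error"),
        "exit_code": state.get("ExitCode"),
    }


def is_stuck(state: dict) -> bool:
    """True when the container sits in `created` with a startup error recorded."""
    return state["status"] == "created" and bool(state["error"])


# Sample payload trimmed to the fields observed during the incident.
SAMPLE = json.dumps([{
    "State": {
        "Status": "created",
        "Error": "failed to set up container networking: Address already in use",
        "ExitCode": 128,
    }
}])
```

Calling `is_stuck(container_state(SAMPLE))` flags the condition seen here: a container that was created but never started because its network setup failed.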

Root Cause

Static-IP collision on the dockerproxy Docker bridge (172.18.0.0/16).

Timeline (UTC):

  • 04:02:40: ewa-apps (Dockge stack at /mnt/user/appdata/dockge/stacks/ewa-apps) restarted. Its compose file joined dockerproxy without an ipv4_address, so Docker assigned it the lowest free IP, 172.18.0.3. That address is reserved for Traefik, but Traefik was momentarily down and was not holding it.
  • 04:39:09: traefik was recreated and requested its IPAMConfig-reserved address, 172.18.0.3. The address was already taken, so the container was left in Created.

dockerproxy is a user-defined bridge where critical services have hard-coded static IPs (traefik=.3, postgresql17=.10, vaultwarden=.15, etc.), but several stacks, including ewa-apps, had no reservation. A static reservation does not hold an address while its owner is stopped, so whichever container starts first wins the IP.
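
The race can be reproduced with a toy model. This is a minimal sketch, assuming dynamic allocation hands out the lowest free host address, as described in the timeline; the real Docker IPAM driver may differ in detail. The function `lowest_free_ip` and the assumption that .1 and .2 were already occupied are illustrative and not taken from the incident logs.

```python
import ipaddress


def lowest_free_ip(subnet: str, in_use: set) -> str:
    """Return the lowest host address in `subnet` that is not in `in_use`.

    Simplified model of dynamic allocation, not Docker's actual allocator.
    """
    for host in ipaddress.ip_network(subnet).hosts():
        if str(host) not in in_use:
            return str(host)
    raise RuntimeError("subnet exhausted")


SUBNET = "172.18.0.0/16"
TRAEFIK_IP = "172.18.0.3"

# Assumption: .1 is the bridge gateway and .2 is held by another container.
in_use = {"172.18.0.1", "172.18.0.2"}

# 04:02:40 - Traefik is down, so its reserved address is absent from `in_use`.
ewa_apps_ip = lowest_free_ip(SUBNET, in_use)
in_use.add(ewa_apps_ip)

# 04:39:09 - Traefik requests its static address and finds it occupied.
traefik_can_start = TRAEFIK_IP not in in_use
```

Under these assumptions `ewa_apps_ip` lands on Traefik's address and `traefik_can_start` is false, which matches the observed failure.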


Resolution

  1. docker stop ewa-apps (released .3)
  2. docker start traefik (claimed .3 via IPAMConfig)
  3. docker start ewa-apps (got .4)
  4. Logs immediately showed Authentik/Traefik routing chains resuming.
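
The order of the three commands matters: the contested address must be released before its owner starts. This is a minimal sketch of the same sequence as a reusable helper, assuming both names refer to containers that `docker stop` and `docker start` can address directly. The helper names `recovery_commands` and `run` are hypothetical, and `run` defaults to a dry run that only prints the commands.

```python
import subprocess


def recovery_commands(squatter: str, owner: str) -> list:
    """Build the docker commands in the order used during this incident."""
    return [
        ["docker", "stop", squatter],   # release the contested IP
        ["docker", "start", owner],     # owner claims its static IP
        ["docker", "start", squatter],  # squatter is assigned another free IP
    ]


def run(commands: list, dry_run: bool = True) -> None:
    """Print the commands, or execute them when dry_run is False."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            # check=True aborts the sequence as soon as one step fails.
            subprocess.run(argv, check=True)
```

For this incident the call would be `run(recovery_commands("ewa-apps", "traefik"))`, with `dry_run=False` once the printed sequence has been reviewed.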

Preventive Fix

Pinned static IP for ewa-apps in its Dockge compose:

networks:
  dockerproxy:
    ipv4_address: 172.18.0.70

Backup of original at /mnt/user/appdata/dockge/stacks/ewa-apps/compose.yaml.bak.2026-05-17.
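
The same check can be applied to a compose file before it is deployed. This is a minimal sketch, assuming the file has already been parsed into a dictionary (for example with the third-party PyYAML `yaml.safe_load`). The function name `services_missing_static_ip` and the service names in `EXAMPLE` are hypothetical.

```python
def services_missing_static_ip(compose: dict, network: str) -> list:
    """Return names of services attached to `network` without an ipv4_address."""
    missing = []
    for name, service in (compose.get("services") or {}).items():
        networks = service.get("networks") or {}
        if isinstance(networks, list):
            # Short syntax (`- dockerproxy`) cannot carry an ipv4_address.
            if network in networks:
                missing.append(name)
        elif network in networks:
            # Long syntax: the per-network mapping may be empty (None).
            if not (networks[network] or {}).get("ipv4_address"):
                missing.append(name)
    return sorted(missing)


# Hypothetical stack: one pinned service, two unpinned, one on another network.
EXAMPLE = {
    "services": {
        "pinned": {"networks": {"dockerproxy": {"ipv4_address": "172.18.0.70"}}},
        "unpinned": {"networks": ["dockerproxy"]},
        "also-unpinned": {"networks": {"dockerproxy": None}},
        "elsewhere": {"networks": ["backend"]},
    }
}
```

Any name returned by `services_missing_static_ip(EXAMPLE, "dockerproxy")` is a service that would receive a dynamically assigned address and could repeat this collision.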

Final state after docker compose up -d:

Container   IP
traefik     172.18.0.3
ewa-apps    172.18.0.70

Follow-up

  • Audit remaining Dockge stacks on dockerproxy for missing ipv4_address reservations.
  • Add an Uptime Kuma docker-state monitor for traefik (HTTP probes route through Traefik, so they all fail at once and point at nothing when Traefik itself is the outage).
  • This is the second Traefik-IP incident in 12 days (the first was the 2026-05-05 vaultwarden misroute). Consider a deliberate redesign of the dockerproxy IP map.
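
The audit in the first follow-up item can be run against live containers as well as compose files. This is a minimal sketch, assuming the JSON output of `docker inspect $(docker ps -aq)` has been captured as a string. It relies on the `IPAMConfig` field under `NetworkSettings.Networks`, which is empty when a container has no static address. The function name `audit_network` and the container `some-stack` in the sample are hypothetical.

```python
import json


def audit_network(inspect_output: str, network: str) -> dict:
    """Split containers on `network` into pinned and dynamically addressed."""
    pinned, dynamic = {}, {}
    for container in json.loads(inspect_output):
        endpoint = container["NetworkSettings"]["Networks"].get(network)
        if endpoint is None:
            continue  # not attached to this network
        name = container["Name"].lstrip("/")
        ipam = endpoint.get("IPAMConfig") or {}
        if ipam.get("IPv4Address"):
            pinned[name] = ipam["IPv4Address"]
        else:
            dynamic[name] = endpoint.get("IPAddress", "")
    return {"pinned": pinned, "dynamic": dynamic}


def endpoint(ip: str, static: bool) -> dict:
    """Build a trimmed sample endpoint entry for the demo payload."""
    return {"IPAMConfig": {"IPv4Address": ip} if static else None, "IPAddress": ip}


# Sample payload: the two containers from this report plus one hypothetical stack.
SAMPLE = json.dumps([
    {"Name": "/traefik",
     "NetworkSettings": {"Networks": {"dockerproxy": endpoint("172.18.0.3", True)}}},
    {"Name": "/ewa-apps",
     "NetworkSettings": {"Networks": {"dockerproxy": endpoint("172.18.0.70", True)}}},
    {"Name": "/some-stack",
     "NetworkSettings": {"Networks": {"dockerproxy": endpoint("172.18.0.4", False)}}},
])

report = audit_network(SAMPLE, "dockerproxy")
```

Every entry under `report["dynamic"]` is a candidate for a pinned `ipv4_address`, since it can take whatever address happens to be free when it starts.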