infrastructure/docs/02-SERVICES-CRITICAL.md
Kaloyan Danchev ecbce1ca94
Add VRRP failover infrastructure documentation (Nobara)
Deployed automatic failover for critical services (Traefik, Vaultwarden,
Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP
with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:03:26 +02:00


Critical Services

Last Updated: 2026-01-31

Services that must remain operational for network functionality and security.


Priority Levels

| Priority | Meaning | Recovery Target |
|----------|---------|-----------------|
| P0 | Network offline without this | < 5 minutes |
| P1 | Major functionality impacted | < 1 hour |

P0 - Network Core

DNS (AdGuard Home)

| Instance | Host | IP | Role |
|----------|------|----|------|
| Primary | HAP1 (container) | 172.17.0.2 | Main DNS |
| Secondary | XTRM-U (macvlan) | 192.168.10.10 | Failover DNS |

Failover: Automatic via Netwatch (ping + DNS resolution checks)

Config Sync: adguardhome-sync (every 30 min, Unraid → MikroTik)

Upstream: Quad9 DoH (https://dns.quad9.net/dns-query)

Web UI:

Recovery:

  1. If primary fails → automatic failover to secondary (192.168.10.10)
  2. Manual restart of the primary (RouterOS CLI on HAP1): /container start [find name~"adguard"]
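
To confirm from a client which resolver is actually answering, a check along these lines may help. It is only a sketch: it assumes `dig` is installed on the machine running it, and the helper name `dns_ok` and the default test name `example.com` are illustrative. It is not part of the Netwatch configuration on HAP1.

```shell
# dns_ok SERVER [NAME]: succeed if SERVER returns at least one A record for NAME.
# Hypothetical helper; NAME defaults to example.com purely for illustration.
dns_ok() {
    dns_server=$1
    dns_name=${2:-example.com}
    # +short prints only the answer records; one try with a 2 s timeout keeps it quick.
    dns_answer=$(dig +short +time=2 +tries=1 "@${dns_server}" "$dns_name" A 2>/dev/null) || return 1
    # An empty answer (e.g. a blocked or unresolved name) counts as failure.
    [ -n "$dns_answer" ]
}
```

For example, `dns_ok 192.168.10.10 && echo "secondary is answering"` checks the failover instance listed above.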

Routing (HAP1)

| Function | Details |
|----------|---------|
| WAN | 62.73.120.142 via Vivacom fiber |
| VLANs | 10 (Mgmt), 20 (Trusted), 25 (Kids), 30 (IoT), 40 (CatchAll) |
| NAT | Port forwarding to XTRM-U (192.168.10.20) |
| Firewall | RouterOS firewall rules |

Recovery:

  1. Physical access to HAP1
  2. Reset: hold reset button 5s
  3. Reconfigure via WinBox or SSH (port 2222)

DHCP (HAP1)

| VLAN | Pool | Range |
|------|------|-------|
| 10 (Mgmt) | pool-vlan10 | 192.168.10.100-200 |
| 20 (Trusted) | pool-vlan20 | 192.168.20.100-200 |
| 25 (Kids) | pool-vlan25 | 192.168.25.100-200 |
| 30 (IoT) | pool-vlan30 | 192.168.30.100-200 |
| 40 (CatchAll) | dhcp | 192.168.1.10-254 |

Lease Time: 30 minutes


P1 - Authentication & Secrets

Authentik (SSO)

| Component | IP | Purpose |
|-----------|----|---------|
| authentik | 172.18.0.11 | Web UI + OIDC |
| authentik-worker | 172.18.0.12 | Background tasks |

URL: https://auth.xtrm-lab.org

Protects:

  • Traefik forward auth (all *.xtrm-lab.org)
  • Gitea OAuth
  • Woodpecker OAuth
  • NetBox OAuth
  • NetDisco SSO

Recovery:

cd /mnt/user/appdata/authentik
docker compose up -d

Database: postgresql17 (authentik_db)
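
After the recovery command returns, the containers may still be initialising. A liveness probe such as the one below can confirm that the server is answering before dependent logins are retried. This is a sketch under assumptions: it relies on authentik's `/-/health/live/` endpoint being reachable through the public URL, and the helper name `authentik_live` is hypothetical.

```shell
# authentik_live [BASE_URL]: succeed if the authentik liveness endpoint answers without an HTTP error.
# Hypothetical helper; BASE_URL defaults to the URL documented above.
authentik_live() {
    ak_base_url=${1:-https://auth.xtrm-lab.org}
    # -f makes curl fail on HTTP errors (>= 400); -sS stays quiet but still reports failures.
    curl -fsS -o /dev/null --max-time 5 "${ak_base_url}/-/health/live/"
}
```

A typical use would be `until authentik_live; do sleep 5; done` after the compose stack has been started.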


Vaultwarden (Passwords)

| Component | IP | Purpose |
|-----------|----|---------|
| vaultwarden | 172.18.0.15 | Password manager |

URL: https://vault.xtrm-lab.org

Data: /mnt/user/appdata/vaultwarden/

Recovery:

docker start vaultwarden

Backup: Part of Unraid flash backup


P1 - Reverse Proxy

Traefik

| Component | IP | Ports |
|-----------|----|-------|
| traefik | 172.18.0.3 | 8001→80, 44301→443 |

Config: /mnt/user/appdata/traefik/

  • traefik.yml - Static config
  • dynamic.yml - Routers & services

TLS: Let's Encrypt wildcard for *.xtrm-lab.org

Recovery:

docker start traefik

Shared Infrastructure

PostgreSQL 17

| IP | Databases |
|----|-----------|
| 172.18.0.13 | authentik_db, netbox, gitea, netdisco_db, diode, hydra |

Data: /mnt/user/appdata/postgresql17/

Recovery:

docker start postgresql17
# Wait for DB to be ready before starting dependents
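
Rather than waiting a fixed time, readiness can be polled. The sketch below assumes the container image ships `pg_isready` (the official PostgreSQL image does); the function name `wait_for_postgres` and the 60-second default timeout are illustrative choices, not documented settings.

```shell
# wait_for_postgres [CONTAINER] [TIMEOUT_SECONDS]
# Polls pg_isready inside the container until the server accepts connections.
# Hypothetical helper; defaults are assumptions, not values from this runbook.
wait_for_postgres() {
    pg_container=${1:-postgresql17}
    pg_timeout=${2:-60}
    pg_waited=0
    # pg_isready exits 0 only when the server is accepting connections; -q suppresses output.
    until docker exec "$pg_container" pg_isready -q; do
        if [ "$pg_waited" -ge "$pg_timeout" ]; then
            echo "PostgreSQL in $pg_container not ready after ${pg_timeout}s" >&2
            return 1
        fi
        sleep 1
        pg_waited=$((pg_waited + 1))
    done
}
```

Usage: `docker start postgresql17 && wait_for_postgres && docker start Redis`.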

Redis

| IP | Consumers |
|----|-----------|
| 172.18.0.14 | Authentik, NetBox, Diode |

Recovery:

docker start Redis

Startup Order

When recovering from a full outage, start services in this order:

  1. postgresql17 - Database (wait 30s)
  2. Redis - Cache/queue (wait 10s)
  3. traefik - Reverse proxy
  4. authentik + authentik-worker - SSO
  5. vaultwarden - Passwords
  6. All other services
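
The order above can be scripted roughly as follows. This is a sketch, assuming every container already exists under the names used in this document and can be started with `docker start`; the function name `start_critical_stack` is hypothetical. If Authentik is managed through its compose project, the `docker compose up -d` recovery step may be the better fit for that stage.

```shell
# start_critical_stack: start the critical containers in the documented order,
# pausing after the database and the cache as the startup list specifies.
# Hypothetical helper; container names are taken from this document.
start_critical_stack() {
    # Each entry is "container:seconds-to-wait-afterwards"; 0 means continue immediately.
    for entry in postgresql17:30 Redis:10 traefik:0 authentik:0 authentik-worker:0 vaultwarden:0; do
        svc=${entry%%:*}
        pause=${entry##*:}
        echo "starting $svc"
        # Stop at the first failure so dependents are not started without their backend.
        docker start "$svc" >/dev/null || { echo "failed to start $svc" >&2; return 1; }
        if [ "$pause" -gt 0 ]; then
            sleep "$pause"
        fi
    done
    return 0
}
```

Stopping at the first failure is a deliberate choice here: starting Authentik without PostgreSQL or Redis would only produce a crash loop.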

Monitoring

Uptime Kuma

| URL | Monitors |
|-----|----------|
| https://uptime.xtrm-lab.org | 27 services |

Alerts: Configured per service (email/webhook)


Backup Strategy

| Data | Location | Frequency |
|------|----------|-----------|
| Unraid Flash | Google Drive | Daily |
| PostgreSQL | /mnt/user/Backup/ | Daily |
| Vaultwarden | Unraid Flash | With flash backup |
| Authentik | PostgreSQL + /mnt/user/appdata/authentik/ | Daily |
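
This document does not say how the daily PostgreSQL backup is produced. One common approach is a per-database `pg_dump`, sketched below. Everything specific here is an assumption: the `postgres` role name, the dated file naming, and the helper name `backup_database`.

```shell
# backup_database DB [DEST_DIR]: dump one database from the postgresql17 container.
# Hypothetical helper; the "postgres" role and the file naming are assumptions.
backup_database() {
    bk_db=$1
    bk_dir=${2:-/mnt/user/Backup}
    bk_out="$bk_dir/${bk_db}-$(date +%F).sql"
    # Write to a temporary file first so a failed dump never replaces a good one.
    if docker exec postgresql17 pg_dump -U postgres "$bk_db" > "$bk_out.tmp"; then
        mv "$bk_out.tmp" "$bk_out"
    else
        rm -f "$bk_out.tmp"
        return 1
    fi
}
```

For example, `backup_database authentik_db` would write a dated dump of the Authentik database under /mnt/user/Backup/.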

Active Failover: XTRM-Nobara

Critical services are replicated on the Nobara workstation with automatic VRRP failover:

| Service | Primary (XTRM-U) | Failover (XTRM-Nobara) |
|---------|------------------|------------------------|
| Traefik | 192.168.10.20 | 192.168.10.103 |
| Vaultwarden | 192.168.10.20 | 192.168.10.103 |
| Authentik | 192.168.10.20 | 192.168.10.103 |
| AdGuard Home | 192.168.10.20 | 192.168.10.103 |

VIP: 192.168.10.250 (floats between XTRM-U and XTRM-Nobara via Keepalived VRRP)

Failover time: ~4 seconds

See: 10-FAILOVER-NOBARA.md for full documentation.
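
To see which host currently holds the VIP, the check below can be run on either XTRM-U or XTRM-Nobara. It is a sketch that assumes the iproute2 `ip` tool is available on the host; the helper name `holds_vip` is hypothetical and is not part of the Keepalived setup.

```shell
# holds_vip [VIP]: succeed if this host currently has VIP assigned to any interface.
# Hypothetical helper; VIP defaults to the address documented above.
holds_vip() {
    vip=${1:-192.168.10.250}
    # "ip -o" prints one line per address; field 4 is "address/prefix".
    ip -o -4 addr show | awk -v vip="$vip" '
        { split($4, a, "/"); if (a[1] == vip) found = 1 }
        END { exit !found }'
}
```

Usage: `holds_vip && echo "this host is the VRRP master"`.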


Future: XTRM-N1 Survival Node

When the hardware upgrade completes, these services will have replicas on XTRM-N1:

| Service | Primary | Replica |
|---------|---------|---------|
| DNS | HAP1 | XTRM-N1 |
| Vaultwarden | XTRM-N5 | XTRM-N1 |
| Authentik | XTRM-N5 | XTRM-N1 |

See: wip/UPGRADE-2026-HARDWARE.md