infrastructure/docs/10-FAILOVER-NOBARA.md
Kaloyan Danchev ecbce1ca94
Add VRRP failover infrastructure documentation (Nobara)
Deployed automatic failover for critical services (Traefik, Vaultwarden,
Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP
with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:03:26 +02:00

Failover Infrastructure - Nobara (XTRM-Nobara)

Last Updated: 2026-02-13

Purpose: Temporary failover for critical services during Unraid maintenance windows.


Overview

A Docker-based replica of critical services runs on the Nobara Linux workstation (XTRM-Nobara) with automatic failover via Keepalived VRRP. When Unraid goes offline, the virtual IP floats to Nobara and services continue operating.

Clients → 192.168.10.250 (VIP) → XTRM-U (MASTER, priority 150)
                                    ↓ failover (~4 seconds)
                                  XTRM-Nobara (BACKUP, priority 100)

Machines

| Role | Host | IP | Interface | Priority |
|---|---|---|---|---|
| MASTER | XTRM-U (Unraid) | 192.168.10.20 | br0 | 150 |
| BACKUP | XTRM-Nobara | 192.168.10.103 | enp5s0 | 100 |
| VIP | Shared | 192.168.10.250 | - | - |

Replicated Services

| Service | Image | Ports (Nobara) | Domain |
|---|---|---|---|
| Traefik | traefik:latest | 80, 443, 8080 | *.xtrm-lab.org |
| Vaultwarden | vaultwarden/server:latest | internal:80 | vault.xtrm-lab.org |
| Authentik | ghcr.io/goauthentik/server:2025.8.1 | internal:9000 | auth.xtrm-lab.org |
| Authentik Worker | ghcr.io/goauthentik/server:2025.8.1 | - | - |
| PostgreSQL | postgres:17 | internal:5432 | - |
| Redis | redis:7-alpine | internal:6379 | - |
| AdGuard Home | adguard/adguardhome:latest | 192.168.10.103:53, 3000 | - |

File Locations

Nobara (XTRM-Nobara)

| Path | Contents |
|---|---|
| /home/failover/docker-compose.yml | Main compose stack |
| /home/failover/traefik/ | Traefik config, certs, acme.json |
| /home/failover/vaultwarden/ | Vaultwarden data (copy from Unraid) |
| /home/failover/authentik/ | Authentik media & templates |
| /home/failover/postgres/ | PostgreSQL data + initial dump |
| /home/failover/redis/ | Redis data |
| /home/failover/adguard/ | AdGuard conf & work dirs |
| /etc/keepalived/keepalived.conf | Keepalived VRRP config |
| /usr/local/bin/check_failover.sh | Health check script |
| /usr/local/bin/failover-notify.sh | State change notification script |
| /var/log/keepalived-failover.log | Failover event log |

Unraid (XTRM-U)

| Path | Contents |
|---|---|
| /mnt/user/appdata/keepalived/keepalived.conf | Keepalived VRRP config |
| /mnt/user/appdata/keepalived/check_services.sh | Health check script |

Keepalived Configuration

VRRP Parameters

| Parameter | Value |
|---|---|
| Virtual Router ID | 51 |
| Auth Type | PASS |
| Auth Password | xtrm2026 |
| Advertisement Interval | 1 second |
| Health Check Interval | 5 seconds |
| Fail Threshold | 3 missed checks |
| Recovery Threshold | 2 successful checks |
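
The ~4 second failover figure can be sanity-checked against these timers. The sketch below assumes Keepalived is running VRRPv2 (RFC 3768), where a backup declares the master down after 3 × the advertisement interval plus a skew time of (256 − priority) / 256 seconds. The function name is illustrative and not part of the deployed scripts.

```shell
#!/bin/sh
# Estimate how long a VRRPv2 backup waits before taking over the VIP.
# Assumption: VRRPv2 timers as defined in RFC 3768.
master_down_interval() {
    advert_int="$1"   # advertisement interval in seconds
    priority="$2"     # effective priority of the backup router
    awk -v a="$advert_int" -v p="$priority" 'BEGIN {
        skew = (256 - p) / 256          # Skew_Time
        printf "%.2f\n", 3 * a + skew   # Master_Down_Interval
    }'
}

# Nobara: base priority 100, plus weight 2 while its health check passes.
master_down_interval 1 102   # prints 3.60
```

Roughly 3.6 seconds of silence from Unraid is therefore enough for Nobara to take over, which is in line with the quoted ~4 seconds. If Keepalived on Unraid is stopped cleanly it may announce its departure, in which case the takeover can be faster than this estimate.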

Unraid (MASTER)

  • Runs as Docker container: local/keepalived (built from alpine + keepalived + curl)
  • Priority: 150 (+ health check weight 2 = 152 when healthy)
  • Health check: curls http://localhost:8183/api/overview (Traefik dashboard)
  • Preemption: enabled (will reclaim VIP from Nobara when healthy)

# Start, stop, and view logs on Unraid
docker start keepalived
docker stop keepalived
docker logs keepalived
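
The actual contents of check_services.sh are not reproduced in this document. As a minimal sketch of what a curl-based check against the Traefik dashboard could look like, assuming the URL above and a hypothetical function name `check_traefik`:

```shell
#!/bin/sh
# Hypothetical sketch of a Keepalived check for Unraid (not the deployed script).
# Keepalived only looks at the exit status: 0 = healthy, non-zero = failed.
CHECK_URL="${CHECK_URL:-http://localhost:8183/api/overview}"

check_traefik() {
    # --fail turns HTTP errors (4xx/5xx) into a non-zero exit status;
    # --max-time keeps a hung Traefik from stalling the 5-second check cycle.
    curl --fail --silent --max-time 3 --output /dev/null "$CHECK_URL"
}
```

A real check script would end by calling `check_traefik`, so that the function's status becomes the script's exit status.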

Nobara (BACKUP)

  • Runs as systemd service: keepalived.service
  • Priority: 100 (+ health check weight 2 = 102 when healthy)
  • Health check: verifies Traefik and Vaultwarden containers are running
  • nopreempt set (won't fight for VIP if Unraid is healthy)

# Start, stop, and follow logs on Nobara
sudo systemctl start keepalived
sudo systemctl stop keepalived
sudo journalctl -u keepalived -f
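
To make the parameters above concrete, the following sketch writes an illustrative BACKUP configuration to a scratch file. It is assembled only from the values listed in this document. The names `chk_failover` and `VI_FAILOVER`, the use of `notify` for the notification script, and the `AUTH_PASS` placeholder are assumptions, so the real /etc/keepalived/keepalived.conf may differ.

```shell
#!/bin/sh
# Write an illustrative BACKUP config to a scratch file (not /etc/keepalived).
# Assumption: block names and the notify hook are placeholders, not the real config.
OUT="${OUT:-./keepalived.conf.example}"
AUTH_PASS="${AUTH_PASS:-CHANGE_ME}"   # use the value from the VRRP Parameters table

cat > "$OUT" <<EOF
vrrp_script chk_failover {
    script "/usr/local/bin/check_failover.sh"
    interval 5      # health check interval (seconds)
    fall 3          # failed checks before the script counts as down
    rise 2          # successful checks before it counts as up again
    weight 2        # priority bonus while healthy (100 -> 102)
}

vrrp_instance VI_FAILOVER {
    state BACKUP
    interface enp5s0
    virtual_router_id 51
    priority 100
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass $AUTH_PASS
    }
    virtual_ipaddress {
        192.168.10.250
    }
    track_script {
        chk_failover
    }
    notify /usr/local/bin/failover-notify.sh
}
EOF
echo "wrote $OUT"
```

If the installed Keepalived version supports config testing, `keepalived -t -f ./keepalived.conf.example` should report syntax problems before anything is copied into place.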

DNS Strategy

Approach: Local DNS override via AdGuard Home.

To route internal clients through the VIP, configure AdGuard DNS rewrite rules that resolve *.xtrm-lab.org → 192.168.10.250. External (Cloudflare) DNS remains pointed at Unraid's public IP.
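
One way to confirm that a rewrite is in effect is to query an AdGuard instance directly and compare the answer with the VIP. This is a sketch that assumes `dig` is available on the client; the function name `resolves_to_vip` is hypothetical.

```shell
#!/bin/sh
# Check that a hostname resolves to the VIP through a given DNS server.
# Assumption: `dig` is installed on the machine running the check.
VIP="192.168.10.250"

resolves_to_vip() {
    name="$1"     # hostname to look up, e.g. vault.xtrm-lab.org
    server="$2"   # DNS server to query, e.g. the AdGuard instance on Nobara
    # +short prints only the answer records, one per line;
    # grep -x requires a whole-line match against the VIP.
    dig +short "$name" "@$server" | grep -qxF "$VIP"
}
```

For example, `resolves_to_vip vault.xtrm-lab.org 192.168.10.103 && echo "rewrite active"` queries the AdGuard instance on Nobara.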


Operations

Before Maintenance (Data Sync)

Run these commands from the Mac to sync latest data to Nobara:

# 1. Sync Vaultwarden data
ssh unraid "tar czf - -C /mnt/user/appdata vaultwarden/" | \
  ssh nobara "tar xzf - -C /home/failover/"

# 2. Dump and sync Authentik database
ssh unraid "docker exec postgresql17 pg_dump -U authentik_user authentik_db" | \
  ssh nobara "cat > /home/failover/postgres/authentik_dump.sql"

# 3. Sync AdGuard config
ssh unraid "tar czf - -C /mnt/user/appdata/adguardhome conf/ work/" | \
  ssh nobara "tar xzf - -C /home/failover/adguard/"

# 4. Sync Traefik config and certs
ssh unraid "tar czf - -C /mnt/user/appdata/traefik traefik.yml dynamic.yml acme.json certs/" | \
  ssh nobara "tar xzf - -C /home/failover/traefik/"

Note: ssh unraid = ssh -i ~/.ssh/id_ed25519_unraid -p 422 root@192.168.10.20
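
Each pipeline above reports only the exit status of its final `ssh nobara` command, so a failed pg_dump on Unraid can leave an empty or truncated dump without any visible error. As a hedged sketch, the helper below checks for the completion comment that pg_dump writes at the end of a plain-text dump. The function name is illustrative, and it is meant to run on the machine that holds the file.

```shell
#!/bin/sh
# Return success only if a plain-text pg_dump file looks complete.
# pg_dump's plain format ends with a "PostgreSQL database dump complete" comment.
dump_is_complete() {
    dump_file="$1"
    # -s: the file exists and is not empty; then look for the marker near the end.
    [ -s "$dump_file" ] &&
        tail -n 10 "$dump_file" | grep -q 'PostgreSQL database dump complete'
}
```

On Nobara this could be called as `dump_is_complete /home/failover/postgres/authentik_dump.sql || echo "dump incomplete, re-run step 2"`.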

Start Failover Services

# On Nobara
cd /home/failover
sudo docker compose up -d
sudo systemctl start keepalived

Stop Failover Services

# On Nobara
cd /home/failover
sudo docker compose down
sudo systemctl stop keepalived

Test Failover

# 1. Check VIP location
ssh unraid "ip addr show br0 | grep inet"
ssh nobara "ip addr show enp5s0 | grep inet"

# 2. Simulate Unraid failure
ssh unraid "docker stop keepalived"

# 3. Verify VIP moved to Nobara (wait ~4 seconds)
ssh nobara "ip addr show enp5s0 | grep inet"

# 4. Restore Unraid
ssh unraid "docker start keepalived"

# 5. Verify VIP returned to Unraid
ssh unraid "ip addr show br0 | grep inet"
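
Instead of waiting a fixed few seconds and checking by eye, steps 3 and 5 can be polled. The sketch below assumes the `unraid` and `nobara` SSH aliases used in this document; `wait_for_vip` is a hypothetical helper, and the elapsed time it prints ignores SSH connection overhead, so treat it as approximate.

```shell
#!/bin/sh
# Poll a host over SSH until the VIP appears on the given interface.
# Assumes the `unraid` and `nobara` SSH aliases used throughout this document.
VIP="192.168.10.250"

wait_for_vip() {
    host="$1"; iface="$2"; timeout="${3:-10}"
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # -o prints one address per line, e.g. "inet 192.168.10.250/32 ..."
        if ssh "$host" "ip -4 -o addr show dev $iface" | grep -qF " $VIP/"; then
            echo "VIP on $host after ~${elapsed}s"
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    echo "VIP not on $host after ${timeout}s" >&2
    return 1
}
```

After step 2, `wait_for_vip nobara enp5s0 15` would wait for the takeover. After step 4, `wait_for_vip unraid br0 15` would wait for Unraid to reclaim the VIP.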

Check Status

# Nobara service status (-t allocates a TTY so sudo can prompt for its password)
ssh -t nobara "sudo docker ps --format 'table {{.Names}}\t{{.Status}}'"

# Nobara keepalived state
ssh -t nobara "sudo journalctl -u keepalived -n 10 --no-pager"

# Unraid keepalived state
ssh unraid "docker logs keepalived --tail 10"

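# Is the VIP reachable? (ping confirms reachability only, not which machine
# holds the VIP; use the ip addr checks under Test Failover for that)
ping -c 1 192.168.10.250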
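
To answer which machine currently holds the VIP, the address has to be looked up on the hosts themselves. A minimal sketch, assuming the `unraid` and `nobara` SSH aliases and the interface names from the Machines table (the function names are hypothetical):

```shell
#!/bin/sh
# Report which host currently has the VIP configured on its VRRP interface.
VIP="192.168.10.250"

has_vip() {
    # $1 = SSH alias, $2 = interface name
    ssh "$1" "ip -4 -o addr show dev $2" 2>/dev/null | grep -qF " $VIP/"
}

vip_holder() {
    if has_vip unraid br0; then
        echo "unraid"
    elif has_vip nobara enp5s0; then
        echo "nobara"
    else
        echo "none"
        return 1
    fi
}
```

Running `vip_holder` prints `unraid`, `nobara`, or `none`. An unreachable host is treated the same as a host without the VIP.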

Traefik Configuration (Failover)

The Nobara Traefik instance uses a reduced dynamic.yml that defines routers only for the critical web services:

| Router | Domain | Backend |
|---|---|---|
| vaultwarden-secure | vault.xtrm-lab.org | http://vaultwarden:80 |
| authentik-secure | auth.xtrm-lab.org | http://authentik:9000 |
| traefik-secure | traefik.xtrm-lab.org | api@internal |

TLS certificates are shared (copied from Unraid's acme.json + static certs).
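
For orientation, the routers listed above could be expressed in Traefik's file-provider format roughly as follows. This is a sketch only: the entry point name `websecure` and the service names are assumptions, and the real dynamic.yml may also define middlewares and TLS options that are not shown in this document.

```shell
#!/bin/sh
# Write an illustrative reduced dynamic.yml to a scratch file.
# Assumption: "websecure" and the service names are placeholders.
OUT="${OUT:-./dynamic.yml.example}"

cat > "$OUT" <<'EOF'
http:
  routers:
    vaultwarden-secure:
      rule: "Host(`vault.xtrm-lab.org`)"
      entryPoints: ["websecure"]
      service: vaultwarden
      tls: {}
    authentik-secure:
      rule: "Host(`auth.xtrm-lab.org`)"
      entryPoints: ["websecure"]
      service: authentik
      tls: {}
    traefik-secure:
      rule: "Host(`traefik.xtrm-lab.org`)"
      entryPoints: ["websecure"]
      service: api@internal
      tls: {}
  services:
    vaultwarden:
      loadBalancer:
        servers:
          - url: "http://vaultwarden:80"
    authentik:
      loadBalancer:
        servers:
          - url: "http://authentik:9000"
EOF
echo "wrote $OUT"
```

The quoted heredoc delimiter keeps the shell from interpreting the backticks in the `Host()` rules.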


Limitations

  • Data is a point-in-time snapshot. Changes made on Unraid after the last sync are not reflected on Nobara. Re-sync before maintenance.
  • No real-time replication. Vaultwarden passwords saved during failover will not sync back to Unraid automatically.
  • Only critical services replicated. Other services (Plex, Gitea, NetBox, etc.) will be offline during maintenance.
  • External DNS not updated. Failover only works for clients using the local DNS (AdGuard) that resolves to the VIP. External access via Cloudflare will not fail over.

SSH Access

# From Mac to Nobara (passwordless, key-based)
ssh nobara
# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103

# sudo on Nobara requires a password (see the password manager)

Recovery After Maintenance

  1. Bring Unraid back online
  2. Verify all Unraid services are running: docker ps
  3. Keepalived on Unraid will auto-reclaim VIP (preemption)
  4. Stop failover on Nobara: cd /home/failover && sudo docker compose down
  5. If Vaultwarden was used during failover, manually export/import any new entries

Architecture Diagram

                      ┌─────────────────────┐
                      │   192.168.10.250    │
                      │     (VRRP VIP)      │
                      └──────────┬──────────┘
                                 │
                 ┌───────────────┴───────────────┐
                 │                               │
    ┌────────────▼───────────┐      ┌────────────▼───────────┐
    │  XTRM-U (Unraid)       │      │  XTRM-Nobara           │
    │  192.168.10.20         │      │  192.168.10.103        │
    │  MASTER (150)          │      │  BACKUP (100)          │
    │                        │      │                        │
    │  ┌──────────────────┐  │      │  ┌──────────────────┐  │
    │  │  Traefik         │  │      │  │  Traefik         │  │
    │  │  Vaultwarden     │  │      │  │  Vaultwarden     │  │
    │  │  Authentik       │  │      │  │  Authentik       │  │
    │  │  AdGuard         │  │      │  │  AdGuard         │  │
    │  │  + 25 more       │  │      │  │  PostgreSQL      │  │
    │  └──────────────────┘  │      │  │  Redis           │  │
    │                        │      │  └──────────────────┘  │
    │  Keepalived (Docker)   │      │  Keepalived (systemd)  │
    └────────────────────────┘      └────────────────────────┘