infrastructure/docs/10-FAILOVER-NOBARA.md
Kaloyan Danchev ecbce1ca94
Add VRRP failover infrastructure documentation (Nobara)
Deployed automatic failover for critical services (Traefik, Vaultwarden,
Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP
with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:03:26 +02:00

# Failover Infrastructure - Nobara (XTRM-Nobara)

**Last Updated:** 2026-02-13

**Purpose:** Temporary failover for critical services during Unraid maintenance windows.

---
## Overview
A Docker-based replica of critical services runs on the Nobara Linux workstation (XTRM-Nobara) with automatic failover via Keepalived VRRP. When Unraid goes offline, the virtual IP floats to Nobara and services continue operating.
```
Clients → 192.168.10.250 (VIP) → XTRM-U (MASTER, priority 150)
                                 ↓ failover (~4 seconds)
                                 XTRM-Nobara (BACKUP, priority 100)
```
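
The ~4 second figure matches the VRRP takeover arithmetic: a BACKUP declares the MASTER down after three missed advertisements plus a skew that shrinks as its own priority rises. A minimal sketch of that calculation, assuming the 1-second advertisement interval from VRRP Parameters and Nobara's healthy priority of 102 (`master_down_ms` is a hypothetical helper, and the time clients need to learn the new ARP mapping is not included):

```bash
#!/usr/bin/env bash
# Estimate how long a BACKUP waits before taking over the VIP.
#   master_down_interval = 3 * advert_int + skew_time
#   skew_time            = (256 - priority) / 256 * advert_int
# Integer milliseconds are used because bash has no floating point.
master_down_ms() {
  local priority=$1 advert_ms=$2
  local skew_ms=$(( (256 - priority) * advert_ms / 256 ))
  echo $(( 3 * advert_ms + skew_ms ))
}

# BACKUP priority 100 + health-check weight 2, 1000 ms advertisements
master_down_ms 102 1000   # prints 3601 (about 3.6 seconds)
```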
---
## Machines
| Role | Host | IP | Interface | Priority |
|------|------|-----|-----------|----------|
| **MASTER** | XTRM-U (Unraid) | 192.168.10.20 | br0 | 150 |
| **BACKUP** | XTRM-Nobara | 192.168.10.103 | enp5s0 | 100 |
| **VIP** | Shared | 192.168.10.250 | — | — |
---
## Replicated Services
| Service | Image | Ports (Nobara) | Domain |
|---------|-------|----------------|--------|
| **Traefik** | traefik:latest | 80, 443, 8080 | *.xtrm-lab.org |
| **Vaultwarden** | vaultwarden/server:latest | internal:80 | vault.xtrm-lab.org |
| **Authentik** | ghcr.io/goauthentik/server:2025.8.1 | internal:9000 | auth.xtrm-lab.org |
| **Authentik Worker** | ghcr.io/goauthentik/server:2025.8.1 | — | — |
| **PostgreSQL** | postgres:17 | internal:5432 | — |
| **Redis** | redis:7-alpine | internal:6379 | — |
| **AdGuard Home** | adguard/adguardhome:latest | 192.168.10.103:53, 3000 | — |
---
## File Locations
### Nobara (XTRM-Nobara)
| Path | Contents |
|------|----------|
| `/home/failover/docker-compose.yml` | Main compose stack |
| `/home/failover/traefik/` | Traefik config, certs, acme.json |
| `/home/failover/vaultwarden/` | Vaultwarden data (copy from Unraid) |
| `/home/failover/authentik/` | Authentik media & templates |
| `/home/failover/postgres/` | PostgreSQL data + initial dump |
| `/home/failover/redis/` | Redis data |
| `/home/failover/adguard/` | AdGuard conf & work dirs |
| `/etc/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/usr/local/bin/check_failover.sh` | Health check script |
| `/usr/local/bin/failover-notify.sh` | State change notification script |
| `/var/log/keepalived-failover.log` | Failover event log |
### Unraid (XTRM-U)
| Path | Contents |
|------|----------|
| `/mnt/user/appdata/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/mnt/user/appdata/keepalived/check_services.sh` | Health check script |
---
## Keepalived Configuration
### VRRP Parameters
| Parameter | Value |
|-----------|-------|
| Virtual Router ID | 51 |
| Auth Type | PASS |
| Auth Password | xtrm2026 |
| Advertisement Interval | 1 second |
| Health Check Interval | 5 seconds |
| Fail Threshold | 3 missed checks |
| Recovery Threshold | 2 successful checks |
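
As a rough illustration of how the parameters above map onto `keepalived.conf`, the sketch below prints a BACKUP-side config. It is not the deployed file: the names `VI_1` and `chk_services`, the `/24` prefix on the VIP, and the wiring of the two scripts from File Locations into `script` and `notify` are assumptions, and the password is passed in as an argument rather than hard-coded.

```bash
#!/usr/bin/env bash
# Print a keepalived.conf sketch for the BACKUP node.
# Assumed, not taken from the deployed file: VI_1, chk_services, the /24 prefix.
render_backup_conf() {
  local iface=$1 priority=$2 auth_pass=$3
  cat <<EOF
vrrp_script chk_services {
    script "/usr/local/bin/check_failover.sh"
    interval 5    # health check interval (seconds)
    fall 3        # fail threshold
    rise 2        # recovery threshold
    weight 2      # added to priority while the check passes
}

vrrp_instance VI_1 {
    state BACKUP
    interface ${iface}
    virtual_router_id 51
    priority ${priority}
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass ${auth_pass}
    }
    virtual_ipaddress {
        192.168.10.250/24
    }
    track_script {
        chk_services
    }
    notify /usr/local/bin/failover-notify.sh
}
EOF
}
```

For example, `render_backup_conf enp5s0 100 "$VRRP_PASS"` prints a config that can be diffed against `/etc/keepalived/keepalived.conf`; `VRRP_PASS` is a hypothetical variable holding the shared password.
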
### Unraid (MASTER)
- Runs as Docker container: `local/keepalived` (built from alpine + keepalived + curl)
- Priority: 150 (+ health check weight 2 = 152 when healthy)
- Health check: curls `http://localhost:8183/api/overview` (Traefik dashboard)
- Preemption: enabled (will reclaim VIP from Nobara when healthy)
```bash
# Start/stop on Unraid
docker start keepalived
docker stop keepalived
docker logs keepalived
```
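
The health check described above can be sketched as a single `curl` probe. This is an assumption about what `check_services.sh` might contain, not a copy of it; `check_traefik` is a hypothetical name, and the 3-second timeout is chosen only so the probe finishes within the 5-second check interval.

```bash
#!/bin/sh
# Sketch of a Traefik health probe for Keepalived's vrrp_script.
# Exit status 0 = healthy, non-zero = failed check.
check_traefik() {
  url=${1:-http://localhost:8183/api/overview}
  # -f: HTTP errors count as failure, -s: no progress output,
  # --max-time: give up instead of hanging past the check interval
  curl -fs --max-time 3 -o /dev/null "$url"
}
```

A script built from this would call `check_traefik` as its last line so that the probe's exit status becomes the script's exit status.
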
### Nobara (BACKUP)
- Runs as systemd service: `keepalived.service`
- Priority: 100 (+ health check weight 2 = 102 when healthy)
- Health check: verifies Traefik and Vaultwarden containers are running
- `nopreempt` set (won't fight for VIP if Unraid is healthy)
```bash
# Start/stop on Nobara
sudo systemctl start keepalived
sudo systemctl stop keepalived
sudo journalctl -u keepalived -f
```
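
With both nodes described, the priority arithmetic can be checked directly. The sketch assumes keepalived's usual behaviour for a positive `weight` (added to the base priority only while the check passes); `effective_priority` is a hypothetical helper. Under that assumption a failed health check on Unraid leaves it at 150, still above Nobara's 102, so the VIP would move only when Keepalived on Unraid stops or the host goes offline, which is the maintenance scenario this setup targets.

```bash
#!/usr/bin/env bash
# effective_priority BASE WEIGHT HEALTHY(1|0)
# Assumes a positive weight is added only while the health check passes.
effective_priority() {
  local base=$1 weight=$2 healthy=$3
  if [ "$healthy" -eq 1 ]; then
    echo $(( base + weight ))
  else
    echo "$base"
  fi
}

unraid_failed=$(effective_priority 150 2 0)    # 150
nobara_healthy=$(effective_priority 100 2 1)   # 102

if [ "$unraid_failed" -gt "$nobara_healthy" ]; then
  echo "Unraid keeps the VIP (${unraid_failed} > ${nobara_healthy})"
else
  echo "Nobara takes the VIP"
fi
```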
---
## DNS Strategy
**Approach:** Local DNS override via AdGuard Home.
To route traffic through the VIP for internal clients, configure AdGuard DNS rewrite rules to resolve `*.xtrm-lab.org` → `192.168.10.250`. External (Cloudflare) DNS remains pointed at Unraid's public IP.
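
Whether a rewrite is actually in effect can be checked from any client with `dig`. A minimal sketch, assuming the resolver to query is the Nobara AdGuard instance on 192.168.10.103 (pass a different address as the second argument otherwise); `resolves_to_vip` is a hypothetical helper, called for example as `resolves_to_vip vault.xtrm-lab.org && echo "rewrite active"`.

```bash
#!/usr/bin/env bash
# resolves_to_vip NAME [RESOLVER]
# Succeeds if RESOLVER answers NAME with the VIP.
resolves_to_vip() {
  local name=$1 resolver=${2:-192.168.10.103} vip=192.168.10.250
  # dig +short prints one answer per line; grep -x requires an exact line match
  dig +short "$name" "@${resolver}" | grep -qxF "$vip"
}
```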
---
## Operations
### Before Maintenance (Data Sync)
Run these commands from the Mac to sync latest data to Nobara:
```bash
# 1. Sync Vaultwarden data
ssh unraid "tar czf - -C /mnt/user/appdata vaultwarden/" | \
ssh nobara "tar xzf - -C /home/failover/"
# 2. Dump and sync Authentik database
ssh unraid "docker exec postgresql17 pg_dump -U authentik_user authentik_db" | \
ssh nobara "cat > /home/failover/postgres/authentik_dump.sql"
# 3. Sync AdGuard config
ssh unraid "tar czf - -C /mnt/user/appdata/adguardhome conf/ work/" | \
ssh nobara "tar xzf - -C /home/failover/adguard/"
# 4. Sync Traefik config and certs
ssh unraid "tar czf - -C /mnt/user/appdata/traefik traefik.yml dynamic.yml acme.json certs/" | \
ssh nobara "tar xzf - -C /home/failover/traefik/"
```
**Note:** `ssh unraid` = `ssh -i ~/.ssh/id_ed25519_unraid -p 422 root@192.168.10.20`
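
In a pipeline such as step 2, the shell reports only the exit status of the last command, so a failed `pg_dump` can leave an empty or truncated file without any visible error. The sketch below is one way to sanity-check the dump on Nobara; `dump_looks_complete` is a hypothetical helper, and it relies on plain-format `pg_dump` output ending with a "dump complete" comment.

```bash
#!/usr/bin/env bash
# dump_looks_complete FILE
# Succeeds if FILE is non-empty and ends with pg_dump's completion marker.
dump_looks_complete() {
  local file=$1
  [ -s "$file" ] || return 1   # must exist and be non-empty
  # Plain-format dumps close with a "-- PostgreSQL database dump complete" comment
  tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete'
}
```

For example, running `dump_looks_complete /home/failover/postgres/authentik_dump.sql || echo "re-run step 2"` on Nobara flags a dump that should not be trusted.
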
### Start Failover Services
```bash
# On Nobara
cd /home/failover
sudo docker compose up -d
sudo systemctl start keepalived
```
### Stop Failover Services
```bash
# On Nobara
cd /home/failover
sudo docker compose down
sudo systemctl stop keepalived
```
### Test Failover
```bash
# 1. Check VIP location
ssh unraid "ip addr show br0 | grep inet"
ssh nobara "ip addr show enp5s0 | grep inet"
# 2. Simulate Unraid failure
ssh unraid "docker stop keepalived"
# 3. Verify VIP moved to Nobara (wait ~4 seconds)
ssh nobara "ip addr show enp5s0 | grep inet"
# 4. Restore Unraid
ssh unraid "docker start keepalived"
# 5. Verify VIP returned to Unraid
ssh unraid "ip addr show br0 | grep inet"
```
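
Steps 3 and 5 rely on reading `ip addr` output by eye after a fixed wait. A polling variant is sketched below, to be run on the host that should receive the VIP; `has_vip` and `wait_for_vip` are hypothetical helpers, and the 10-second default timeout is an arbitrary choice.

```bash
#!/usr/bin/env bash
# has_vip IFACE: succeeds if the VIP is configured on IFACE of this host
has_vip() {
  # -o prints one address per line, e.g. "2: enp5s0    inet 192.168.10.250/32 ..."
  ip -4 -o addr show dev "$1" | grep -qF ' 192.168.10.250/'
}

# wait_for_vip IFACE [TIMEOUT_SECONDS]: poll once per second until the VIP appears
wait_for_vip() {
  local iface=$1 timeout=${2:-10} waited=0
  until has_vip "$iface"; do
    waited=$(( waited + 1 ))
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
  done
  echo "VIP present on ${iface} after ~${waited}s"
}
```

On Nobara, `wait_for_vip enp5s0` after stopping Keepalived on Unraid gives a rough measurement of the failover time; on Unraid, `wait_for_vip br0` does the same for the return path.
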
### Check Status
```bash
# Nobara service status
ssh nobara "sudo docker ps --format 'table {{.Names}}\t{{.Status}}'"
# Nobara keepalived state
ssh nobara "sudo journalctl -u keepalived -n 10 --no-pager"
# Unraid keepalived state
ssh unraid "docker logs keepalived --tail 10"
# Is the VIP reachable? (ping alone does not show which host holds it;
# use the `ip addr` checks under Test Failover for that)
ping -c 1 192.168.10.250
```
---
## Traefik Configuration (Failover)
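The Nobara Traefik instance has a **reduced** dynamic.yml that defines only the three routers below. AdGuard Home has no Traefik router; it is exposed directly on its own ports (see Replicated Services).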
| Router | Domain | Backend |
|--------|--------|---------|
| vaultwarden-secure | vault.xtrm-lab.org | http://vaultwarden:80 |
| authentik-secure | auth.xtrm-lab.org | http://authentik:9000 |
| traefik-secure | traefik.xtrm-lab.org | api@internal |
TLS certificates are shared (copied from Unraid's acme.json + static certs).
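
Each router in the table can be exercised through the VIP regardless of what local DNS currently returns. A minimal sketch using curl's `--resolve` option to pin the hostname to the VIP; `probe_via_vip` is a hypothetical helper, called for example as `probe_via_vip vault.xtrm-lab.org`, and the status code to expect depends on the service (a redirect or an auth challenge is plausible), so none is asserted here.

```bash
#!/usr/bin/env bash
# probe_via_vip HOSTNAME: print the HTTP status returned through the VIP
probe_via_vip() {
  local host=$1 vip=192.168.10.250
  # --resolve pins HOST:443 to the VIP, bypassing DNS for this request only
  curl -sS --max-time 5 -o /dev/null -w '%{http_code}\n' \
       --resolve "${host}:443:${vip}" "https://${host}/"
}
```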
---
## Limitations
- **Data is a point-in-time snapshot.** Changes made on Unraid after the last sync are not reflected on Nobara. Re-sync before maintenance.
- **No real-time replication.** Vaultwarden passwords saved during failover will not sync back to Unraid automatically.
- **Only critical services replicated.** Other services (Plex, Gitea, NetBox, etc.) will be offline during maintenance.
- **External DNS not updated.** Failover only works for clients that use the local DNS (AdGuard), which resolves to the VIP. External access via Cloudflare will not fail over.
---
## SSH Access
```bash
# From Mac to Nobara (passwordless, key-based)
ssh nobara
# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103
# Sudo on Nobara requires password: (check password manager)
```
---
## Recovery After Maintenance
1. Bring Unraid back online
2. Verify all Unraid services are running: `docker ps`
3. Keepalived on Unraid will auto-reclaim VIP (preemption)
4. Stop failover on Nobara: `cd /home/failover && sudo docker compose down`
5. If Vaultwarden was used during failover, manually export/import any new entries
---
## Architecture Diagram
```
                ┌─────────────────────┐
                │   192.168.10.250    │
                │     (VRRP VIP)      │
                └──────────┬──────────┘
            ┌──────────────┴──────────────┐
            │                             │
┌───────────▼──────────┐      ┌───────────▼──────────┐
│ XTRM-U (Unraid)      │      │ XTRM-Nobara          │
│ 192.168.10.20        │      │ 192.168.10.103       │
│ MASTER (150)         │      │ BACKUP (100)         │
│                      │      │                      │
│ ┌──────────────────┐ │      │ ┌──────────────────┐ │
│ │ Traefik          │ │      │ │ Traefik          │ │
│ │ Vaultwarden      │ │      │ │ Vaultwarden      │ │
│ │ Authentik        │ │      │ │ Authentik        │ │
│ │ AdGuard          │ │      │ │ AdGuard          │ │
│ │ + 25 more        │ │      │ │ PostgreSQL       │ │
│ └──────────────────┘ │      │ │ Redis            │ │
│                      │      │ └──────────────────┘ │
│ Keepalived (Docker)  │      │ Keepalived (systemd) │
└──────────────────────┘      └──────────────────────┘
```