# Failover Infrastructure - Nobara (XTRM-Nobara)

**Last Updated:** 2026-02-13

**Purpose:** Temporary failover for critical services during Unraid maintenance windows.

---

## Overview

A Docker-based replica of critical services runs on the Nobara Linux workstation (XTRM-Nobara) with automatic failover via Keepalived VRRP. When Unraid goes offline, the virtual IP floats to Nobara and services continue operating.

```
Clients → 192.168.10.250 (VIP) → XTRM-U (MASTER, priority 150)
                                      ↓ failover (~4 seconds)
                                 XTRM-Nobara (BACKUP, priority 100)
```
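The ~4 second figure in the diagram is consistent with standard VRRP timing. As a rough sketch, assuming keepalived runs VRRPv2 (its default), the 1-second advertisement interval listed under VRRP Parameters, and the +2 health check weight described under Keepalived Configuration, a backup declares the master dead after three advertisement intervals plus a priority-dependent skew. The helper names below are illustrative and not part of the deployed scripts. With these numbers, a failing health check on Unraid alone would leave it at 150, still above Nobara's 102, so the VIP would move only when Unraid's advertisements stop; this is worth confirming against the real `keepalived.conf`.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, values come from this document.

# Effective priority: base priority plus the health check weight
# while the check passes (positive weight).
effective_priority() {
  local base=$1 weight=$2 healthy=$3
  if [ "$healthy" = "yes" ]; then
    echo $((base + weight))
  else
    echo "$base"
  fi
}

# VRRPv2 master-down interval in seconds:
#   3 * advert_int + skew, where skew = (256 - priority) / 256
master_down_interval() {
  local priority=$1 advert_int=$2
  awk -v p="$priority" -v a="$advert_int" \
    'BEGIN { printf "%.2f\n", 3 * a + (256 - p) / 256 }'
}

effective_priority 150 2 yes   # Unraid, check passing -> 152
effective_priority 150 2 no    # Unraid, check failing -> 150
effective_priority 100 2 yes   # Nobara, check passing -> 102
master_down_interval 102 1     # Nobara's timeout      -> 3.60
```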

---

## Machines

| Role | Host | IP | Interface | Priority |
|------|------|-----|-----------|----------|
| **MASTER** | XTRM-U (Unraid) | 192.168.10.20 | br0 | 150 |
| **BACKUP** | XTRM-Nobara | 192.168.10.103 | enp5s0 | 100 |
| **VIP** | Shared | 192.168.10.250 | - | - |

---

## Replicated Services

| Service | Image | Ports (Nobara) | Domain |
|---------|-------|----------------|--------|
| **Traefik** | traefik:latest | 80, 443, 8080 | *.xtrm-lab.org |
| **Vaultwarden** | vaultwarden/server:latest | internal:80 | vault.xtrm-lab.org |
| **Authentik** | ghcr.io/goauthentik/server:2025.8.1 | internal:9000 | auth.xtrm-lab.org |
| **Authentik Worker** | ghcr.io/goauthentik/server:2025.8.1 | - | - |
| **PostgreSQL** | postgres:17 | internal:5432 | - |
| **Redis** | redis:7-alpine | internal:6379 | - |
| **AdGuard Home** | adguard/adguardhome:latest | 192.168.10.103:53, 3000 | - |

---

## File Locations

### Nobara (XTRM-Nobara)

| Path | Contents |
|------|----------|
| `/home/failover/docker-compose.yml` | Main compose stack |
| `/home/failover/traefik/` | Traefik config, certs, acme.json |
| `/home/failover/vaultwarden/` | Vaultwarden data (copy from Unraid) |
| `/home/failover/authentik/` | Authentik media & templates |
| `/home/failover/postgres/` | PostgreSQL data + initial dump |
| `/home/failover/redis/` | Redis data |
| `/home/failover/adguard/` | AdGuard conf & work dirs |
| `/etc/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/usr/local/bin/check_failover.sh` | Health check script |
| `/usr/local/bin/failover-notify.sh` | State change notification script |
| `/var/log/keepalived-failover.log` | Failover event log |

### Unraid (XTRM-U)

| Path | Contents |
|------|----------|
| `/mnt/user/appdata/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/mnt/user/appdata/keepalived/check_services.sh` | Health check script |

---

## Keepalived Configuration

### VRRP Parameters

| Parameter | Value |
|-----------|-------|
| Virtual Router ID | 51 |
| Auth Type | PASS |
| Auth Password | xtrm2026 |
| Advertisement Interval | 1 second |
| Health Check Interval | 5 seconds |
| Fail Threshold | 3 missed checks |
| Recovery Threshold | 2 successful checks |

### Unraid (MASTER)

- Runs as Docker container: `local/keepalived` (built from alpine + keepalived + curl)
- Priority: 150 (+ health check weight 2 = 152 when healthy)
- Health check: curls `http://localhost:8183/api/overview` (Traefik dashboard)
- Preemption: enabled (will reclaim VIP from Nobara when healthy)

```bash
# Start/stop on Unraid
docker start keepalived
docker stop keepalived
docker logs keepalived
```
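The health check script itself is not reproduced in this document. A minimal sketch of what `check_services.sh` might contain, assuming the URL above and an arbitrary 3-second timeout (the function name and the `CHECK_URL` variable are illustrative):

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a keepalived health check; not the deployed script.
CHECK_URL="${CHECK_URL:-http://localhost:8183/api/overview}"

# Succeeds (exit 0) only if the Traefik API answers with a non-error status.
traefik_healthy() {
  # -f: fail on HTTP >= 400, -sS: silent but still print errors,
  # --max-time 3: assumed timeout so a hung Traefik counts as a failure
  curl -fsS --max-time 3 -o /dev/null "$CHECK_URL"
}

# A real check script would end with a bare call so that keepalived
# sees the function's exit status:
#   traefik_healthy
```

keepalived treats exit status 0 from a tracked script as healthy and any non-zero status as a failed check.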

### Nobara (BACKUP)

- Runs as systemd service: `keepalived.service`
- Priority: 100 (+ health check weight 2 = 102 when healthy)
- Health check: verifies Traefik and Vaultwarden containers are running
- `nopreempt` set (won't fight for VIP if Unraid is healthy)

```bash
# Start/stop on Nobara
sudo systemctl start keepalived
sudo systemctl stop keepalived
sudo journalctl -u keepalived -f
```
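The Nobara check script is likewise not shown here. One way to express "Traefik and Vaultwarden containers are running", assuming the containers are named `traefik` and `vaultwarden` (the names and the `REQUIRED_CONTAINERS` variable are assumptions; the real names come from `docker-compose.yml`):

```bash
#!/usr/bin/env bash
# Hypothetical sketch; container names are assumed, not taken from the stack.
REQUIRED_CONTAINERS="${REQUIRED_CONTAINERS:-traefik vaultwarden}"

# Succeeds only if every required container appears in `docker ps`,
# which lists running containers only. Needs permission to talk to Docker.
containers_running() {
  local running name
  running="$(docker ps --format '{{.Names}}')" || return 1
  for name in $REQUIRED_CONTAINERS; do
    # -F fixed string, -x whole line: "traefik" must not match "traefik-old"
    printf '%s\n' "$running" | grep -Fxq -- "$name" || return 1
  done
}

# A real check script would end with a bare call:
#   containers_running
```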

---

## DNS Strategy

**Approach:** Local DNS override via AdGuard Home.

To route traffic through the VIP for internal clients, configure AdGuard DNS rewrite rules to resolve `*.xtrm-lab.org` → `192.168.10.250`. External (Cloudflare) DNS remains pointed at Unraid's public IP.

---

## Operations

### Before Maintenance (Data Sync)

Run these commands from the Mac to sync latest data to Nobara:

```bash
# 1. Sync Vaultwarden data
ssh unraid "tar czf - -C /mnt/user/appdata vaultwarden/" | \
  ssh nobara "tar xzf - -C /home/failover/"

# 2. Dump and sync Authentik database
ssh unraid "docker exec postgresql17 pg_dump -U authentik_user authentik_db" | \
  ssh nobara "cat > /home/failover/postgres/authentik_dump.sql"

# 3. Sync AdGuard config
ssh unraid "tar czf - -C /mnt/user/appdata/adguardhome conf/ work/" | \
  ssh nobara "tar xzf - -C /home/failover/adguard/"

# 4. Sync Traefik config and certs
ssh unraid "tar czf - -C /mnt/user/appdata/traefik traefik.yml dynamic.yml acme.json certs/" | \
  ssh nobara "tar xzf - -C /home/failover/traefik/"
```
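One caveat with the pipelines above: by default a shell pipeline reports only the exit status of its last command, so a failed `tar` or `pg_dump` on the Unraid side can go unnoticed and leave a truncated copy on Nobara. A possible wrapper, assuming bash and the same `unraid` and `nobara` SSH aliases (the `sync_dir` helper is illustrative, not part of the existing tooling):

```bash
#!/usr/bin/env bash
# With pipefail, a pipeline fails if any command in it fails.
set -o pipefail

# sync_dir SRC_PARENT NAME DEST_PARENT
# Streams SRC_PARENT/NAME from Unraid and unpacks it under DEST_PARENT on Nobara.
sync_dir() {
  local src_parent=$1 name=$2 dest_parent=$3
  ssh unraid "tar czf - -C '$src_parent' '$name'" |
    ssh nobara "tar xzf - -C '$dest_parent'"
}

# Example, equivalent to step 1:
#   sync_dir /mnt/user/appdata vaultwarden/ /home/failover/ \
#     || echo "Vaultwarden sync failed" >&2
```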

**Note:** `ssh unraid` = `ssh -i ~/.ssh/id_ed25519_unraid -p 422 root@192.168.10.20`

### Start Failover Services

```bash
# On Nobara
cd /home/failover
sudo docker compose up -d
sudo systemctl start keepalived
```
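Once the stack is up, it can be worth confirming that the Nobara AdGuard instance answers with the VIP, as described under DNS Strategy. A small sketch, assuming `dig` is installed and that the rewrite rule has been synced to Nobara (`resolves_to_vip` is an illustrative helper):

```bash
#!/usr/bin/env bash
VIP="192.168.10.250"

# resolves_to_vip NAME DNS_SERVER
# Succeeds if DNS_SERVER returns the VIP as one of the answers for NAME.
resolves_to_vip() {
  local name=$1 server=$2
  dig +short "$name" "@$server" | grep -Fxq -- "$VIP"
}

# Example: query the AdGuard instance on Nobara directly
#   resolves_to_vip vault.xtrm-lab.org 192.168.10.103 && echo "rewrite OK"
```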

### Stop Failover Services

```bash
# On Nobara
cd /home/failover
sudo docker compose down
sudo systemctl stop keepalived
```

### Test Failover

```bash
# 1. Check VIP location
ssh unraid "ip addr show br0 | grep inet"
ssh nobara "ip addr show enp5s0 | grep inet"

# 2. Simulate Unraid failure
ssh unraid "docker stop keepalived"

# 3. Verify VIP moved to Nobara (wait ~4 seconds)
ssh nobara "ip addr show enp5s0 | grep inet"

# 4. Restore Unraid
ssh unraid "docker start keepalived"

# 5. Verify VIP returned to Unraid
ssh unraid "ip addr show br0 | grep inet"
```
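The `grep inet` checks above print every address on the interface, which then has to be read by eye. A slightly stricter check, sketched here with an illustrative `has_vip` helper, looks only for the VIP:

```bash
#!/usr/bin/env bash
VIP="192.168.10.250"

# Reads `ip -4 addr show` output on stdin; succeeds if the VIP is assigned.
# The trailing "/" keeps the match from succeeding on a longer address.
has_vip() {
  grep -Fq -- "inet ${VIP}/"
}

# Example:
#   ssh nobara "ip -4 addr show dev enp5s0" | has_vip && echo "Nobara holds the VIP"
#   ssh unraid "ip -4 addr show dev br0"    | has_vip && echo "Unraid holds the VIP"
```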

### Check Status

```bash
# Nobara service status
ssh nobara "sudo docker ps --format 'table {{.Names}}\t{{.Status}}'"

# Nobara keepalived state
ssh nobara "sudo journalctl -u keepalived -n 10 --no-pager"

# Unraid keepalived state
ssh unraid "docker logs keepalived --tail 10"

# Is the VIP reachable? (ping alone does not show which machine holds it;
# use the `ip addr` checks under Test Failover for that)
ping -c 1 192.168.10.250
```
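Failover events are also written to `/var/log/keepalived-failover.log` by the notify script, whose contents are not shown in this document. A minimal sketch of such a script, relying on keepalived's convention of calling `notify` scripts with four arguments (type, name, new state, priority); the log line format, the function names, and the `VI_1` instance name in the example are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a keepalived notify script; not the deployed one.
LOG_FILE="${LOG_FILE:-/var/log/keepalived-failover.log}"

# format_event TIMESTAMP TYPE NAME STATE PRIORITY
format_event() {
  printf '%s %s %s -> %s (priority %s)\n' "$1" "$2" "$3" "$4" "$5"
}

# log_event TYPE NAME STATE PRIORITY
# Appends one timestamped line per state change to LOG_FILE.
log_event() {
  format_event "$(date '+%Y-%m-%dT%H:%M:%S%z')" "$@" >> "$LOG_FILE"
}

# A real notify script would end by forwarding keepalived's arguments:
#   log_event "$@"
# Example line for a transition of instance VI_1 to MASTER:
#   format_event 2026-02-13T10:00:00+0000 INSTANCE VI_1 MASTER 102
```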

---

## Traefik Configuration (Failover)

The Nobara Traefik instance uses a **reduced** `dynamic.yml` that defines routers only for the critical web services:

| Router | Domain | Backend |
|--------|--------|---------|
| vaultwarden-secure | vault.xtrm-lab.org | http://vaultwarden:80 |
| authentik-secure | auth.xtrm-lab.org | http://authentik:9000 |
| traefik-secure | traefik.xtrm-lab.org | api@internal |

TLS certificates are shared (copied from Unraid's acme.json + static certs).

---

## Limitations

- **Data is a point-in-time snapshot.** Changes made on Unraid after the last sync are not reflected on Nobara. Re-sync before maintenance.
- **No real-time replication.** Vaultwarden passwords saved during failover will not sync back to Unraid automatically.
- **Only critical services replicated.** Other services (Plex, Gitea, NetBox, etc.) will be offline during maintenance.
- **External DNS not updated.** Failover only works for clients using the local DNS (AdGuard) that resolves to the VIP. External access via Cloudflare will not fail over.

---

## SSH Access

```bash
# From Mac to Nobara (passwordless, key-based)
ssh nobara
# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103

# sudo on Nobara requires a password (check password manager)
```

---

## Recovery After Maintenance

1. Bring Unraid back online
2. Verify all Unraid services are running: `docker ps`
3. Keepalived on Unraid will auto-reclaim VIP (preemption)
4. Stop failover on Nobara: `cd /home/failover && sudo docker compose down`
5. If Vaultwarden was used during failover, manually export/import any new entries

---

## Architecture Diagram

```
               ┌─────────────────────┐
               │   192.168.10.250    │
               │     (VRRP VIP)      │
               └──────────┬──────────┘
                          │
            ┌─────────────┼─────────────┐
            │                           │
┌───────────▼──────────┐    ┌───────────▼──────────┐
│ XTRM-U (Unraid)      │    │ XTRM-Nobara          │
│ 192.168.10.20        │    │ 192.168.10.103       │
│ MASTER (150)         │    │ BACKUP (100)         │
│                      │    │                      │
│  ┌────────────────┐  │    │  ┌────────────────┐  │
│  │ Traefik        │  │    │  │ Traefik        │  │
│  │ Vaultwarden    │  │    │  │ Vaultwarden    │  │
│  │ Authentik      │  │    │  │ Authentik      │  │
│  │ AdGuard        │  │    │  │ AdGuard        │  │
│  │ + 25 more      │  │    │  │ PostgreSQL     │  │
│  └────────────────┘  │    │  │ Redis          │  │
│                      │    │  └────────────────┘  │
│ Keepalived (Docker)  │    │ Keepalived (systemd) │
└──────────────────────┘    └──────────────────────┘
```