Add VRRP failover infrastructure documentation (Nobara)

Deployed automatic failover for critical services (Traefik, Vaultwarden,
Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP
with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Kaloyan Danchev
2026-02-13 18:03:26 +02:00
parent d2f49e9130
commit ecbce1ca94
6 changed files with 357 additions and 2 deletions


@@ -204,6 +204,25 @@ When recovering from full outage:
---
## Active Failover: XTRM-Nobara
Critical services are replicated on the Nobara workstation with automatic VRRP failover:
| Service | Primary (XTRM-U) | Failover (XTRM-Nobara) |
|---------|-------------------|------------------------|
| Traefik | 192.168.10.20 | 192.168.10.103 |
| Vaultwarden | 192.168.10.20 | 192.168.10.103 |
| Authentik | 192.168.10.20 | 192.168.10.103 |
| AdGuard Home | 192.168.10.20 | 192.168.10.103 |
**VIP:** 192.168.10.250 (floats between XTRM-U and XTRM-Nobara via Keepalived VRRP)
**Failover time:** ~4 seconds
See: `10-FAILOVER-NOBARA.md` for full documentation.
---
## Future: XTRM-N1 Survival Node
When hardware upgrade completes, these services will have replicas on XTRM-N1:


@@ -160,12 +160,34 @@
---
## Workstations
### XTRM-Nobara | Nobara Linux Workstation
| Property | Value |
|----------|-------|
| **Role** | Workstation + Failover Node |
| **Location** | Main Bedroom |
| **IP** | 192.168.10.103 |
| **MAC** | 08:92:04:C6:07:C5 |
| **OS** | Nobara Linux (Fedora 43-based) |
| **CPU** | AMD Ryzen 9 6900HX (8C/16T) |
| **RAM** | 16 GB |
| **Storage** | 477GB NVMe (OS) + 1.8TB NVMe (btrfs pool with OS drive) |
| **Network** | enp5s0 (2.5G Ethernet) |
| **Switch Port** | CSS1-20 via PP1 M2 |
| **SSH** | `ssh nobara` (key: ~/.ssh/id_ed25519_nobara) |
**Failover Services:** Traefik, Vaultwarden, Authentik, AdGuard Home
**Keepalived:** systemd service, BACKUP priority 100, VIP 192.168.10.250
---
## End Devices (Wired)
| Device | Room | Outlet | Switch Port | MAC |
|--------|------|--------|-------------|-----|
| LGTV | Living Room | L3 | CSS1-24 | - |
| XTRM-Nobara | Main Bedroom | M2 | CSS1-20 | 08:92:04:C6:07:C5 |
| Dell Display | Main Bedroom | M3 | CSS1-21 | - |
| Dancho | Boys Room | B1 | CSS1-18 | - |
| KVM Switch | - | Direct | CSS1-2 | - |

docs/10-FAILOVER-NOBARA.md Normal file

@@ -0,0 +1,276 @@
# Failover Infrastructure - Nobara (XTRM-Nobara)
**Last Updated:** 2026-02-13
**Purpose:** Temporary failover for critical services during Unraid maintenance windows.
---
## Overview
A Docker-based replica of critical services runs on the Nobara Linux workstation (XTRM-Nobara) with automatic failover via Keepalived VRRP. When Unraid goes offline, the virtual IP floats to Nobara and services continue operating.
```
Clients → 192.168.10.250 (VIP) → XTRM-U (MASTER, priority 150)
                                     ↓ failover (~4 seconds)
                                 XTRM-Nobara (BACKUP, priority 100)
```
---
## Machines
| Role | Host | IP | Interface | Priority |
|------|------|-----|-----------|----------|
| **MASTER** | XTRM-U (Unraid) | 192.168.10.20 | br0 | 150 |
| **BACKUP** | XTRM-Nobara | 192.168.10.103 | enp5s0 | 100 |
| **VIP** | Shared | 192.168.10.250 | — | — |
---
## Replicated Services
| Service | Image | Ports (Nobara) | Domain |
|---------|-------|----------------|--------|
| **Traefik** | traefik:latest | 80, 443, 8080 | *.xtrm-lab.org |
| **Vaultwarden** | vaultwarden/server:latest | internal:80 | vault.xtrm-lab.org |
| **Authentik** | ghcr.io/goauthentik/server:2025.8.1 | internal:9000 | auth.xtrm-lab.org |
| **Authentik Worker** | ghcr.io/goauthentik/server:2025.8.1 | — | — |
| **PostgreSQL** | postgres:17 | internal:5432 | — |
| **Redis** | redis:7-alpine | internal:6379 | — |
| **AdGuard Home** | adguard/adguardhome:latest | 192.168.10.103:53, 3000 | — |
---
## File Locations
### Nobara (XTRM-Nobara)
| Path | Contents |
|------|----------|
| `/home/failover/docker-compose.yml` | Main compose stack |
| `/home/failover/traefik/` | Traefik config, certs, acme.json |
| `/home/failover/vaultwarden/` | Vaultwarden data (copy from Unraid) |
| `/home/failover/authentik/` | Authentik media & templates |
| `/home/failover/postgres/` | PostgreSQL data + initial dump |
| `/home/failover/redis/` | Redis data |
| `/home/failover/adguard/` | AdGuard conf & work dirs |
| `/etc/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/usr/local/bin/check_failover.sh` | Health check script |
| `/usr/local/bin/failover-notify.sh` | State change notification script |
| `/var/log/keepalived-failover.log` | Failover event log |
### Unraid (XTRM-U)
| Path | Contents |
|------|----------|
| `/mnt/user/appdata/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/mnt/user/appdata/keepalived/check_services.sh` | Health check script |
---
## Keepalived Configuration
### VRRP Parameters
| Parameter | Value |
|-----------|-------|
| Virtual Router ID | 51 |
| Auth Type | PASS |
| Auth Password | xtrm2026 |
| Advertisement Interval | 1 second |
| Health Check Interval | 5 seconds |
| Fail Threshold | 3 missed checks |
| Recovery Threshold | 2 successful checks |
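As a rough cross-check of the ~4 second figure: Keepalived supports `auth_type PASS` only with VRRP version 2, where a backup declares the master dead after `3 × advert_int + skew` seconds, with `skew = (256 − priority) / 256` (RFC 3768). The sketch below plugs in the values from this document. The function names are illustrative, and the priorities assume the +2 health check weight is added only while the check passes.

```bash
#!/usr/bin/env bash
# VRRPv2 master-down timer: 3 * advert_int + (256 - priority) / 256 seconds.
master_down_interval() {
  local advert_int=$1 priority=$2
  awk -v a="$advert_int" -v p="$priority" \
    'BEGIN { printf "%.2f\n", 3 * a + (256 - p) / 256 }'
}

# Effective priority: base priority plus the script weight while the check passes.
effective_priority() {
  local base=$1 weight=$2 healthy=$3
  if [ "$healthy" = "yes" ]; then
    echo $((base + weight))
  else
    echo "$base"
  fi
}

nobara=$(effective_priority 100 2 yes)
echo "Nobara master-down timer: $(master_down_interval 1 "$nobara") s"
echo "XTRM-U priority with a failing check: $(effective_priority 150 2 no)"
echo "Nobara priority with a passing check: $nobara"
```

By this arithmetic, a failing health check on XTRM-U only lowers its priority from 152 to 150, which still outranks Nobara at 102. If the weights are configured as documented, the VIP would therefore move only when XTRM-U stops advertising (Keepalived container or host down), not on a failed health check alone. This is worth confirming against the deployed `keepalived.conf`.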
### Unraid (MASTER)
- Runs as Docker container: `local/keepalived` (built from alpine + keepalived + curl)
- Priority: 150 (+ health check weight 2 = 152 when healthy)
- Health check: curls `http://localhost:8183/api/overview` (Traefik dashboard)
- Preemption: enabled (will reclaim VIP from Nobara when healthy)
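
A minimal sketch of what the Unraid health check could look like, assuming it treats any HTTP error or timeout as a failure. The function name, the `--run` switch, and the 3-second timeout are illustrative; the real `check_services.sh` may differ.

```bash
#!/bin/sh
# Hypothetical sketch of check_services.sh; exit status 0 means "healthy".
TRAEFIK_URL="${TRAEFIK_URL:-http://localhost:8183/api/overview}"

check_traefik() {
  # -f: fail on HTTP errors, -sS: quiet but still print errors,
  # --max-time: hard timeout in seconds (assumed value)
  curl -fsS --max-time 3 -o /dev/null "$TRAEFIK_URL"
}

# Only run the check when called with --run, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
  check_traefik
fi
```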
```bash
# Start/stop on Unraid
docker start keepalived
docker stop keepalived
docker logs keepalived
```
### Nobara (BACKUP)
- Runs as systemd service: `keepalived.service`
- Priority: 100 (+ health check weight 2 = 102 when healthy)
- Health check: verifies Traefik and Vaultwarden containers are running
- `nopreempt` set (won't fight for VIP if Unraid is healthy)
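
The container check described above can be sketched with `docker inspect`. This is an illustration rather than the deployed `check_failover.sh`: the function name is made up, and the container names are passed as arguments because the actual names are not listed in this document.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of check_failover.sh.
# Usage (container names are assumptions): check_failover.sh traefik vaultwarden
containers_running() {
  local name state
  for name in "$@"; do
    # docker inspect prints "true" for a running container and
    # exits non-zero if the container does not exist.
    state=$(docker inspect -f '{{.State.Running}}' "$name" 2>/dev/null) || return 1
    [ "$state" = "true" ] || return 1
  done
}

# With no arguments the script does nothing, which keeps it safe to source.
[ "$#" -eq 0 ] || containers_running "$@"
```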
```bash
# Start/stop on Nobara
sudo systemctl start keepalived
sudo systemctl stop keepalived
sudo journalctl -u keepalived -f
```
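
Putting the parameters from this section together, the Nobara configuration might look roughly like the following. This is a sketch, not a copy of the deployed file: the instance and script names (`VI_1`, `chk_failover`), the `/24` prefix on the VIP, and the `AUTH_PASS` placeholder are assumptions. It is written as a shell function so the output can be reviewed or redirected as needed.

```bash
#!/usr/bin/env bash
# Prints a candidate keepalived.conf for XTRM-Nobara (assumed values are marked).
print_keepalived_conf() {
  local auth_pass=${AUTH_PASS:-changeme}   # placeholder; must match the MASTER
  cat <<EOF
vrrp_script chk_failover {                 # name is illustrative
    script "/usr/local/bin/check_failover.sh"
    interval 5                             # health check interval (seconds)
    fall 3                                 # fail threshold
    rise 2                                 # recovery threshold
    weight 2                               # +2 priority while healthy
}

vrrp_instance VI_1 {                       # name is illustrative
    state BACKUP
    interface enp5s0
    virtual_router_id 51
    priority 100
    advert_int 1
    nopreempt
    authentication {
        auth_type PASS
        auth_pass ${auth_pass}
    }
    virtual_ipaddress {
        192.168.10.250/24                  # /24 prefix is an assumption
    }
    track_script {
        chk_failover
    }
    notify /usr/local/bin/failover-notify.sh
}
EOF
}

print_keepalived_conf
```

For example, `AUTH_PASS=... print_keepalived_conf > keepalived.conf.candidate` writes the result to a scratch file for comparison with `/etc/keepalived/keepalived.conf`.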
---
## DNS Strategy
**Approach:** Local DNS override via AdGuard Home.
To route traffic through the VIP for internal clients, configure AdGuard DNS rewrite rules to resolve `*.xtrm-lab.org` to `192.168.10.250`. External (Cloudflare) DNS remains pointed at Unraid's public IP.
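
One way to confirm the rewrite is in effect is to query an AdGuard instance directly and compare the answer with the VIP. The sketch below assumes `dig` is installed on the client; the function name is illustrative, and the example queries the Nobara AdGuard instance on 192.168.10.103.

```bash
#!/usr/bin/env bash
VIP=192.168.10.250

# Succeeds if <hostname> resolves to the VIP when asked via <resolver>.
resolves_to_vip() {
  local host=$1 resolver=$2
  dig +short "$host" A "@$resolver" | grep -qxF "$VIP"
}

# Example:
#   resolves_to_vip vault.xtrm-lab.org 192.168.10.103 && echo "rewrite active"
```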
---
## Operations
### Before Maintenance (Data Sync)
Run these commands from the Mac to sync the latest data to Nobara:
```bash
# 1. Sync Vaultwarden data
ssh unraid "tar czf - -C /mnt/user/appdata vaultwarden/" | \
ssh nobara "tar xzf - -C /home/failover/"
# 2. Dump and sync Authentik database
ssh unraid "docker exec postgresql17 pg_dump -U authentik_user authentik_db" | \
ssh nobara "cat > /home/failover/postgres/authentik_dump.sql"
# 3. Sync AdGuard config
ssh unraid "tar czf - -C /mnt/user/appdata/adguardhome conf/ work/" | \
ssh nobara "tar xzf - -C /home/failover/adguard/"
# 4. Sync Traefik config and certs
ssh unraid "tar czf - -C /mnt/user/appdata/traefik traefik.yml dynamic.yml acme.json certs/" | \
ssh nobara "tar xzf - -C /home/failover/traefik/"
```
**Note:** `ssh unraid` = `ssh -i ~/.ssh/id_ed25519_unraid -p 422 root@192.168.10.20`
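
For reference, the two aliases correspond to `~/.ssh/config` entries along these lines. The `unraid` values come from the note above and the `nobara` values from the SSH Access section; the sketch prints the entries rather than editing the file in place.

```bash
#!/usr/bin/env bash
# Prints Host entries for the two aliases; append to ~/.ssh/config after review.
print_ssh_aliases() {
  cat <<'EOF'
Host unraid
    HostName 192.168.10.20
    Port 422
    User root
    IdentityFile ~/.ssh/id_ed25519_unraid

Host nobara
    HostName 192.168.10.103
    User jazzymc
    IdentityFile ~/.ssh/id_ed25519_nobara
EOF
}

print_ssh_aliases
```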
### Start Failover Services
```bash
# On Nobara
cd /home/failover
sudo docker compose up -d
sudo systemctl start keepalived
```
### Stop Failover Services
```bash
# On Nobara
cd /home/failover
sudo docker compose down
sudo systemctl stop keepalived
```
### Test Failover
```bash
# 1. Check VIP location
ssh unraid "ip addr show br0 | grep inet"
ssh nobara "ip addr show enp5s0 | grep inet"
# 2. Simulate Unraid failure
ssh unraid "docker stop keepalived"
# 3. Verify VIP moved to Nobara (wait ~4 seconds)
ssh nobara "ip addr show enp5s0 | grep inet"
# 4. Restore Unraid
ssh unraid "docker start keepalived"
# 5. Verify VIP returned to Unraid
ssh unraid "ip addr show br0 | grep inet"
```
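
To put a number on the failover time rather than estimating it, the move can be timed by polling for the VIP. A sketch meant to run on Nobara, assuming bash, iproute2, and GNU `date`; the function names are illustrative and the 0.2 s poll interval is arbitrary.

```bash
#!/usr/bin/env bash
VIP=192.168.10.250
IFACE=enp5s0

# True if the VIP is currently configured on the interface.
vip_present() {
  ip -4 -o addr show dev "$IFACE" | grep -qF " $VIP/"
}

# Current time in milliseconds (relies on GNU date's %N).
now_ms() {
  echo $(( $(date +%s%N) / 1000000 ))
}

# Poll until the VIP appears or <timeout> seconds pass.
# Prints the elapsed milliseconds on success, returns 1 on timeout.
wait_for_vip() {
  local timeout_ms=$(( $1 * 1000 )) start now
  start=$(now_ms)
  until vip_present; do
    now=$(now_ms)
    (( now - start >= timeout_ms )) && return 1
    sleep 0.2
  done
  echo $(( $(now_ms) - start ))
}

# Example: start `wait_for_vip 15` on Nobara, then stop Keepalived on Unraid.
```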
### Check Status
```bash
# Nobara service status
ssh nobara "sudo docker ps --format 'table {{.Names}}\t{{.Status}}'"
# Nobara keepalived state
ssh nobara "sudo journalctl -u keepalived -n 10 --no-pager"
# Unraid keepalived state
ssh unraid "docker logs keepalived --tail 10"
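# Is the VIP reachable? (ping alone does not show which host answers)
ping -c 1 192.168.10.250
# Which machine holds the VIP?
ssh unraid "ip addr show br0 | grep -F 192.168.10.250"
ssh nobara "ip addr show enp5s0 | grep -F 192.168.10.250"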
```
---
## Traefik Configuration (Failover)
The Nobara Traefik instance has a **reduced** dynamic.yml that defines only three routers:
| Router | Domain | Backend |
|--------|--------|---------|
| vaultwarden-secure | vault.xtrm-lab.org | http://vaultwarden:80 |
| authentik-secure | auth.xtrm-lab.org | http://authentik:9000 |
| traefik-secure | traefik.xtrm-lab.org | api@internal |
TLS certificates are shared (copied from Unraid's acme.json + static certs).
---
## Limitations
- **Data is a point-in-time snapshot.** Changes made on Unraid after the last sync are not reflected on Nobara. Re-sync before maintenance.
- **No real-time replication.** Vaultwarden passwords saved during failover will not sync back to Unraid automatically.
- **Only critical services replicated.** Other services (Plex, Gitea, NetBox, etc.) will be offline during maintenance.
- **External DNS not updated.** Failover only works for clients that use the local DNS (AdGuard), which resolves to the VIP. External access via Cloudflare will not fail over.
---
## SSH Access
```bash
# From Mac to Nobara (passwordless, key-based)
ssh nobara
# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103
# Sudo on Nobara requires password: (check password manager)
```
---
## Recovery After Maintenance
1. Bring Unraid back online
2. Verify all Unraid services are running: `docker ps`
3. Keepalived on Unraid will auto-reclaim VIP (preemption)
4. Stop failover on Nobara: `cd /home/failover && sudo docker compose down`
5. If Vaultwarden was used during failover, manually export/import any new entries
---
## Architecture Diagram
```
                  ┌─────────────────────┐
                  │   192.168.10.250    │
                  │     (VRRP VIP)      │
                  └─────────┬───────────┘
            ┌───────────────┼───────────────┐
            │                               │
┌───────────▼──────────┐        ┌───────────▼──────────┐
│ XTRM-U (Unraid)      │        │ XTRM-Nobara          │
│ 192.168.10.20        │        │ 192.168.10.103       │
│ MASTER (150)         │        │ BACKUP (100)         │
│                      │        │                      │
│   ┌──────────────┐   │        │   ┌──────────────┐   │
│   │ Traefik      │   │        │   │ Traefik      │   │
│   │ Vaultwarden  │   │        │   │ Vaultwarden  │   │
│   │ Authentik    │   │        │   │ Authentik    │   │
│   │ AdGuard      │   │        │   │ AdGuard      │   │
│   │ + 25 more    │   │        │   │ PostgreSQL   │   │
│   └──────────────┘   │        │   │ Redis        │   │
│                      │        │   └──────────────┘   │
│ Keepalived (Docker)  │        │ Keepalived (systemd) │
└──────────────────────┘        └──────────────────────┘
```


@@ -4,6 +4,25 @@
---
## 2026-02-13
### Failover Infrastructure Deployed
- **[SERVICE]** Deployed Docker failover stack on XTRM-Nobara (Traefik, Vaultwarden, Authentik, AdGuard Home)
- **[SERVICE]** Installed Docker CE 29.2.1 + Docker Compose 5.0.2 on Nobara
- **[SERVICE]** Deployed Keepalived VRRP for automatic failover (VIP: 192.168.10.250)
- **[SERVICE]** Unraid: Keepalived as Docker container (local/keepalived, MASTER priority 150)
- **[SERVICE]** Nobara: Keepalived as systemd service (BACKUP priority 100)
- **[SERVICE]** Replicated data: Vaultwarden DB, Authentik PostgreSQL dump (864MB), AdGuard config, Traefik certs
- **[NETWORK]** Added VRRP protocol to Nobara firewall (firewalld)
- **[NETWORK]** Configured SSH key auth to Nobara (id_ed25519_nobara, passwordless)
- **[NETWORK]** Added SSH config alias: `ssh nobara`
- **[DOCS]** Created 10-FAILOVER-NOBARA.md with full failover documentation
- **[DOCS]** Updated 02-SERVICES-CRITICAL.md with failover section
- **[DOCS]** Updated 04-HARDWARE-INVENTORY.md with XTRM-Nobara specs
- **[DOCS]** Updated README.md and CLAUDE.md with Nobara references
---
## 2026-02-06
### Unraid Flash Drive Failure