Add VRRP failover infrastructure documentation (Nobara)
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Deployed automatic failover for critical services (Traefik, Vaultwarden, Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
12
CLAUDE.md
@@ -7,6 +7,17 @@ When user says "connect unraid", use this command:
|
||||
ssh -i ~/.ssh/id_ed25519_unraid root@192.168.10.20 -p 422
|
||||
```
|
||||
|
||||
## Connect to Nobara (Failover Node)
|
||||
|
||||
```bash
|
||||
ssh nobara
|
||||
# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103
|
||||
# sudo password: (same as SSH login)
|
||||
```
|
||||
|
||||
Failover stack: `/home/failover/docker-compose.yml`
|
||||
Keepalived: `systemctl status keepalived`
|
||||
|
||||
## Connect to MikroTik HAP ax³
|
||||
|
||||
SSH port is **2222** (not 22):
|
||||
@@ -56,6 +67,7 @@ infrastructure/
|
||||
├── 07-WIFI-CAPSMAN-CONFIG.md # WiFi and CAPsMAN settings
|
||||
├── 08-DNS-ARCHITECTURE.md # DNS failover architecture
|
||||
├── 09-TAILSCALE-VPN.md # Tailscale VPN setup
|
||||
├── 10-FAILOVER-NOBARA.md # VRRP failover to Nobara
|
||||
├── CHANGELOG.md # Change history
|
||||
├── archive/ # Completed/legacy docs
|
||||
│ └── vlan-migration/ # VLAN migration project artifacts
|
||||
|
||||
@@ -15,6 +15,7 @@
|
||||
| **CI/CD** | https://ci.xtrm-lab.org |
|
||||
| **DNS Primary** | dns.xtrm-lab.org |
|
||||
| **DNS Secondary** | dns2.xtrm-lab.org |
|
||||
| **Failover VIP** | 192.168.10.250 |
|
||||
|
||||
---
|
||||
|
||||
@@ -31,6 +32,7 @@ docs/
|
||||
├── 07-WIFI-CAPSMAN-CONFIG.md # WiFi and CAPsMAN settings
|
||||
├── 08-DNS-ARCHITECTURE.md # DNS failover architecture
|
||||
├── 09-TAILSCALE-VPN.md # Tailscale VPN setup
|
||||
├── 10-FAILOVER-NOBARA.md # VRRP failover to Nobara workstation
|
||||
├── CHANGELOG.md # Change history
|
||||
├── archive/ # Completed/legacy docs
|
||||
│ └── vlan-migration/ # VLAN migration project artifacts
|
||||
@@ -46,6 +48,7 @@ docs/
|
||||
|--------|-----|------|
|
||||
| HAP1 | 192.168.10.1 | Router, DNS, WiFi Controller |
|
||||
| XTRM-U | 192.168.10.20 | Production Server (Unraid) |
|
||||
| XTRM-Nobara | 192.168.10.103 | Failover Node (Nobara Linux) |
|
||||
| CSS1 | 192.168.10.3 | Distribution Switch |
|
||||
| ZX1 | 192.168.10.4 | Core Switch (2.5G) |
|
||||
| CAP | 192.168.10.6 | Wireless Access Point |
|
||||
@@ -60,6 +63,9 @@ ssh -i ~/.ssh/id_ed25519_unraid root@192.168.10.20 -p 422
|
||||
|
||||
# MikroTik Router
|
||||
ssh -i ~/.ssh/mikrotik_key -p 2222 xtrm@192.168.10.1
|
||||
|
||||
# Nobara (failover node)
|
||||
ssh nobara
|
||||
```
|
||||
|
||||
---
|
||||
@@ -69,7 +75,8 @@ ssh -i ~/.ssh/mikrotik_key -p 2222 xtrm@192.168.10.1
|
||||
1. **DNS down?** → Automatic failover to 192.168.10.10 (secondary), see `08-DNS-ARCHITECTURE.md`
|
||||
2. **Internet down?** → Check HAP1 at 192.168.10.1
|
||||
3. **Services down?** → Check Unraid at 192.168.10.20
|
||||
4. **Full outage?** → See `02-SERVICES-CRITICAL.md` startup order
|
||||
4. **Unraid maintenance?** → VRRP failover to Nobara (192.168.10.250 VIP), see `10-FAILOVER-NOBARA.md`
|
||||
5. **Full outage?** → See `02-SERVICES-CRITICAL.md` startup order
|
||||
|
||||
---
|
||||
|
||||
|
||||
@@ -204,6 +204,25 @@ When recovering from full outage:
|
||||
|
||||
---
|
||||
|
||||
## Active Failover: XTRM-Nobara
|
||||
|
||||
Critical services are replicated on the Nobara workstation with automatic VRRP failover:
|
||||
|
||||
| Service | Primary (XTRM-U) | Failover (XTRM-Nobara) |
|
||||
|---------|-------------------|------------------------|
|
||||
| Traefik | 192.168.10.20 | 192.168.10.103 |
|
||||
| Vaultwarden | 192.168.10.20 | 192.168.10.103 |
|
||||
| Authentik | 192.168.10.20 | 192.168.10.103 |
|
||||
| AdGuard Home | 192.168.10.20 | 192.168.10.103 |
|
||||
|
||||
**VIP:** 192.168.10.250 (floats between XTRM-U and XTRM-Nobara via Keepalived VRRP)
|
||||
|
||||
**Failover time:** ~4 seconds
|
||||
|
||||
See: `10-FAILOVER-NOBARA.md` for full documentation.
|
||||
|
||||
---
|
||||
|
||||
## Future: XTRM-N1 Survival Node
|
||||
|
||||
When hardware upgrade completes, these services will have replicas on XTRM-N1:
|
||||
|
||||
@@ -160,12 +160,34 @@
|
||||
|
||||
---
|
||||
|
||||
## Workstations
|
||||
|
||||
### XTRM-Nobara | Nobara Linux Workstation
|
||||
|
||||
| Property | Value |
|----------|-------|
| **Role** | Workstation + Failover Node |
| **Location** | Main Bedroom |
| **IP** | 192.168.10.103 |
| **MAC** | 08:92:04:C6:07:C5 |
| **OS** | Nobara Linux (Fedora 43-based) |
| **CPU** | AMD Ryzen 9 6900HX (8C/16T) |
| **RAM** | 16 GB |
| **Storage** | 477GB NVMe (OS) + 1.8TB NVMe (btrfs pool with OS drive) |
| **Network** | enp5s0 (2.5G Ethernet) |
| **Switch Port** | CSS1-20 via PP1 M2 |
| **SSH** | `ssh nobara` (key: ~/.ssh/id_ed25519_nobara) |

**Failover Services:** Traefik, Vaultwarden, Authentik, AdGuard Home
**Keepalived:** systemd service, BACKUP priority 100, VIP 192.168.10.250
|
||||
|
||||
---
|
||||
|
||||
## End Devices (Wired)
|
||||
|
||||
| Device | Room | Outlet | Switch Port | MAC |
|
||||
|--------|------|--------|-------------|-----|
|
||||
| LGTV | Living Room | L3 | CSS1-24 | - |
|
||||
| XTRM-Nobara | Main Bedroom | M2 | CSS1-20 | 08:92:04:C6:07:C5 |
|
||||
| Dell Display | Main Bedroom | M3 | CSS1-21 | - |
|
||||
| Dancho | Boys Room | B1 | CSS1-18 | - |
|
||||
| KVM Switch | - | Direct | CSS1-2 | - |
|
||||
|
||||
276
docs/10-FAILOVER-NOBARA.md
Normal file
@@ -0,0 +1,276 @@
|
||||
# Failover Infrastructure - Nobara (XTRM-Nobara)
|
||||
|
||||
**Last Updated:** 2026-02-13
|
||||
|
||||
**Purpose:** Temporary failover for critical services during Unraid maintenance windows.
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
A Docker-based replica of critical services runs on the Nobara Linux workstation (XTRM-Nobara) with automatic failover via Keepalived VRRP. When Unraid goes offline, the virtual IP floats to Nobara and services continue operating.
|
||||
|
||||
```
Clients → 192.168.10.250 (VIP) → XTRM-U (MASTER, priority 150)
                                     ↓ failover (~4 seconds)
                                 XTRM-Nobara (BACKUP, priority 100)
```
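
The ~4 second figure is consistent with the VRRPv2 master-down timer, which a BACKUP node waits out before taking over: three advertisement intervals plus a priority-based skew. The sketch below assumes VRRPv2 timing (RFC 3768) and the 1-second advertisement interval listed under VRRP Parameters. `master_down_interval` is a hypothetical helper, not part of Keepalived. Any time beyond this timer (ARP announcement, client retries) is not modelled here.

```bash
#!/usr/bin/env bash
# Sketch: VRRPv2 master-down interval, assuming RFC 3768 timing.
#   skew_time            = (256 - priority) / 256
#   master_down_interval = 3 * advert_int + skew_time
# master_down_interval is a hypothetical helper, not part of keepalived.
master_down_interval() {
    local priority="$1" advert_int="$2"
    LC_ALL=C awk -v p="$priority" -v a="$advert_int" \
        'BEGIN { printf "%.3f\n", 3 * a + (256 - p) / 256 }'
}

# Nobara as BACKUP (priority 100, 1-second advertisements):
master_down_interval 100 1   # prints 3.609
```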
|
||||
|
||||
---
|
||||
|
||||
## Machines
|
||||
|
||||
| Role | Host | IP | Interface | Priority |
|------|------|-----|-----------|----------|
| **MASTER** | XTRM-U (Unraid) | 192.168.10.20 | br0 | 150 |
| **BACKUP** | XTRM-Nobara | 192.168.10.103 | enp5s0 | 100 |
| **VIP** | Shared | 192.168.10.250 | — | — |
|
||||
|
||||
---
|
||||
|
||||
## Replicated Services
|
||||
|
||||
| Service | Image | Ports (Nobara) | Domain |
|---------|-------|----------------|--------|
| **Traefik** | traefik:latest | 80, 443, 8080 | *.xtrm-lab.org |
| **Vaultwarden** | vaultwarden/server:latest | internal:80 | vault.xtrm-lab.org |
| **Authentik** | ghcr.io/goauthentik/server:2025.8.1 | internal:9000 | auth.xtrm-lab.org |
| **Authentik Worker** | ghcr.io/goauthentik/server:2025.8.1 | — | — |
| **PostgreSQL** | postgres:17 | internal:5432 | — |
| **Redis** | redis:7-alpine | internal:6379 | — |
| **AdGuard Home** | adguard/adguardhome:latest | 192.168.10.103:53, 3000 | — |
|
||||
|
||||
---
|
||||
|
||||
## File Locations
|
||||
|
||||
### Nobara (XTRM-Nobara)
|
||||
|
||||
| Path | Contents |
|------|----------|
| `/home/failover/docker-compose.yml` | Main compose stack |
| `/home/failover/traefik/` | Traefik config, certs, acme.json |
| `/home/failover/vaultwarden/` | Vaultwarden data (copy from Unraid) |
| `/home/failover/authentik/` | Authentik media & templates |
| `/home/failover/postgres/` | PostgreSQL data + initial dump |
| `/home/failover/redis/` | Redis data |
| `/home/failover/adguard/` | AdGuard conf & work dirs |
| `/etc/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/usr/local/bin/check_failover.sh` | Health check script |
| `/usr/local/bin/failover-notify.sh` | State change notification script |
| `/var/log/keepalived-failover.log` | Failover event log |
|
||||
|
||||
### Unraid (XTRM-U)
|
||||
|
||||
| Path | Contents |
|------|----------|
| `/mnt/user/appdata/keepalived/keepalived.conf` | Keepalived VRRP config |
| `/mnt/user/appdata/keepalived/check_services.sh` | Health check script |
|
||||
|
||||
---
|
||||
|
||||
## Keepalived Configuration
|
||||
|
||||
### VRRP Parameters
|
||||
|
||||
| Parameter | Value |
|-----------|-------|
| Virtual Router ID | 51 |
| Auth Type | PASS |
| Auth Password | xtrm2026 |
| Advertisement Interval | 1 second |
| Health Check Interval | 5 seconds |
| Fail Threshold | 3 missed checks |
| Recovery Threshold | 2 successful checks |
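
As a rough illustration of how these parameters map onto Keepalived directives, the sketch below prints a `keepalived.conf` for either node. It is not the deployed configuration: the block names `chk_services` and `VI_1`, the helper `render_keepalived_conf`, and the `CHANGE_ME` password placeholder are all assumptions, and the real files may set further options (for example a notify script or a netmask on the VIP).

```bash
#!/usr/bin/env bash
# Sketch only: prints a keepalived.conf built from the documented parameters.
# "chk_services", "VI_1" and CHANGE_ME are placeholders, not the real values.
render_keepalived_conf() {
    local state="$1" priority="$2" iface="$3" check_script="$4"
    local nopreempt=""
    # Per this document, only the BACKUP node sets nopreempt.
    [ "$state" = "BACKUP" ] && nopreempt="    nopreempt"
    cat <<EOF
vrrp_script chk_services {
    script "${check_script}"
    interval 5    # health check interval (seconds)
    fall 3        # consecutive failures before the check counts as failed
    rise 2        # consecutive successes before it counts as passing again
    weight 2      # priority bonus while the check passes
}

vrrp_instance VI_1 {
    state ${state}
    interface ${iface}
    virtual_router_id 51
    priority ${priority}
    advert_int 1
${nopreempt}
    authentication {
        auth_type PASS
        auth_pass CHANGE_ME
    }
    virtual_ipaddress {
        192.168.10.250
    }
    track_script {
        chk_services
    }
}
EOF
}

# Nobara (BACKUP) variant:
render_keepalived_conf BACKUP 100 enp5s0 /usr/local/bin/check_failover.sh
```

Calling it with `MASTER 150 br0` and the Unraid check script path would give the corresponding primary-side sketch without `nopreempt`.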
|
||||
|
||||
### Unraid (MASTER)
|
||||
|
||||
- Runs as Docker container: `local/keepalived` (built from alpine + keepalived + curl)
- Priority: 150 (+ health check weight 2 = 152 when healthy)
- Health check: curls `http://localhost:8183/api/overview` (Traefik dashboard)
- Preemption: enabled (will reclaim VIP from Nobara when healthy)

```bash
# Start/stop on Unraid
docker start keepalived
docker stop keepalived
docker logs keepalived
```
|
||||
|
||||
### Nobara (BACKUP)
|
||||
|
||||
- Runs as systemd service: `keepalived.service`
- Priority: 100 (+ health check weight 2 = 102 when healthy)
- Health check: verifies Traefik and Vaultwarden containers are running
- `nopreempt` set (won't fight for VIP if Unraid is healthy)
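
A container-running check of the kind described in the list above could look roughly like this. It is a sketch, not the contents of `/usr/local/bin/check_failover.sh`: the helper name `all_running` and the container names `traefik` and `vaultwarden` are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a container health check; not the actual script.

# Succeeds only if every required name (arguments) appears in the
# newline-separated list of running container names read from stdin.
all_running() {
    local running name
    running="$(cat)"
    for name in "$@"; do
        printf '%s\n' "$running" | grep -qxF -- "$name" || return 1
    done
    return 0
}

# On the host this would be fed by Docker, for example:
#   docker ps --format '{{.Names}}' | all_running traefik vaultwarden
```

Keepalived treats a zero exit status from the tracked script as healthy, so a wrapper around this function only needs to exit with its return code.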
|
||||
|
||||
```bash
# Start/stop on Nobara
sudo systemctl start keepalived
sudo systemctl stop keepalived
sudo journalctl -u keepalived -f
```
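
The priority arithmetic for both nodes can be checked directly. The sketch assumes the health-check weight is applied as a positive bonus while the check passes, as the priority bullets describe; `effective_priority` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: effective VRRP priority with a positive health-check weight.
# effective_priority is a hypothetical helper, not a keepalived command.
effective_priority() {
    local base="$1" weight="$2" healthy="$3"
    if [ "$healthy" = "yes" ]; then
        echo $((base + weight))
    else
        echo "$base"
    fi
}

effective_priority 150 2 yes   # Unraid, check passing
effective_priority 150 2 no    # Unraid, check failing
effective_priority 100 2 yes   # Nobara, check passing
```

Under that assumption a failed check on Unraid leaves it at 150, still above Nobara's 102, so the VIP would move only when Unraid's Keepalived stops advertising (host down or container stopped). It may be worth confirming this against the actual `keepalived.conf`.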
|
||||
|
||||
---
|
||||
|
||||
## DNS Strategy
|
||||
|
||||
**Approach:** Local DNS override via AdGuard Home.
|
||||
|
||||
To route traffic through the VIP for internal clients, configure AdGuard DNS rewrite rules to resolve `*.xtrm-lab.org` → `192.168.10.250`. External (Cloudflare) DNS remains pointed at Unraid's public IP.
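
Once a rewrite rule is in place, a quick way to confirm it from an internal client is to resolve one of the service names and compare the answer with the VIP. The sketch below assumes `dig` is installed on the client; `resolves_to_vip` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: confirm that a hostname resolves to the failover VIP.
VIP="192.168.10.250"

# Reads resolver output (one answer per line, as printed by `dig +short`)
# on stdin and succeeds only if the last answer is the VIP.
# resolves_to_vip is a hypothetical helper name.
resolves_to_vip() {
    local last
    last="$(tail -n 1)"
    [ "$last" = "$VIP" ]
}

# Example, using the client's default resolver:
#   dig +short vault.xtrm-lab.org | resolves_to_vip && echo "via VIP"
```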
|
||||
|
||||
---
|
||||
|
||||
## Operations
|
||||
|
||||
### Before Maintenance (Data Sync)
|
||||
|
||||
Run these commands from the Mac to sync the latest data to Nobara:
|
||||
|
||||
```bash
# 1. Sync Vaultwarden data
ssh unraid "tar czf - -C /mnt/user/appdata vaultwarden/" | \
  ssh nobara "tar xzf - -C /home/failover/"

# 2. Dump and sync Authentik database
ssh unraid "docker exec postgresql17 pg_dump -U authentik_user authentik_db" | \
  ssh nobara "cat > /home/failover/postgres/authentik_dump.sql"

# 3. Sync AdGuard config
ssh unraid "tar czf - -C /mnt/user/appdata/adguardhome conf/ work/" | \
  ssh nobara "tar xzf - -C /home/failover/adguard/"

# 4. Sync Traefik config and certs
ssh unraid "tar czf - -C /mnt/user/appdata/traefik traefik.yml dynamic.yml acme.json certs/" | \
  ssh nobara "tar xzf - -C /home/failover/traefik/"
```
|
||||
|
||||
**Note:** `ssh unraid` = `ssh -i ~/.ssh/id_ed25519_unraid -p 422 root@192.168.10.20`
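
Because each sync step is a pipeline, a failure on the Unraid side can go unnoticed: by default the shell reports only the exit status of the last command. A minimal sketch of a wrapper that enables `pipefail` is shown below. The function `sync_dir` and the `REMOTE` variable are hypothetical; `REMOTE` exists only so the remote runner can be swapped out when trying the function locally.

```bash
#!/usr/bin/env bash
# Sketch: tar-over-ssh sync that fails if either side of the pipe fails.
set -o pipefail

# Command used to run something on a host (hypothetical knob; defaults to ssh).
REMOTE="${REMOTE:-ssh}"

# sync_dir SRC_HOST SRC_PARENT NAME DST_HOST DST_DIR
# Streams SRC_PARENT/NAME from SRC_HOST and unpacks it under DST_DIR on DST_HOST.
sync_dir() {
    local src_host="$1" src_parent="$2" name="$3" dst_host="$4" dst_dir="$5"
    "$REMOTE" "$src_host" "tar czf - -C '$src_parent' '$name'" |
        "$REMOTE" "$dst_host" "tar xzf - -C '$dst_dir'"
}

# Equivalent of step 1 above:
#   sync_dir unraid /mnt/user/appdata vaultwarden nobara /home/failover
```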
|
||||
|
||||
### Start Failover Services
|
||||
|
||||
```bash
# On Nobara
cd /home/failover
sudo docker compose up -d
sudo systemctl start keepalived
```
|
||||
|
||||
### Stop Failover Services
|
||||
|
||||
```bash
# On Nobara
cd /home/failover
sudo docker compose down
sudo systemctl stop keepalived
```
|
||||
|
||||
### Test Failover
|
||||
|
||||
```bash
# 1. Check VIP location
ssh unraid "ip addr show br0 | grep inet"
ssh nobara "ip addr show enp5s0 | grep inet"

# 2. Simulate Unraid failure
ssh unraid "docker stop keepalived"

# 3. Verify VIP moved to Nobara (wait ~4 seconds)
ssh nobara "ip addr show enp5s0 | grep inet"

# 4. Restore Unraid
ssh unraid "docker start keepalived"

# 5. Verify VIP returned to Unraid
ssh unraid "ip addr show br0 | grep inet"
```
|
||||
|
||||
### Check Status
|
||||
|
||||
```bash
# Nobara service status
ssh nobara "sudo docker ps --format 'table {{.Names}}\t{{.Status}}'"

# Nobara keepalived state
ssh nobara "sudo journalctl -u keepalived -n 10 --no-pager"

# Unraid keepalived state
ssh unraid "docker logs keepalived --tail 10"

# Is the VIP reachable? (ping shows reachability only; use the
# `ip addr` checks under Test Failover to see which host holds it)
ping -c 1 192.168.10.250
```
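
To answer "which machine holds the VIP" directly, the interface addresses on each host can be inspected for 192.168.10.250. The sketch below assumes `ip -o -4 addr show` output, where each address appears as `inet ADDRESS/PREFIX`; `holds_vip` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: detect whether a host currently holds the VRRP virtual IP.
VIP="192.168.10.250"

# Reads `ip -o -4 addr show` output on stdin; succeeds if the VIP is assigned.
# holds_vip is a hypothetical helper name.
holds_vip() {
    grep -qF "inet ${VIP}/"
}

# Example:
#   ssh unraid "ip -o -4 addr show dev br0"    | holds_vip && echo "VIP on Unraid"
#   ssh nobara "ip -o -4 addr show dev enp5s0" | holds_vip && echo "VIP on Nobara"
```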
|
||||
|
||||
---
|
||||
|
||||
## Traefik Configuration (Failover)
|
||||
|
||||
The Nobara Traefik instance uses a **reduced** `dynamic.yml` that defines routers only for the critical services:
|
||||
|
||||
| Router | Domain | Backend |
|--------|--------|---------|
| vaultwarden-secure | vault.xtrm-lab.org | http://vaultwarden:80 |
| authentik-secure | auth.xtrm-lab.org | http://authentik:9000 |
| traefik-secure | traefik.xtrm-lab.org | api@internal |
|
||||
|
||||
TLS certificates are shared (copied from Unraid's acme.json + static certs).
|
||||
|
||||
---
|
||||
|
||||
## Limitations
|
||||
|
||||
- **Data is a point-in-time snapshot.** Changes made on Unraid after the last sync are not reflected on Nobara. Re-sync before maintenance.
- **No real-time replication.** Vaultwarden passwords saved during failover will not sync back to Unraid automatically.
- **Only critical services replicated.** Other services (Plex, Gitea, NetBox, etc.) will be offline during maintenance.
- **External DNS not updated.** Failover only works for clients whose local DNS (AdGuard) resolves to the VIP. External access via Cloudflare will not fail over.
|
||||
|
||||
---
|
||||
|
||||
## SSH Access
|
||||
|
||||
```bash
# From Mac to Nobara (passwordless, key-based)
ssh nobara
# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103

# Sudo on Nobara requires password: (check password manager)
```
|
||||
|
||||
---
|
||||
|
||||
## Recovery After Maintenance
|
||||
|
||||
1. Bring Unraid back online
2. Verify all Unraid services are running: `docker ps`
3. Keepalived on Unraid will auto-reclaim VIP (preemption)
4. Stop failover on Nobara: `cd /home/failover && sudo docker compose down`
5. If Vaultwarden was used during failover, manually export/import any new entries
|
||||
|
||||
---
|
||||
|
||||
## Architecture Diagram
|
||||
|
||||
```
               ┌─────────────────────┐
               │   192.168.10.250    │
               │     (VRRP VIP)      │
               └──────────┬──────────┘
                          │
            ┌─────────────┼─────────────┐
            │                           │
┌───────────▼──────────┐    ┌───────────▼──────────┐
│ XTRM-U (Unraid)      │    │ XTRM-Nobara          │
│ 192.168.10.20        │    │ 192.168.10.103       │
│ MASTER (150)         │    │ BACKUP (100)         │
│                      │    │                      │
│ ┌──────────────────┐ │    │ ┌──────────────────┐ │
│ │ Traefik          │ │    │ │ Traefik          │ │
│ │ Vaultwarden      │ │    │ │ Vaultwarden      │ │
│ │ Authentik        │ │    │ │ Authentik        │ │
│ │ AdGuard          │ │    │ │ AdGuard          │ │
│ │ + 25 more        │ │    │ │ PostgreSQL       │ │
│ └──────────────────┘ │    │ │ Redis            │ │
│                      │    │ └──────────────────┘ │
│ Keepalived (Docker)  │    │ Keepalived (systemd) │
└──────────────────────┘    └──────────────────────┘
```
|
||||
@@ -4,6 +4,25 @@
|
||||
|
||||
---
|
||||
|
||||
## 2026-02-13
|
||||
|
||||
### Failover Infrastructure Deployed
|
||||
- **[SERVICE]** Deployed Docker failover stack on XTRM-Nobara (Traefik, Vaultwarden, Authentik, AdGuard Home)
- **[SERVICE]** Installed Docker CE 29.2.1 + Docker Compose 5.0.2 on Nobara
- **[SERVICE]** Deployed Keepalived VRRP for automatic failover (VIP: 192.168.10.250)
- **[SERVICE]** Unraid: Keepalived as Docker container (local/keepalived, MASTER priority 150)
- **[SERVICE]** Nobara: Keepalived as systemd service (BACKUP priority 100)
- **[SERVICE]** Replicated data: Vaultwarden DB, Authentik PostgreSQL dump (864MB), AdGuard config, Traefik certs
- **[NETWORK]** Added VRRP protocol to Nobara firewall (firewalld)
- **[NETWORK]** Configured SSH key auth to Nobara (id_ed25519_nobara, passwordless)
- **[NETWORK]** Added SSH config alias: `ssh nobara`
- **[DOCS]** Created 10-FAILOVER-NOBARA.md with full failover documentation
- **[DOCS]** Updated 02-SERVICES-CRITICAL.md with failover section
- **[DOCS]** Updated 04-HARDWARE-INVENTORY.md with XTRM-Nobara specs
- **[DOCS]** Updated README.md and CLAUDE.md with Nobara references
|
||||
|
||||
---
|
||||
|
||||
## 2026-02-06
|
||||
|
||||
### Unraid Flash Drive Failure
|
||||
|
||||