diff --git a/CLAUDE.md b/CLAUDE.md
index a942e99..addc811 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -7,6 +7,17 @@ When user says "connect unraid", use this command:
 ssh -i ~/.ssh/id_ed25519_unraid root@192.168.10.20 -p 422
 ```
 
+## Connect to Nobara (Failover Node)
+
+```bash
+ssh nobara
+# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103
+# sudo password: (same as SSH login)
+```
+
+Failover stack: `/home/failover/docker-compose.yml`
+Keepalived: `systemctl status keepalived`
+
 ## Connect to MikroTik HAP ax³
 
 SSH port is **2222** (not 22):
@@ -56,6 +67,7 @@ infrastructure/
  ├── 07-WIFI-CAPSMAN-CONFIG.md # WiFi and CAPsMAN settings
  ├── 08-DNS-ARCHITECTURE.md # DNS failover architecture
  ├── 09-TAILSCALE-VPN.md # Tailscale VPN setup
+ ├── 10-FAILOVER-NOBARA.md # VRRP failover to Nobara
  ├── CHANGELOG.md # Change history
  ├── archive/ # Completed/legacy docs
  │   └── vlan-migration/ # VLAN migration project artifacts
diff --git a/README.md b/README.md
index bd79984..f40242a 100644
--- a/README.md
+++ b/README.md
@@ -15,6 +15,7 @@
 | **CI/CD** | https://ci.xtrm-lab.org |
 | **DNS Primary** | dns.xtrm-lab.org |
 | **DNS Secondary** | dns2.xtrm-lab.org |
+| **Failover VIP** | 192.168.10.250 |
 
 ---
 
@@ -31,6 +32,7 @@ docs/
 ├── 07-WIFI-CAPSMAN-CONFIG.md # WiFi and CAPsMAN settings
 ├── 08-DNS-ARCHITECTURE.md # DNS failover architecture
 ├── 09-TAILSCALE-VPN.md # Tailscale VPN setup
+├── 10-FAILOVER-NOBARA.md # VRRP failover to Nobara workstation
 ├── CHANGELOG.md # Change history
 ├── archive/ # Completed/legacy docs
 │   └── vlan-migration/ # VLAN migration project artifacts
@@ -46,6 +48,7 @@ docs/
 |--------|-----|------|
 | HAP1 | 192.168.10.1 | Router, DNS, WiFi Controller |
 | XTRM-U | 192.168.10.20 | Production Server (Unraid) |
+| XTRM-Nobara | 192.168.10.103 | Failover Node (Nobara Linux) |
 | CSS1 | 192.168.10.3 | Distribution Switch |
 | ZX1 | 192.168.10.4 | Core Switch (2.5G) |
 | CAP | 192.168.10.6 | Wireless Access Point |
@@ -60,6 +63,9 @@ ssh -i ~/.ssh/id_ed25519_unraid root@192.168.10.20 -p 422
 
 # MikroTik Router
 ssh -i ~/.ssh/mikrotik_key -p 2222 xtrm@192.168.10.1
+
+# Nobara (failover node)
+ssh nobara
 ```
 
 ---
@@ -69,7 +75,8 @@ ssh -i ~/.ssh/mikrotik_key -p 2222 xtrm@192.168.10.1
 1. **DNS down?** → Automatic failover to 192.168.10.10 (secondary), see `08-DNS-ARCHITECTURE.md`
 2. **Internet down?** → Check HAP1 at 192.168.10.1
 3. **Services down?** → Check Unraid at 192.168.10.20
-4. **Full outage?** → See `02-SERVICES-CRITICAL.md` startup order
+4. **Unraid maintenance?** → VRRP failover to Nobara (192.168.10.250 VIP), see `10-FAILOVER-NOBARA.md`
+5. **Full outage?** → See `02-SERVICES-CRITICAL.md` startup order
 
 ---
 
diff --git a/docs/02-SERVICES-CRITICAL.md b/docs/02-SERVICES-CRITICAL.md
index d14f71d..c994dbe 100644
--- a/docs/02-SERVICES-CRITICAL.md
+++ b/docs/02-SERVICES-CRITICAL.md
@@ -204,6 +204,25 @@ When recovering from full outage:
 
 ---
 
+## Active Failover: XTRM-Nobara
+
+Critical services are replicated on the Nobara workstation with automatic VRRP failover:
+
+| Service | Primary (XTRM-U) | Failover (XTRM-Nobara) |
+|---------|-------------------|------------------------|
+| Traefik | 192.168.10.20 | 192.168.10.103 |
+| Vaultwarden | 192.168.10.20 | 192.168.10.103 |
+| Authentik | 192.168.10.20 | 192.168.10.103 |
+| AdGuard Home | 192.168.10.20 | 192.168.10.103 |
+
+**VIP:** 192.168.10.250 (floats between XTRM-U and XTRM-Nobara via Keepalived VRRP)
+
+**Failover time:** ~4 seconds
+
+See: `10-FAILOVER-NOBARA.md` for full documentation.
+
+---
+
 ## Future: XTRM-N1 Survival Node
 
 When hardware upgrade completes, these services will have replicas on XTRM-N1:
diff --git a/docs/04-HARDWARE-INVENTORY.md b/docs/04-HARDWARE-INVENTORY.md
index 22cb18a..3ff7f4c 100644
--- a/docs/04-HARDWARE-INVENTORY.md
+++ b/docs/04-HARDWARE-INVENTORY.md
@@ -160,12 +160,34 @@
 
 ---
 
+## Workstations
+
+### XTRM-Nobara | Nobara Linux Workstation
+
+| Property | Value |
+|----------|-------|
+| **Role** | Workstation + Failover Node |
+| **Location** | Main Bedroom |
+| **IP** | 192.168.10.103 |
+| **MAC** | 08:92:04:C6:07:C5 |
+| **OS** | Nobara Linux (Fedora 43-based) |
+| **CPU** | AMD Ryzen 9 6900HX (8C/16T) |
+| **RAM** | 16 GB |
+| **Storage** | 477 GB NVMe (OS) + 1.8 TB NVMe (btrfs pool with OS drive) |
+| **Network** | enp5s0 (2.5G Ethernet) |
+| **Switch Port** | CSS1-20 via PP1 M2 |
+| **SSH** | `ssh nobara` (key: ~/.ssh/id_ed25519_nobara) |
+
+**Failover Services:** Traefik, Vaultwarden, Authentik, AdGuard Home
+**Keepalived:** systemd service, BACKUP priority 100, VIP 192.168.10.250
+
+---
+
 ## End Devices (Wired)
 
 | Device | Room | Outlet | Switch Port | MAC |
 |--------|------|--------|-------------|-----|
 | LGTV | Living Room | L3 | CSS1-24 | - |
-| XTRM-Nobara | Main Bedroom | M2 | CSS1-20 | 08:92:04:C6:07:C5 |
 | Dell Display | Main Bedroom | M3 | CSS1-21 | - |
 | Dancho | Boys Room | B1 | CSS1-18 | - |
 | KVM Switch | - | Direct | CSS1-2 | - |
diff --git a/docs/10-FAILOVER-NOBARA.md b/docs/10-FAILOVER-NOBARA.md
new file mode 100644
index 0000000..ce16694
--- /dev/null
+++ b/docs/10-FAILOVER-NOBARA.md
@@ -0,0 +1,276 @@
+# Failover Infrastructure - Nobara (XTRM-Nobara)
+
+**Last Updated:** 2026-02-13
+
+**Purpose:** Temporary failover for critical services during Unraid maintenance windows.
+
+---
+
+## Overview
+
+A Docker-based replica of critical services runs on the Nobara Linux workstation (XTRM-Nobara) with automatic failover via Keepalived VRRP. When Unraid goes offline, the virtual IP floats to Nobara and services continue operating.
+
+```
+Clients → 192.168.10.250 (VIP) → XTRM-U (MASTER, priority 150)
+                                 ↓ failover (~4 seconds)
+                                 XTRM-Nobara (BACKUP, priority 100)
+```
+
+---
+
+## Machines
+
+| Role | Host | IP | Interface | Priority |
+|------|------|-----|-----------|----------|
+| **MASTER** | XTRM-U (Unraid) | 192.168.10.20 | br0 | 150 |
+| **BACKUP** | XTRM-Nobara | 192.168.10.103 | enp5s0 | 100 |
+| **VIP** | Shared | 192.168.10.250 | — | — |
+
+---
+
+## Replicated Services
+
+| Service | Image | Ports (Nobara) | Domain |
+|---------|-------|----------------|--------|
+| **Traefik** | traefik:latest | 80, 443, 8080 | *.xtrm-lab.org |
+| **Vaultwarden** | vaultwarden/server:latest | internal:80 | vault.xtrm-lab.org |
+| **Authentik** | ghcr.io/goauthentik/server:2025.8.1 | internal:9000 | auth.xtrm-lab.org |
+| **Authentik Worker** | ghcr.io/goauthentik/server:2025.8.1 | — | — |
+| **PostgreSQL** | postgres:17 | internal:5432 | — |
+| **Redis** | redis:7-alpine | internal:6379 | — |
+| **AdGuard Home** | adguard/adguardhome:latest | 192.168.10.103:53, 3000 | — |
+
+---
+
+## File Locations
+
+### Nobara (XTRM-Nobara)
+
+| Path | Contents |
+|------|----------|
+| `/home/failover/docker-compose.yml` | Main compose stack |
+| `/home/failover/traefik/` | Traefik config, certs, acme.json |
+| `/home/failover/vaultwarden/` | Vaultwarden data (copy from Unraid) |
+| `/home/failover/authentik/` | Authentik media & templates |
+| `/home/failover/postgres/` | PostgreSQL data + initial dump |
+| `/home/failover/redis/` | Redis data |
+| `/home/failover/adguard/` | AdGuard conf & work dirs |
+| `/etc/keepalived/keepalived.conf` | Keepalived VRRP config |
+| `/usr/local/bin/check_failover.sh` | Health check script |
+| `/usr/local/bin/failover-notify.sh` | State change notification script |
+| `/var/log/keepalived-failover.log` | Failover event log |
+
+### Unraid (XTRM-U)
+
+| Path | Contents |
+|------|----------|
+| `/mnt/user/appdata/keepalived/keepalived.conf` | Keepalived VRRP config |
+| `/mnt/user/appdata/keepalived/check_services.sh` | Health check script |
+
+---
+
+## Keepalived Configuration
+
+### VRRP Parameters
+
+| Parameter | Value |
+|-----------|-------|
+| Virtual Router ID | 51 |
+| Auth Type | PASS |
+| Auth Password | xtrm2026 |
+| Advertisement Interval | 1 second |
+| Health Check Interval | 5 seconds |
+| Fail Threshold | 3 missed checks |
+| Recovery Threshold | 2 successful checks |
+
+### Unraid (MASTER)
+
+- Runs as Docker container: `local/keepalived` (built from alpine + keepalived + curl)
+- Priority: 150 (+ health check weight 2 = 152 when healthy)
+- Health check: curls `http://localhost:8183/api/overview` (Traefik dashboard)
+- Preemption: enabled (will reclaim the VIP from Nobara when healthy)
+
+```bash
+# Start/stop on Unraid
+docker start keepalived
+docker stop keepalived
+docker logs keepalived
+```
+
+### Nobara (BACKUP)
+
+- Runs as systemd service: `keepalived.service`
+- Priority: 100 (+ health check weight 2 = 102 when healthy)
+- Health check: verifies Traefik and Vaultwarden containers are running
+- `nopreempt` set (won't fight for the VIP if Unraid is healthy)
+
+```bash
+# Start/stop on Nobara
+sudo systemctl start keepalived
+sudo systemctl stop keepalived
+sudo journalctl -u keepalived -f
+```
+
+---
+
+## DNS Strategy
+
+**Approach:** Local DNS override via AdGuard Home.
+
+To route traffic through the VIP for internal clients, configure AdGuard DNS rewrite rules to resolve `*.xtrm-lab.org` → `192.168.10.250`. External (Cloudflare) DNS remains pointed at Unraid's public IP.
+
+---
+
+## Operations
+
+### Before Maintenance (Data Sync)
+
+Run these commands from the Mac to sync the latest data to Nobara:
+
+```bash
+# 1. Sync Vaultwarden data
+ssh unraid "tar czf - -C /mnt/user/appdata vaultwarden/" | \
+  ssh nobara "tar xzf - -C /home/failover/"
+
+# 2. Dump and sync Authentik database
+ssh unraid "docker exec postgresql17 pg_dump -U authentik_user authentik_db" | \
+  ssh nobara "cat > /home/failover/postgres/authentik_dump.sql"
+
+# 3. Sync AdGuard config
+ssh unraid "tar czf - -C /mnt/user/appdata/adguardhome conf/ work/" | \
+  ssh nobara "tar xzf - -C /home/failover/adguard/"
+
+# 4. Sync Traefik config and certs
+ssh unraid "tar czf - -C /mnt/user/appdata/traefik traefik.yml dynamic.yml acme.json certs/" | \
+  ssh nobara "tar xzf - -C /home/failover/traefik/"
+```
+
+**Note:** `ssh unraid` = `ssh -i ~/.ssh/id_ed25519_unraid -p 422 root@192.168.10.20`
+
+### Start Failover Services
+
+```bash
+# On Nobara
+cd /home/failover
+sudo docker compose up -d
+sudo systemctl start keepalived
+```
+
+### Stop Failover Services
+
+```bash
+# On Nobara
+cd /home/failover
+sudo docker compose down
+sudo systemctl stop keepalived
+```
+
+### Test Failover
+
+```bash
+# 1. Check VIP location
+ssh unraid "ip addr show br0 | grep inet"
+ssh nobara "ip addr show enp5s0 | grep inet"
+
+# 2. Simulate Unraid failure
+ssh unraid "docker stop keepalived"
+
+# 3. Verify VIP moved to Nobara (wait ~4 seconds)
+ssh nobara "ip addr show enp5s0 | grep inet"
+
+# 4. Restore Unraid
+ssh unraid "docker start keepalived"
+
+# 5. Verify VIP returned to Unraid
+ssh unraid "ip addr show br0 | grep inet"
+```
+
+### Check Status
+
+```bash
+# Nobara service status
+ssh nobara "sudo docker ps --format 'table {{.Names}}\t{{.Status}}'"
+
+# Nobara keepalived state
+ssh nobara "sudo journalctl -u keepalived -n 10 --no-pager"
+
+# Unraid keepalived state
+ssh unraid "docker logs keepalived --tail 10"
+
+# Is the VIP reachable? (ping does not show which machine holds it; use the `ip addr` checks under Test Failover)
+ping -c 1 192.168.10.250
+```
+
+---
+
+## Traefik Configuration (Failover)
+
+The Nobara Traefik instance has a **reduced** dynamic.yml that defines only the three routers needed during failover:
+
+| Router | Domain | Backend |
+|--------|--------|---------|
+| vaultwarden-secure | vault.xtrm-lab.org | http://vaultwarden:80 |
+| authentik-secure | auth.xtrm-lab.org | http://authentik:9000 |
+| traefik-secure | traefik.xtrm-lab.org | api@internal |
+
+TLS certificates are shared (copied from Unraid's acme.json + static certs).
+
+---
+
+## Limitations
+
+- **Data is a point-in-time snapshot.** Changes made on Unraid after the last sync are not reflected on Nobara. Re-sync before maintenance.
+- **No real-time replication.** Vaultwarden passwords saved during failover will not sync back to Unraid automatically.
+- **Only critical services replicated.** Other services (Plex, Gitea, NetBox, etc.) will be offline during maintenance.
+- **External DNS not updated.** Failover only works for clients using the local DNS (AdGuard) that resolves to the VIP. External access via Cloudflare will not fail over.
+
+---
+
+## SSH Access
+
+```bash
+# From Mac to Nobara (passwordless, key-based)
+ssh nobara
+# or: ssh -i ~/.ssh/id_ed25519_nobara jazzymc@192.168.10.103
+
+# Sudo on Nobara requires password: (check password manager)
+```
+
+---
+
+## Recovery After Maintenance
+
+1. Bring Unraid back online
+2. Verify all Unraid services are running: `docker ps`
+3. Keepalived on Unraid will auto-reclaim the VIP (preemption)
+4. Stop failover on Nobara: `cd /home/failover && sudo docker compose down`
+5. If Vaultwarden was used during failover, manually export/import any new entries
+
+---
+
+## Architecture Diagram
+
+```
+                ┌─────────────────────┐
+                │   192.168.10.250    │
+                │     (VRRP VIP)      │
+                └──────────┬──────────┘
+                           │
+            ┌──────────────┼──────────────┐
+            │                             │
+┌───────────▼──────────┐      ┌───────────▼──────────┐
+│ XTRM-U (Unraid)      │      │ XTRM-Nobara          │
+│ 192.168.10.20        │      │ 192.168.10.103       │
+│ MASTER (150)         │      │ BACKUP (100)         │
+│                      │      │                      │
+│ ┌──────────────┐     │      │ ┌──────────────┐     │
+│ │ Traefik      │     │      │ │ Traefik      │     │
+│ │ Vaultwarden  │     │      │ │ Vaultwarden  │     │
+│ │ Authentik    │     │      │ │ Authentik    │     │
+│ │ AdGuard      │     │      │ │ AdGuard      │     │
+│ │ + 25 more    │     │      │ │ PostgreSQL   │     │
+│ └──────────────┘     │      │ │ Redis        │     │
+│                      │      │ └──────────────┘     │
+│ Keepalived (Docker)  │      │ Keepalived (systemd) │
+└──────────────────────┘      └──────────────────────┘
+```
diff --git a/docs/CHANGELOG.md b/docs/CHANGELOG.md
index caed8f9..cf593d1 100644
--- a/docs/CHANGELOG.md
+++ b/docs/CHANGELOG.md
@@ -4,6 +4,25 @@
 
 ---
 
+## 2026-02-13
+
+### Failover Infrastructure Deployed
+- **[SERVICE]** Deployed Docker failover stack on XTRM-Nobara (Traefik, Vaultwarden, Authentik, AdGuard Home)
+- **[SERVICE]** Installed Docker CE 29.2.1 + Docker Compose 5.0.2 on Nobara
+- **[SERVICE]** Deployed Keepalived VRRP for automatic failover (VIP: 192.168.10.250)
+- **[SERVICE]** Unraid: Keepalived as Docker container (local/keepalived, MASTER priority 150)
+- **[SERVICE]** Nobara: Keepalived as systemd service (BACKUP priority 100)
+- **[SERVICE]** Replicated data: Vaultwarden DB, Authentik PostgreSQL dump (864MB), AdGuard config, Traefik certs
+- **[NETWORK]** Added VRRP protocol to Nobara firewall (firewalld)
+- **[NETWORK]** Configured SSH key auth to Nobara (id_ed25519_nobara, passwordless)
+- **[NETWORK]** Added SSH config alias: `ssh nobara`
+- **[DOCS]** Created 10-FAILOVER-NOBARA.md with full failover documentation
+- **[DOCS]** Updated 02-SERVICES-CRITICAL.md with failover section
+- **[DOCS]** Updated 04-HARDWARE-INVENTORY.md with XTRM-Nobara specs
+- **[DOCS]** Updated README.md and CLAUDE.md with Nobara references
+
+---
+
 ## 2026-02-06
 
 ### Unraid Flash Drive Failure
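
The failover time of roughly 4 seconds quoted in `10-FAILOVER-NOBARA.md` is consistent with the standard VRRP timers: a backup declares the master down after three advertisement intervals plus a priority-dependent skew. The sketch below reproduces that arithmetic in integer milliseconds for the documented 1-second advertisement interval. `master_down_ms` is a hypothetical helper written for this note, not a keepalived command, and the result excludes any extra time clients need to learn the VIP's new MAC address.

```bash
#!/usr/bin/env bash
# Sketch: VRRP Master_Down_Interval in milliseconds.
# master_down_ms is a hypothetical helper, not part of keepalived.
master_down_ms() {
  local priority=$1 advert_ms=$2
  # Skew_Time = ((256 - priority) * advertisement interval) / 256
  local skew_ms=$(( (256 - priority) * advert_ms / 256 ))
  # Master_Down_Interval = 3 * advertisement interval + Skew_Time
  echo $(( 3 * advert_ms + skew_ms ))
}

master_down_ms 102 1000   # Nobara with a passing health check (100 + weight 2)
master_down_ms 100 1000   # Nobara with a failing health check
```

Both cases come out at about 3.6 seconds, so Nobara should take over the VIP a little under 4 seconds after Unraid's advertisements stop.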
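
A successful ping to 192.168.10.250 shows that some node answers for the VIP, not which one. The following is a minimal sketch of a holder check, assuming keepalived adds the VIP as an ordinary IPv4 address that `ip -4 -o addr show` lists as `inet 192.168.10.250/<prefix>`. `holds_vip` is a hypothetical helper name; the `unraid` and `nobara` SSH aliases are the ones used in the docs above.

```bash
#!/usr/bin/env bash
# Sketch: succeed if the `ip -4 -o addr show` output on stdin lists the given VIP.
# holds_vip is a hypothetical helper, not an existing tool.
holds_vip() {
  local vip=$1
  # Fixed-string match; the "inet " prefix and trailing "/" stop
  # 192.168.10.25 from matching 192.168.10.250 and vice versa.
  grep -Fq "inet ${vip}/"
}

# Usage (run from a machine that has both SSH aliases configured):
#   for host in unraid nobara; do
#     ssh "$host" "ip -4 -o addr show" | holds_vip 192.168.10.250 \
#       && echo "$host holds the VIP"
#   done
```

Reading the address list on each node avoids relying on ARP caches, which can briefly point at the previous holder right after a failover.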