infrastructure/docs/02-SERVICES-CRITICAL.md
Kaloyan Danchev ecbce1ca94
Add VRRP failover infrastructure documentation (Nobara)
Deployed automatic failover for critical services (Traefik, Vaultwarden,
Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP
with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:03:26 +02:00


# Critical Services
**Last Updated:** 2026-02-13
Services that must remain operational for network functionality and security.
---
## Priority Levels
| Priority | Meaning | Recovery Target |
|----------|---------|-----------------|
| **P0** | Network offline without this | < 5 minutes |
| **P1** | Major functionality impacted | < 1 hour |
---
## P0 - Network Core
### DNS (AdGuard Home)
| Instance | Host | IP | Role |
|----------|------|-----|------|
| Primary | HAP1 (container) | 172.17.0.2 | Main DNS |
| Secondary | XTRM-U (macvlan) | 192.168.10.10 | Failover DNS |
**Failover:** Automatic via Netwatch (ping + DNS resolution checks)
**Config Sync:** adguardhome-sync (every 30 min, Unraid → MikroTik)
**Upstream:** Quad9 DoH (`https://dns.quad9.net/dns-query`)
**Web UI:**
- Primary: http://192.168.10.1:3000
- Secondary: http://192.168.10.10:3000
- Credentials: jazzymc / 7RqWElENNbZnPW
**Recovery:**
1. If primary fails → automatic failover to secondary (192.168.10.10)
2. Manual restart of the primary, from the RouterOS terminal on HAP1: `/container start [find name~"adguard"]`
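
After a failover or a manual restart, it can be useful to confirm that each resolver actually answers queries. The following is a minimal sketch, assuming `dig` is installed on the machine running it. The helper names `dns_ok` and `report_dns` and the default test name `example.com` are hypothetical and not part of the existing setup. It could be invoked as `report_dns 192.168.10.1 192.168.10.10`, assuming the primary answers DNS on the router address used for its Web UI.

```bash
# Sketch: succeed if a resolver returns at least one answer for a name.
# dns_ok, report_dns and example.com are illustrative, not part of the setup.
dns_ok() {
  local resolver="$1" name="${2:-example.com}"
  local answer
  # dig exits non-zero when the server does not reply (e.g. on a timeout).
  answer="$(dig +short +time=2 +tries=1 "@${resolver}" "${name}" 2>/dev/null)" || return 1
  # +short prints only the answer records; empty output means no answer.
  [ -n "${answer}" ]
}

# Print one status line per resolver IP passed as an argument.
report_dns() {
  local ip
  for ip in "$@"; do
    if dns_ok "${ip}"; then
      echo "${ip} OK"
    else
      echo "${ip} FAIL"
    fi
  done
}
```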
---
### Routing (HAP1)
| Function | Details |
|----------|---------|
| WAN | 62.73.120.142 via Vivacom fiber |
| VLANs | 10 (Mgmt), 20 (Trusted), 25 (Kids), 30 (IoT), 40 (CatchAll) |
| NAT | Port forwarding to XTRM-U (192.168.10.20) |
| Firewall | RouterOS firewall rules |
**Recovery:**
1. Get physical access to HAP1.
2. Reset the router: hold the reset button for 5 s.
3. Reconfigure via WinBox or SSH (port 2222).
---
### DHCP (HAP1)
| VLAN | Pool | Range |
|------|------|-------|
| 10 (Mgmt) | pool-vlan10 | 192.168.10.100-200 |
| 20 (Trusted) | pool-vlan20 | 192.168.20.100-200 |
| 25 (Kids) | pool-vlan25 | 192.168.25.100-200 |
| 30 (IoT) | pool-vlan30 | 192.168.30.100-200 |
| 40 (CatchAll) | dhcp | 192.168.1.10-254 |
**Lease Time:** 30 minutes
---
## P1 - Authentication & Secrets
### Authentik (SSO)
| Component | IP | Purpose |
|-----------|-----|---------|
| authentik | 172.18.0.11 | Web UI + OIDC |
| authentik-worker | 172.18.0.12 | Background tasks |
**URL:** https://auth.xtrm-lab.org
**Protects:**
- Traefik forward auth (all `*.xtrm-lab.org`)
- Gitea OAuth
- Woodpecker OAuth
- NetBox OAuth
- NetDisco SSO
**Recovery:**
```bash
cd /mnt/user/appdata/authentik
docker compose up -d
```
**Database:** postgresql17 (authentik_db)
---
### Vaultwarden (Passwords)
| Component | IP | Purpose |
|-----------|-----|---------|
| vaultwarden | 172.18.0.15 | Password manager |
**URL:** https://vault.xtrm-lab.org
**Data:** `/mnt/user/appdata/vaultwarden/`
**Recovery:**
```bash
docker start vaultwarden
```
**Backup:** Part of Unraid flash backup
---
## P1 - Reverse Proxy
### Traefik
| Component | IP | Ports |
|-----------|-----|-------|
| traefik | 172.18.0.3 | 8001→80, 44301→443 |
**Config:** `/mnt/user/appdata/traefik/`
- `traefik.yml` - Static config
- `dynamic.yml` - Routers & services
**TLS:** Let's Encrypt wildcard certificate for `*.xtrm-lab.org`
**Recovery:**
```bash
docker start traefik
```
---
## Shared Infrastructure
### PostgreSQL 17
| IP | Databases |
|----|-----------|
| 172.18.0.13 | authentik_db, netbox, gitea, netdisco_db, diode, hydra |
**Data:** `/mnt/user/appdata/postgresql17/`
**Recovery:**
```bash
docker start postgresql17
# Wait for DB to be ready before starting dependents
```
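
The "wait for DB to be ready" step can be made explicit instead of relying on a fixed delay. This is a sketch only: `wait_until` and `postgres_ready` are hypothetical helpers, and it assumes the `postgresql17` container image ships `pg_isready` (the standard PostgreSQL client utility). A possible invocation is `docker start postgresql17 && wait_until 60 postgres_ready && echo "PostgreSQL is ready"`.

```bash
# Sketch: retry a command once per second until it succeeds or the timeout expires.
# wait_until is a hypothetical helper, not an existing tool.
wait_until() {
  local timeout="$1"
  shift
  local elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "${elapsed}" -ge "${timeout}" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# pg_isready exits 0 once the server accepts connections.
# Assumption: the utility is available inside the postgresql17 container.
postgres_ready() {
  docker exec postgresql17 pg_isready -q
}
```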
### Redis
| IP | Consumers |
|----|-----------|
| 172.18.0.14 | Authentik, NetBox, Diode |
**Recovery:**
```bash
docker start Redis
```
---
## Startup Order
When recovering from a full outage, start services in this order:
1. **postgresql17** - Database (wait 30s)
2. **Redis** - Cache/queue (wait 10s)
3. **traefik** - Reverse proxy
4. **authentik** + **authentik-worker** - SSO
5. **vaultwarden** - Passwords
6. All other services
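
The order above can be captured in a small script so that it does not have to be typed from memory during an outage. This is a sketch under assumptions: `start_core` is a hypothetical function name, the container names and wait times are taken from this document, and Authentik is brought up through its compose project in `/mnt/user/appdata/authentik` as in its recovery section. After defining the function, run `start_core` and then start the remaining services.

```bash
# Sketch: start the core containers in the documented order.
# start_core is a hypothetical name; stop at the first failure.
start_core() {
  docker start postgresql17 || return 1
  sleep 30   # documented wait for the database
  docker start Redis || return 1
  sleep 10   # documented wait for the cache/queue
  docker start traefik || return 1
  # Brings up both authentik and authentik-worker from the compose project.
  docker compose --project-directory /mnt/user/appdata/authentik up -d || return 1
  docker start vaultwarden || return 1
}
```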
---
## Monitoring
### Uptime Kuma
| URL | Monitors |
|-----|----------|
| https://uptime.xtrm-lab.org | 27 services |
**Alerts:** Configured per service (email/webhook)
---
## Backup Strategy
| Data | Location | Frequency |
|------|----------|-----------|
| Unraid Flash | Google Drive | Daily |
| PostgreSQL | `/mnt/user/Backup/` | Daily |
| Vaultwarden | Unraid Flash | With flash backup |
| Authentik | PostgreSQL + `/mnt/user/appdata/authentik/` | Daily |
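
How the daily PostgreSQL backup is produced is not described here. One possible shape, offered only as a sketch, is a per-database `pg_dump` written to the backup share. The function name `backup_db`, the `postgres` role, and the `<db>-<date>.sql` file naming are all assumptions rather than the existing job. An example call would be `backup_db authentik_db`.

```bash
# Sketch: dump one database from the postgresql17 container to a dated file.
# Assumptions: role "postgres" exists; file naming is illustrative.
backup_db() {
  local db="$1" dir="${2:-/mnt/user/Backup}"
  local out
  out="${dir}/${db}-$(date +%F).sql"
  # Write to a temporary file first so a failed dump never replaces a good backup.
  if docker exec postgresql17 pg_dump -U postgres "${db}" > "${out}.tmp"; then
    mv "${out}.tmp" "${out}"
  else
    rm -f "${out}.tmp"
    return 1
  fi
}
```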
---
## Active Failover: XTRM-Nobara
Critical services are replicated on the Nobara workstation with automatic VRRP failover:
| Service | Primary (XTRM-U) | Failover (XTRM-Nobara) |
|---------|-------------------|------------------------|
| Traefik | 192.168.10.20 | 192.168.10.103 |
| Vaultwarden | 192.168.10.20 | 192.168.10.103 |
| Authentik | 192.168.10.20 | 192.168.10.103 |
| AdGuard Home | 192.168.10.20 | 192.168.10.103 |
**VIP:** 192.168.10.250 (floats between XTRM-U and XTRM-Nobara via Keepalived VRRP)
**Failover time:** ~4 seconds
See: `10-FAILOVER-NOBARA.md` for full documentation.
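
To see which node currently owns the VIP, a check like the one below can be run on XTRM-U or XTRM-Nobara. It is a sketch, assuming a Linux host with the iproute2 `ip` command; `holds_vip` is a hypothetical helper name and the interface name in the comment is only an example. For instance: `holds_vip && echo "this host holds the VIP"`.

```bash
# Sketch: succeed if this host currently has the VRRP virtual IP assigned.
# holds_vip is a hypothetical helper; the default VIP comes from this document.
holds_vip() {
  local vip="${1:-192.168.10.250}"
  # -o prints one line per address, e.g. "2: br0    inet 192.168.10.250/32 ..."
  # The trailing slash prevents partial matches such as 192.168.10.25.
  ip -4 -o addr show | grep -qF "inet ${vip}/"
}
```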
---
## Future: XTRM-N1 Survival Node
When the hardware upgrade completes, these services will have replicas on XTRM-N1:
| Service | Primary | Replica |
|---------|---------|---------|
| DNS | HAP1 | XTRM-N1 |
| Vaultwarden | XTRM-N5 | XTRM-N1 |
| Authentik | XTRM-N5 | XTRM-N1 |
See: `wip/UPGRADE-2026-HARDWARE.md`