infrastructure/docs/02-SERVICES-CRITICAL.md

# Critical Services

**Last Updated:** 2026-01-31

Services that must remain operational for network functionality and security.

---

## Priority Levels

| Priority | Meaning | Recovery Target |
|----------|---------|-----------------|
| **P0** | Network offline without this | < 5 minutes |
| **P1** | Major functionality impacted | < 1 hour |

---

## P0 - Network Core

### DNS (AdGuard Home)

| Instance | Host | IP | Role |
|----------|------|-----|------|
| Primary | HAP1 (container) | 172.17.0.2 | Main DNS |
| Secondary | XTRM-U (macvlan) | 192.168.10.10 | Failover DNS |

**Failover:** Automatic via Netwatch (ping + DNS resolution checks)

**Config Sync:** adguardhome-sync (every 30 min, Unraid → MikroTik)

**Upstream:** Quad9 DoH (`https://dns.quad9.net/dns-query`)

**Web UI:**
- Primary: http://192.168.10.1:3000
- Secondary: http://192.168.10.10:3000
- Credentials: jazzymc / 7RqWElENNbZnPW

**Recovery:**
1. If primary fails → automatic failover to secondary (192.168.10.10)
2. Manual restart: `/container start [find name~"adguard"]`
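
To confirm which instance is actually answering after a failover, each resolver can be queried directly. The following is a minimal sketch, not part of the existing tooling: `check_dns` is a hypothetical helper, it assumes `dig` is installed on the machine running it, and the default test name `example.com` is an arbitrary choice.

```bash
# Sketch: report whether a DNS resolver answers a test query.
# check_dns is a hypothetical helper, not an AdGuard Home or RouterOS command.
check_dns() {
  local server="$1"
  local name="${2:-example.com}"   # assumption: any publicly resolvable name works
  local answer
  # +short prints only the answer records; +time/+tries keep the check fast.
  answer="$(dig +short +time=2 +tries=1 "@${server}" "${name}" 2>/dev/null)" || answer=""
  case "${answer}" in
    # Empty output, or dig's ";; ..." diagnostic text, counts as a failure.
    ""|";;"*)
      echo "FAIL ${server}"
      return 1
      ;;
    *)
      echo "OK ${server}"
      ;;
  esac
}
```

For example, `check_dns 192.168.10.10` tests the secondary. The address clients use for the primary is not stated in the table above, so pass whichever address the router serves DNS on.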

---

### Routing (HAP1)

| Function | Details |
|----------|---------|
| WAN | 62.73.120.142 via Vivacom fiber |
| VLANs | 10 (Mgmt), 20 (Trusted), 25 (Kids), 30 (IoT), 40 (CatchAll) |
| NAT | Port forwarding to XTRM-U (192.168.10.20) |
| Firewall | RouterOS firewall rules |

**Recovery:**
1. Physical access to HAP1
2. Reset: hold the reset button for 5 seconds while powering on to reset the RouterOS configuration
3. Reconfigure via WinBox or SSH (port 2222)

---

### DHCP (HAP1)

| VLAN | Pool | Range |
|------|------|-------|
| 10 (Mgmt) | pool-vlan10 | 192.168.10.100-200 |
| 20 (Trusted) | pool-vlan20 | 192.168.20.100-200 |
| 25 (Kids) | pool-vlan25 | 192.168.25.100-200 |
| 30 (IoT) | pool-vlan30 | 192.168.30.100-200 |
| 40 (CatchAll) | dhcp | 192.168.1.10-254 |

**Lease Time:** 30 minutes

---

## P1 - Authentication & Secrets

### Authentik (SSO)

| Component | IP | Purpose |
|-----------|-----|---------|
| authentik | 172.18.0.11 | Web UI + OIDC |
| authentik-worker | 172.18.0.12 | Background tasks |

**URL:** https://auth.xtrm-lab.org

**Protects:**
- Traefik forward auth (all *.xtrm-lab.org)
- Gitea OAuth
- Woodpecker OAuth
- NetBox OAuth
- NetDisco SSO

**Recovery:**
```bash
cd /mnt/user/appdata/authentik
docker compose up -d
```

**Database:** postgresql17 (authentik_db)
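
After bringing the stack up, it can help to wait until Authentik reports ready before testing logins. The sketch below polls a URL until it returns a successful HTTP status. `wait_for_http` is a hypothetical helper; the `/-/health/ready/` path is Authentik's readiness endpoint, and the sketch assumes it is reachable through Traefik at the public URL.

```bash
# Sketch: poll an HTTP endpoint until it answers successfully or attempts run out.
# wait_for_http is a hypothetical helper; attempts and delay defaults are assumptions.
wait_for_http() {
  local url="$1"
  local attempts="${2:-30}"
  local delay="${3:-2}"
  local i=1
  while [ "${i}" -le "${attempts}" ]; do
    # -f: treat HTTP errors (>= 400) as failure; -s: silent; body is discarded.
    if curl -fs -o /dev/null --max-time 5 "${url}"; then
      echo "ready after ${i} attempt(s)"
      return 0
    fi
    i=$((i + 1))
    sleep "${delay}"
  done
  echo "not ready after ${attempts} attempt(s)"
  return 1
}
```

A possible invocation is `wait_for_http https://auth.xtrm-lab.org/-/health/ready/`. Because the request goes through the reverse proxy, a failure here can also mean Traefik is down rather than Authentik.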

---

### Vaultwarden (Passwords)

| Component | IP | Purpose |
|-----------|-----|---------|
| vaultwarden | 172.18.0.15 | Password manager |

**URL:** https://vault.xtrm-lab.org

**Data:** `/mnt/user/appdata/vaultwarden/`

**Recovery:**
```bash
docker start vaultwarden
```

**Backup:** Part of Unraid flash backup

---

## P1 - Reverse Proxy

### Traefik

| Component | IP | Ports |
|-----------|-----|-------|
| traefik | 172.18.0.3 | 8001→80, 44301→443 |

**Config:** `/mnt/user/appdata/traefik/`
- `traefik.yml` - Static config
- `dynamic.yml` - Routers & services

**TLS:** Let's Encrypt wildcard for *.xtrm-lab.org

**Recovery:**
```bash
docker start traefik
```

---

## Shared Infrastructure

### PostgreSQL 17

| IP | Databases |
|----|-----------|
| 172.18.0.13 | authentik_db, netbox, gitea, netdisco_db, diode, hydra |

**Data:** `/mnt/user/appdata/postgresql17/`

**Recovery:**
```bash
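docker start postgresql17
# Wait for DB to be ready before starting dependents
# (assumes pg_isready is available inside the container, as in the official postgres image)
until docker exec postgresql17 pg_isready -q; do sleep 2; done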
```

### Redis

| IP | Consumers |
|----|-----------|
| 172.18.0.14 | Authentik, NetBox, Diode |

**Recovery:**
```bash
docker start Redis
```

---

## Startup Order

When recovering from a full outage, start services in this order:

1. **postgresql17** - Database (wait 30s)
2. **Redis** - Cache/queue (wait 10s)
3. **traefik** - Reverse proxy
4. **authentik** + **authentik-worker** - SSO
5. **vaultwarden** - Passwords
6. All other services
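
The order above can be sketched as a small script. This is an illustration under assumptions: `start_stack` is a hypothetical helper, and it assumes every container already exists under the names used in this document, so that `docker start` is sufficient.

```bash
# Sketch: start containers in dependency order, pausing after the data stores.
# Names and wait times come from the list above; start_stack is a hypothetical helper.
start_stack() {
  local entry name wait_s
  # Each entry is "container:seconds-to-wait-afterwards".
  for entry in postgresql17:30 Redis:10 traefik:0 authentik:0 authentik-worker:0 vaultwarden:0; do
    name="${entry%%:*}"
    wait_s="${entry##*:}"
    echo "starting ${name}"
    # Stop at the first failure so dependents are not started on a broken base.
    if ! docker start "${name}" >/dev/null; then
      echo "failed to start ${name}" >&2
      return 1
    fi
    if [ "${wait_s}" -gt 0 ]; then
      sleep "${wait_s}"
    fi
  done
  return 0
}
```

Calling `start_stack` runs steps 1 to 5. The fixed waits mirror the list and are a coarse substitute for real readiness checks, so a slow database start can still trip up dependents.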

---

## Monitoring

### Uptime Kuma

| URL | Monitors |
|-----|----------|
| https://uptime.xtrm-lab.org | 27 services |

**Alerts:** Configured per service (email/webhook)

---

## Backup Strategy

| Data | Location | Frequency |
|------|----------|-----------|
| Unraid Flash | Google Drive | Daily |
| PostgreSQL | `/mnt/user/Backup/` | Daily |
| Vaultwarden | Unraid Flash | With flash backup |
| Authentik | PostgreSQL + `/mnt/user/appdata/authentik/` | Daily |
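
The table does not say how the daily PostgreSQL backup is produced. One possible approach, shown only as a sketch, is a per-database `pg_dump` run inside the container. `backup_databases` is a hypothetical helper, and the `postgres` superuser name and the `<db>-<date>.dump` file naming are assumptions rather than the existing setup.

```bash
# Sketch: dump each database hosted on postgresql17 into a dated file.
# backup_databases is a hypothetical helper; user name and file naming are assumptions.
backup_databases() {
  local dest="${1:-/mnt/user/Backup}"
  local stamp db out
  stamp="$(date +%F)"   # e.g. 2026-01-31
  for db in authentik_db netbox gitea netdisco_db diode hydra; do
    out="${dest}/${db}-${stamp}.dump"
    # -Fc writes pg_dump's compressed custom format, restorable with pg_restore.
    if docker exec postgresql17 pg_dump -U postgres -Fc "${db}" > "${out}"; then
      echo "dumped ${db}"
    else
      echo "failed ${db}" >&2
      rm -f "${out}"   # do not leave a truncated dump behind
      return 1
    fi
  done
}
```

Run with no arguments to write to `/mnt/user/Backup`, or pass another directory as the first argument. A dump taken this way would be restored with `pg_restore` into an existing database.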

---

## Future: XTRM-N1 Survival Node

When the hardware upgrade completes, these services will have replicas on XTRM-N1:

| Service | Primary | Replica |
|---------|---------|---------|
| DNS | HAP1 | XTRM-N1 |
| Vaultwarden | XTRM-N5 | XTRM-N1 |
| Authentik | XTRM-N5 | XTRM-N1 |

See: `wip/UPGRADE-2026-HARDWARE.md`