Compare commits

7 Commits

Author SHA1 Message Date
jazzymc
d0b4fae25e WiFi troubleshooting guide, fix empty security overrides, update config docs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-12 16:18:50 +02:00
Kaloyan Danchev
6320c0f8d9 Docs: Claude Code tooling setup on Unraid — Cooperator, glab, skills, MCP prep
Installed Cooperator CLI, glab, uv+Python 3.12, 6 custom skills,
and built MCP servers (shortcut, mikrotik, unraid). MCP registration
via `claude mcp add` still pending as TODO.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 22:13:44 +02:00
jazzymc
8aef54992a Docker audit: migrate all containers to Dockge, clean up Traefik config
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
2026-02-28 20:39:16 +02:00
Kaloyan Danchev
7867b5c950 WiFi VLAN fixes, CAP bridge filtering, AdGuard IP conflicts, channel optimization
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
- Enable bridge VLAN filtering on CAP for proper per-client VLAN assignment
- Fix AdGuard container IP conflicts (.2→.10, .3→.11) with static IPs
- Fix 2.4GHz co-channel interference (both APs were on ch 1, CAP now ch 6)
- Fix 5GHz overlap (HAP ch 36/5180, CAP moved to ch 52/5260)
- Update WiFi access-list: VLAN assignment now active with per-device VLAN IDs
- Add Xiaomi Air Purifier MC1 to VLAN 30 access-list

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 09:40:29 +02:00
Kaloyan Danchev
cdb961f943 Post-migration container cleanup: fix broken services, remove obsolete containers
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
Fixed Traefik networking (stale Docker bridge), adguardhome-sync config,
diode stack (Hydra DB + OAuth2 bootstrap), diode-agent auth. Removed 5
deprecated/duplicate containers. Started unmarr + rustfs stacks. 53
containers now running.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 17:30:15 +02:00
Kaloyan Danchev
877aa71d3e Update docs: motherboard swap, NVMe cache pool, Docker migration
All checks were successful
ci/woodpecker/push/woodpecker Pipeline was successful
- New motherboard installed, MAC/DHCP updated
- 3x Samsung 990 EVO Plus 1TB NVMe cache pool (ZFS RAIDZ1)
- Docker migrated from HDD loopback to NVMe ZFS storage driver
- disk1 confirmed dead (clicking heads), still on parity emulation
- Hardware inventory, changelog, and incident report updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 14:47:07 +02:00
Kaloyan Danchev
bf6a62a275 Add incident report: disk1 hardware failure (clicking/head crash)
HGST Ultrastar 10TB drive (serial 2TKK3K1D) failed on Feb 18.
Array running degraded on parity emulation. Recovery plan documented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 17:54:23 +02:00
9 changed files with 825 additions and 49 deletions

View File

@@ -68,6 +68,7 @@ infrastructure/
 ├── 08-DNS-ARCHITECTURE.md # DNS failover architecture
 ├── 09-TAILSCALE-VPN.md # Tailscale VPN setup
 ├── 10-FAILOVER-NOBARA.md # VRRP failover to Nobara
+├── 12-WIFI-TROUBLESHOOTING.md # WiFi/CAPsMAN troubleshooting guide
 ├── CHANGELOG.md # Change history
 ├── archive/ # Completed/legacy docs
 │ └── vlan-migration/ # VLAN migration project artifacts

View File

@@ -1,6 +1,6 @@
 # Other Services
-**Last Updated:** 2026-02-14
+**Last Updated:** 2026-02-24
 Non-critical services that enhance functionality but don't affect core network operation.
@@ -300,3 +300,8 @@ Non-critical services that enhance functionality but don't affect core network o
 | Pi-hole | Replaced by AdGuard Home | Removed |
 | Pangolin | Not in use | Removed |
 | Slurp'it | Replaced by Diode | Removed |
+| binhex-plexpass | Duplicate of Plex | Removed |
+| HomeAssistant_inabox | Duplicate of Home-Assistant-Container | Removed |
+| Docker-WebUI | Unused, non-functional | Removed |
+| hass-unraid | No config, unused | Removed |
+| nextcloud-aio-mastercontainer | Replaced by Nextcloud container | Removed |

View File

@@ -1,6 +1,6 @@
 # Hardware Inventory
-**Last Updated:** 2026-02-14
+**Last Updated:** 2026-02-24
 ---
@@ -109,18 +109,27 @@
 | **IP** | 192.168.10.20 |
 | **OS** | Unraid 6.x |
+**Motherboard:** Replaced 2026-02-24 (new board, details TBD)
 **Network:**
 | Interface | MAC | Speed |
 |-----------|-----|-------|
-| eth1 | A8:B8:E0:02:B6:15 | 2.5G |
-| eth2 | A8:B8:E0:02:B6:16 | 2.5G |
-| eth3 | A8:B8:E0:02:B6:17 | 2.5G |
-| eth4 | A8:B8:E0:02:B6:18 | 2.5G |
-| **bond0** | (virtual) | 5G aggregate |
+| br0 | 38:05:25:35:8E:7A | 2.5G |
 **Storage:**
-- Cache: (current NVMe)
-- Array: 3.5" HDDs
+| Device | Model | Size | Role | Status |
+|--------|-------|------|------|--------|
+| sdb | HUH721010ALE601 (serial 7PHBNYZC) | 10TB | Parity | OK |
+| disk1 | HUH721010ALE601 (serial 2TKK3K1D) | 10TB | Data (ZFS) | **FAILED** — clicking/head crash, emulated from parity |
+| nvme0n1 | Samsung 990 EVO Plus 1TB | 1TB | Cache pool (RAIDZ1) | OK |
+| nvme1n1 | Samsung 990 EVO Plus 1TB | 1TB | Cache pool (RAIDZ1) | OK |
+| nvme2n1 | Samsung 990 EVO Plus 1TB | 1TB | Cache pool (RAIDZ1) | OK |
+**ZFS Pools:**
+| Pool | Devices | Profile | Usable | Purpose |
+|------|---------|---------|--------|---------|
+| disk1 | md1p1 (parity-emulated) | single | 9.1TB | Main data (roms, media, appdata, backups) |
+| cache | 3x Samsung 990 EVO Plus 1TB NVMe | RAIDZ1 | ~1.8TB | Docker, containers |
 **Virtual IPs:**
 | IP | Purpose |
@@ -223,6 +232,7 @@ See: `wip/UPGRADE-2026-HARDWARE.md`
 |--------|------|--------|
 | XTRM-N5 (Minisforum N5 Air) | Production server | Planned |
 | XTRM-N1 (N100 ITX) | Survival node | Planned |
-| 3x Samsung 990 EVO Plus 1TB | XTRM-N5 NVMe pool | Planned |
+| 3x Samsung 990 EVO Plus 1TB | XTRM-U cache pool (RAIDZ1) | **Installed** 2026-02-24 |
 | 2x Fikwot FX501Pro 512GB | XTRM-N1 mirror | Planned |
+| 1x 10TB+ HDD | Replace failed disk1 | **Needed** |
 | MikroTik CRS310-8G+2S+IN | Replace ZX1 | Future |

View File

@@ -1,6 +1,6 @@
 # WiFi and CAPsMAN Configuration
-**Last Updated:** 2026-02-14
+**Last Updated:** 2026-03-12
 **Purpose:** Document WiFi network settings, CAPsMAN configuration, and device compatibility requirements
 ---
@@ -23,11 +23,12 @@
 | SSID | XTRM |
 | Band | 5GHz |
 | Mode | 802.11ax (WiFi 6) |
-| Channel | Auto (DFS enabled) |
-| Width | 80MHz |
+| Channel | 5745 MHz (ch 149) |
+| Width | 20/40/80MHz |
 | Security | WPA2-PSK + WPA3-PSK |
 | Cipher | CCMP (AES) |
-| 802.11r (FT) | Enabled |
+| 802.11r (FT) | Disabled |
+| Skip DFS | All |
 | Password | `M0stW4nt3d@home` |
 ---
@@ -44,12 +45,14 @@ Some devices (Tuya JMWZG1 gateway, Amazfit TREX3, iPad 2) require legacy setting
 |---------|-------|--------|
 | SSID | XTRM2 | |
 | Band | 2.4GHz | IoT compatibility |
-| Mode | **802.11g** | Legacy device support |
+| Mode | **802.11g** | Legacy device support (NOT 802.11n — breaks IoT) |
 | Channel | **1 (2412 MHz)** | Most compatible |
 | Width | **20MHz** | Required for old devices |
 | Security | **WPA-PSK + WPA2-PSK** | WPA needed for legacy |
 | Cipher | **TKIP + CCMP** | TKIP required for old devices |
 | 802.11r (FT) | **Disabled** | Causes issues with IoT |
+**CRITICAL:** Security must be set explicitly on the interface, not just the profile. Empty `security.authentication-types=""` means OPEN network, not "inherit from profile." See [12-WIFI-TROUBLESHOOTING.md](12-WIFI-TROUBLESHOOTING.md).
 | Password | `M0stW4nt3d@IoT` | |
 ### Devices Requiring WPA + TKIP
@@ -98,44 +101,73 @@ If devices still can't connect, use WPA-only with TKIP-only:
 | Interfaces | bridge, vlan10-mgmt |
 | Certificate | Auto-generated |
-### CAP Device (CAP XL ac - 192.168.10.2)
+### CAP Device (cAP XL ac - 192.168.10.2)
 | Setting | Value |
 |---------|-------|
 | caps-man-addresses | 192.168.10.1 |
+| discovery-interfaces | bridgeLocal |
+| slaves-datapath | capdp (bridge=bridgeLocal, vlan-id=40) |
 | certificate | request |
 | RouterOS | 7.21.1 |
 | SSH Port | 2222 |
-| SSH | `ssh -i ~/.ssh/mikrotik_key -p 2222 xtrm@192.168.10.2` |
-**Note:** CAP was factory reset on 2026-02-13. CAPsMAN certificate was regenerated and CAP re-enrolled with `certificate=request`.
+| SSH (via proxy) | See ProxyJump command below |
+**SSH Access:** Direct SSH to CAP is unreliable. Use ProxyJump through Unraid:
+```bash
+ssh -o ProxyCommand="ssh -i ~/.ssh/id_ed25519_unraid -p 422 -W %h:%p root@192.168.10.20" -i ~/.ssh/mikrotik_key -p 2222 xtrm@192.168.10.2
+```
+### CAP Bridge VLAN Filtering
+The CAP runs bridge VLAN filtering to properly tag/untag WiFi client traffic before sending it to the HAP over the trunk link (ether1):
+| Setting | Value |
+|---------|-------|
+| bridgeLocal | vlan-filtering=yes, pvid=10 |
+| ether1 (trunk) | bridge port, PVID=10 |
+| wifi1, wifi2 | dynamic bridge ports, PVID=40 (set by datapath vlan-id) |
+**Bridge VLAN Table:**
+| VLAN | ether1 | wifi1 | wifi2 | bridgeLocal | Purpose |
+|------|--------|-------|-------|-------------|---------|
+| 10 | untagged | - | - | untagged | Management |
+| 20 | tagged | tagged | tagged | - | Trusted |
+| 25 | tagged | tagged | tagged | - | Kids |
+| 30 | tagged | tagged | tagged | - | IoT |
+| 35 | tagged | tagged | tagged | - | Cameras |
+| 40 | tagged | untagged | untagged | - | CatchAll (default) |
 ### CAP Interfaces
 | Interface | Radio | Band | SSID | Security | Status |
 |-----------|-------|------|------|----------|--------|
-| cap-wifi1 | wifi1 | 2.4GHz | XTRM2 | WPA2-PSK, CCMP | Working |
-| cap-wifi2 | wifi2 | 5GHz | XTRM | WPA2/WPA3-PSK | Working (Ch 5220, 20/40MHz) |
+| cap-wifi1 | MAC :BE | 2.4GHz | XTRM2 | WPA+WPA2, TKIP+CCMP | Working (Ch 13/2472, 20MHz, 802.11g) |
+| cap-wifi2 | MAC :BF | 5GHz | XTRM | WPA2/WPA3-PSK, CCMP | Working (Ch 36/5180, 20/40/80MHz, 802.11ac) |
-**Note:** cap-wifi1 uses cfg-xtrm2 but with WPA2+CCMP only (not WPA+TKIP like the local wifi2). Legacy IoT devices requiring TKIP will only work on HAP1's local wifi2.
+**Note:** CAP radios swapped after CAPsMAN re-provisioning. Identify by MAC address, not interface name. See [12-WIFI-TROUBLESHOOTING.md](12-WIFI-TROUBLESHOOTING.md) for re-provisioning procedures.
 ---
 ## WiFi Access List
-**Status:** VLAN assignment via access list is **not active** (rolled back 2026-01-27). All entries use `action=accept` without VLAN ID. Devices get their VLAN via DHCP static leases on the bridge.
+**Status:** VLAN assignment via access list is **active**. Each entry has a `vlan-id` that assigns the device to the correct VLAN upon WiFi association. This works on both HAP (local) and CAP (remote, via bridge VLAN filtering).
-**29 entries** configured (MAC-based accept rules + 1 default catch-all):
+**30+ entries** configured (MAC-based accept rules with VLAN IDs + 1 default catch-all):
-| # | MAC | Device | Notes |
-|---|-----|--------|-------|
-| 0 | AA:ED:8B:2A:40:F1 | Samsung S25 Ultra - Kaloyan | |
-| 1 | 82:6D:FB:D9:E0:47 | MacBook Air - Nora | |
-| 12 | CE:B8:11:EA:8D:55 | MacBook - Kaloyan | |
-| 13 | BE:A7:95:87:19:4A | MacBook 5GHz - Kaloyan | |
-| 27 | B8:27:EB:32:B2:13 | RecalBox RPi3 | VLAN 25 (Kids) |
-| 28 | CC:5E:F8:D3:37:D3 | ASUS ROG Ally - Kaloyan | |
-| 29 | (any) | Default - VLAN40 | Catch-all |
+| # | MAC | Device | VLAN |
+|---|-----|--------|------|
+| 0 | AA:ED:8B:2A:40:F1 | Samsung S25 Ultra - Kaloyan | 20 |
+| 1 | 82:6D:FB:D9:E0:47 | MacBook Air - Nora | 20 |
+| 12 | CE:B8:11:EA:8D:55 | MacBook - Kaloyan | 20 |
+| 13 | BE:A7:95:87:19:4A | MacBook 5GHz - Kaloyan | 20 |
+| 27 | B8:27:EB:32:B2:13 | RecalBox RPi3 | 25 |
+| 28 | CC:5E:F8:D3:37:D3 | ASUS ROG Ally - Kaloyan | 20 |
+| 31 | C8:5C:CC:40:B4:AA | Xiaomi Air Purifier 2 | 30 |
+| 32 | (any) | Default - VLAN40 | 40 (catch-all) |
+**Default behavior:** Devices not in the access list get VLAN 40 (CatchAll) via the default rule and the datapath `vlan-id=40`.
 ### Show Full Access List

View File

@@ -1,6 +1,6 @@
 # DNS Architecture with AdGuard Failover
-**Last Updated:** 2026-02-06
+**Last Updated:** 2026-02-26
 ---
@@ -194,8 +194,10 @@ Settings are synced from Unraid (source of truth) to MikroTik every 30 minutes.
 ### Sync Container
+Container: `adguardhome-sync` at 192.168.10.11 (br0 macvlan, static IP)
 ```yaml
-# /mnt/user/appdata/adguard-sync/adguardhome-sync.yaml
+# /mnt/user/appdata/dockge/stacks/adguard-sync/adguardhome-sync.yaml
 cron: "*/30 * * * *"
 runOnStart: true
@@ -204,22 +206,13 @@ origin:
 username: jazzymc
 password: 7RqWElENNbZnPW
-replicas:
-- url: http://192.168.10.1:3000
+replica:
+url: http://192.168.10.1:3000
 username: jazzymc
 password: 7RqWElENNbZnPW
-features:
-dns:
-serverConfig: false
-accessLists: true
-rewrites: true
-filters: true
-clientSettings: true
-services: true
 ```
-**Note:** The sync container must be connected to both `dockerproxy` and `br0` networks to reach both AdGuard instances.
+**Note:** The sync container is on the `br0` macvlan network with a static IP to avoid conflicts with infrastructure devices.
 ---
--- ---

View File

@@ -0,0 +1,275 @@
# Development Environment
**Last Updated:** 2026-03-08
Web-based development environment running directly on Unraid, providing VS Code IDE with full host access to Claude Code, Cooperator CLI, Docker, and all project repositories.
---
## OpenVSCode Server
| Property | Value |
|----------|-------|
| **URL** | https://code.xtrm-lab.org |
| **Auth** | Authentik forward auth (SSO) |
| **Port** | 3100 (host-native, not a container) |
| **Binary** | `/mnt/user/appdata/openvscode/current/` (symlink) |
| **Config** | `/mnt/user/appdata/openvscode/config/` |
| **Boot Script** | `/mnt/user/appdata/openvscode/start.sh` |
| **Log** | `/mnt/user/appdata/openvscode/server.log` |
**Why host-native?** Running directly on Unraid (not in a container) means the VS Code terminal has full access to `claude`, `cooperator`, `node`, `npm`, `docker`, `git`, and all host tools. No volume mount hacks or container-breaking updates.
### Persistence
All data lives on the array (`/mnt/user/`) — survives Unraid OS updates:
| Component | Path | Purpose |
|-----------|------|---------|
| Server binary | `/mnt/user/appdata/openvscode/openvscode-server-v1.109.5-linux-x64/` | VS Code server |
| Symlink | `/mnt/user/appdata/openvscode/current` → version dir | Easy version switching |
| VS Code config | `/mnt/user/appdata/openvscode/config/` | Extensions, settings, themes |
| Start script | `/mnt/user/appdata/openvscode/start.sh` | Startup with PATH setup |
### Updating OpenVSCode Server
```bash
# Download new version
cd /mnt/user/appdata/openvscode
curl -fsSL "https://github.com/gitpod-io/openvscode-server/releases/download/openvscode-server-vX.Y.Z/openvscode-server-vX.Y.Z-linux-x64.tar.gz" -o new.tar.gz
tar xzf new.tar.gz && rm new.tar.gz
# Switch symlink and restart
ln -sfn openvscode-server-vX.Y.Z-linux-x64 current
pkill -f "openvscode-server.*--port 3100"
/mnt/user/appdata/openvscode/start.sh
```
Extensions and settings are preserved (stored separately in `config/`).
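The download URL in the update steps follows a fixed pattern, so it can be derived from the version number alone. A minimal sketch, assuming the release naming stays as shown above; the function name `openvscode_release_url` is hypothetical:

```bash
#!/usr/bin/env bash
# Build the release tarball URL for a given OpenVSCode Server version.
# The URL pattern is copied from the update steps; the function name is
# an illustration, not part of any existing script.
openvscode_release_url() {
  local version="$1"   # e.g. 1.109.5, without the leading "v"
  local name="openvscode-server-v${version}-linux-x64"
  printf 'https://github.com/gitpod-io/openvscode-server/releases/download/openvscode-server-v%s/%s.tar.gz\n' \
    "$version" "$name"
}
```

Usage would look like `curl -fsSL "$(openvscode_release_url 1.109.5)" -o new.tar.gz`, followed by the same extract, symlink, and restart steps.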
### Traefik Routing
Defined in `/mnt/user/appdata/traefik/dynamic.yml`:
```yaml
openvscode-secure:
  rule: "Host(`code.xtrm-lab.org`)"
  entryPoints: [https]
  middlewares: [default-headers, authentik-forward-auth]
  tls:
    certResolver: cloudflare
  service: openvscode
# ...
openvscode:
  loadBalancer:
    servers:
      - url: "http://192.168.10.20:3100"
```
---
## Claude Code
| Property | Value |
|----------|-------|
| **Version** | 2.1.71 |
| **Binary** | `/mnt/user/appdata/claude-code/.npm-global/bin/claude` |
| **Symlink** | `/root/.local/bin/claude` |
| **Config** | `/mnt/user/appdata/claude-code/.claude.json` (symlinked from `/root/.claude.json`) |
| **Settings** | `/mnt/user/appdata/claude-code/.claude/` (symlinked from `/root/.claude/`) |
| **Boot Script** | `/mnt/user/appdata/claude-code/install-claude.sh` |
### Persistence
npm global prefix set to `/mnt/user/appdata/claude-code/.npm-global/` (array-backed). Boot script creates symlinks from `/root/` to persistent paths.
### Updating Claude Code
```bash
source /root/.bashrc
npm install -g @anthropic-ai/claude-code
claude --version
```
---
## Cooperator CLI
| Property | Value |
|----------|-------|
| **Version** | 3.36.1 |
| **Binary** | `/mnt/user/appdata/claude-code/.npm-global/bin/cooperator` |
| **Config** | `~/.cooperator/.env` (Shortcut token, Confluence, git config) |
| **Registry** | `@ampeco:registry=https://gitlab.com/api/v4/projects/71775017/packages/npm/` |
| **npm auth** | `/root/.npmrc` (GitLab PAT) |
### What Cooperator Install Sets Up
- **Commands** — `~/.claude/commands/cooperator` → cooperator's claude-commands
- **Agents** — `~/.claude/agents/implementation-task-executor.md`
- **Skills** — 12 cooperator skills (shortcut-operations, create-feature-story, gitlab-operations, etc.)
- **Shortcut API** — validated via `~/.cooperator/.env` token
### Updating Cooperator
```bash
source /root/.bashrc
npm install -g @ampeco/cooperator
cooperator --version
```
**Note:** `/root/.npmrc` is in RAM and is recreated on boot if needed. The GitLab PAT is stored in `/boot/config/go`; a persistent `.npmrc` setup would be needed if the token changes frequently.
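Recreating `.npmrc` on boot could look roughly like the sketch below. It assumes the PAT is passed in as an argument and that GitLab's usual `//host/path/:_authToken=` line is what the registry expects; the function name `write_npmrc` and its arguments are hypothetical. The registry URL is the one from the table above.

```bash
#!/usr/bin/env bash
# Recreate an .npmrc for the @ampeco GitLab package registry.
# Assumptions: token supplied by the caller; function name is illustrative.
write_npmrc() {
  local token="$1"
  local target="${2:-/root/.npmrc}"
  local registry_path="gitlab.com/api/v4/projects/71775017/packages/npm/"
  (
    umask 077   # owner-only permissions for newly created files (holds a token)
    {
      printf '@ampeco:registry=https://%s\n' "$registry_path"
      printf '//%s:_authToken=%s\n' "$registry_path" "$token"
    } > "$target"
  )
}
```

Called from the boot script with the stored token, this would overwrite `/root/.npmrc` on every boot, so a changed token only needs updating in one place.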
---
## GitLab CLI (glab)
| Property | Value |
|----------|-------|
| **Version** | 1.89.0 |
| **Binary** | `/usr/local/bin/glab` (RAM — lost on reboot) |
| **Config** | `~/.config/glab-cli/config.yml` |
| **Auth** | GitLab PAT (same as npm registry token) |
**Note:** glab binary at `/usr/local/bin/` is lost on Unraid reboot. Add to boot script or persist to appdata.
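One way to persist the binary, sketched under the assumption that a copy of `glab` has already been saved somewhere under appdata. The default source path below is hypothetical and does not exist in the current setup:

```bash
#!/usr/bin/env bash
# Re-link a binary from persistent storage into a RAM-backed bin directory.
# Assumption: the persistent copy was placed there manually beforehand.
restore_binary() {
  local src="${1:-/mnt/user/appdata/glab/glab}"   # hypothetical location
  local dest_dir="${2:-/usr/local/bin}"
  [ -f "$src" ] || { echo "missing: $src" >&2; return 1; }
  mkdir -p "$dest_dir"
  ln -sfn "$src" "$dest_dir/$(basename "$src")"
}
```

A symlink is used rather than a copy so that replacing the persistent binary upgrades `glab` without touching the boot script.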
---
## Python (via uv)
| Property | Value |
|----------|-------|
| **uv** | `/root/.local/bin/uv` |
| **Python** | 3.12.13 (managed by uv) |
| **mikrotik-mcp venv** | `/mnt/user/projects/mikrotik-mcp/venv/` |
| **unraid-mcp venv** | `/mnt/user/projects/unraid-mcp/.venv/` |
---
## Custom Skills
6 custom skills synced from Mac to `/mnt/user/appdata/claude-code/custom-skills/`:
| Skill | Description |
|-------|-------------|
| ev-compliance-story | EV regulatory compliance story creation |
| ev-protocol-expert | OCPP/OCPI/AFIR protocol expertise |
| frontend-designer | Nova/Vue component design |
| mikrotik-admin | MikroTik router management via MCP |
| prd-generator | Product requirements documents |
| unraid-admin | Unraid server management via MCP |
Symlinked to `~/.claude/skills/` alongside 12 cooperator skills (18 total).
---
## MCP Servers
### Pending Registration (TODO)
The following MCP servers need to be registered via `claude mcp add` on Unraid:
| Server | Command | Status |
|--------|---------|--------|
| **shortcut** | `node /mnt/user/appdata/claude-code/mcp-server-shortcut/dist/index.js` | Built, needs `claude mcp add` |
| **mikrotik** | `/mnt/user/projects/mikrotik-mcp/venv/bin/python -m mikrotik_mcp.server` | Venv ready, needs `claude mcp add` |
| **unraid** | `/mnt/user/projects/unraid-mcp/.venv/bin/python -m unraid_mcp.main` | Venv ready, needs `claude mcp add` |
| **playwright** | `npx -y @playwright/mcp@latest --isolated` | npx available, needs `claude mcp add` |
| **smartbear** | `npx -y @smartbear/mcp@latest` | npx available, needs `claude mcp add` |
### Environment Variables for MCPs
- **mikrotik**: `DEVICES_PATH=/mnt/user/projects/mikrotik-mcp/devices.json`
- **unraid**: `UNRAID_API_URL`, `UNRAID_API_KEY`, `UNRAID_MCP_TRANSPORT=stdio`, `UNRAID_VERIFY_SSL=false`
- **shortcut**: `SHORTCUT_API_TOKEN` (from `~/.cooperator/.env`)
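The pending registration could be scripted roughly as follows. This is a sketch that assumes the `claude mcp add <name> --env KEY=VALUE -- <command> [args...]` form; check `claude mcp add --help` on the installed version before relying on it. The `run` helper and the `RUN` variable are hypothetical, and only two of the five servers are shown.

```bash
#!/usr/bin/env bash
# Print the registration commands by default; set RUN=1 to execute them.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else printf '%s\n' "$*"; fi
}

register_mcp_servers() {
  # Paths and the DEVICES_PATH value are taken from the tables above.
  run claude mcp add mikrotik \
    --env DEVICES_PATH=/mnt/user/projects/mikrotik-mcp/devices.json \
    -- /mnt/user/projects/mikrotik-mcp/venv/bin/python -m mikrotik_mcp.server
  run claude mcp add playwright \
    -- npx -y @playwright/mcp@latest --isolated
}
```

Running `register_mcp_servers` without `RUN=1` only prints the commands, which allows a review before anything is written to the Claude Code config.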
---
## Projects Workspace
All projects at `/mnt/user/projects/`, opened as default folder in VS Code.
### Personal Projects (Gitea)
| Project | Gitea Repo | Description |
|---------|-----------|-------------|
| infrastructure | jazzymc/infrastructure | This repo — home lab documentation |
| claude-skills | jazzymc/claude-skills | Claude Code custom skills |
| mikrotik-mcp | jazzymc/mikrotik-mcp | MikroTik MCP server |
| unraid-mcp | jazzymc/unraid-mcp | Unraid MCP server |
| unraid-glass | jazzymc/unraid-glass | Unraid dashboard plugin |
| openclaw | jazzymc/openclaw | OpenClaw game project |
| nanobot-mcp | jazzymc/nanobot-mcp | Nanobot MCP server |
| nanobot-hkuds | jazzymc/nanobot-hkuds | Nanobot HKU DS |
| xtrm-agent | jazzymc/xtrm-agent | AI agent framework |
| geekmagic-smalltv | jazzymc/geekmagic-smalltv | SmallTV firmware |
| homarr | jazzymc/homarr | Homarr dashboard fork |
| shortcut-daily-sync | jazzymc/shortcut-daily-sync | Shortcut sync tool |
**Remote URL format:** `https://jazzymc:<token>@git.xtrm-lab.org/jazzymc/<repo>.git`
### AMPECO Work Projects
| Project | Source | Type |
|---------|--------|------|
| backend | GitLab (ampeco/apps/charge/backend) | Git clone |
| crm | GitLab (ampeco/apps/charge/crm) | Git clone |
| marketplace | GitLab (ampeco/apps/charge/marketplace) | Git clone |
| mobile-2 | GitLab (ampeco/apps/charge/mobile-2) | Git clone |
| ad-hoc-payment-web-app | GitLab (ampeco/apps/charge/external-apps/) | Git clone |
| dev-proxy | GitLab (ampeco/apps/shared/dev-proxy) | Git clone |
| ampeco-custom-dashboard-widgets-boilerplate | GitHub (ampeco/) | Git clone |
| docs | Local rsync | Reference docs |
| stories | Local rsync | Product stories |
| booking-ewa | Local rsync | Booking app |
| ewa-ui | Local rsync | EWA frontend |
| design-tokens | Local rsync | Design system tokens |
| ampeco-backup | Local rsync | Configuration backups |
| central_registry | Local rsync | Service registry |
| CCode-UI-Distribution-1.0.0 | Local rsync | UI distribution |
| automations | Local rsync | Automation scripts |
**GitLab auth:** OAuth2 PAT in remote URLs.
---
## Boot Sequence
`/boot/config/go` triggers on Unraid boot:
1. **Wait for array** — polls for `/mnt/user/appdata/claude-code` (up to 5 min)
2. **Claude Code setup**: `/mnt/user/appdata/claude-code/install-claude.sh`
   - Creates symlinks (`/root/.local/bin/claude`, `/root/.claude`, `/root/.claude.json`)
   - Writes `.bashrc` with persistent npm PATH
3. **OpenVSCode Server**: `/mnt/user/appdata/openvscode/start.sh`
   - Kills any existing instance
   - Starts on port 3100 with persistent config dir
   - Sources Claude/Cooperator PATH for terminal sessions
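The "wait for array" step in the sequence above amounts to a bounded polling loop. A minimal sketch; the function name and the 5-second interval are assumptions, since the actual contents of `/boot/config/go` are not shown here:

```bash
#!/usr/bin/env bash
# Poll until a directory exists or a timeout elapses.
# Returns 0 once the directory is present, 1 on timeout.
wait_for_dir() {
  local dir="$1"
  local timeout="${2:-300}"   # seconds; 300 matches the 5-minute limit
  local interval="${3:-5}"    # seconds between checks (assumed value)
  local waited=0
  until [ -d "$dir" ]; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

In the boot script this would gate the later steps, for example `wait_for_dir /mnt/user/appdata/claude-code && /mnt/user/appdata/claude-code/install-claude.sh`.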
---
## Architecture Diagram
```
Browser → https://code.xtrm-lab.org
Traefik (443) → Authentik SSO check
OpenVSCode Server (:3100, host-native)
Unraid Host Shell
├── claude (2.1.71)
├── cooperator (3.36.1)
├── glab (1.89.0)
├── node (22.18.0) / npm (10.9.3) / bun (1.3.10)
├── uv + python 3.12
├── docker / docker compose
├── git
└── /mnt/user/projects/
    ├── ampeco/ (18 AMPECO work projects)
    ├── infrastructure/
    ├── claude-skills/
    ├── mikrotik-mcp/
    └── ... (12 personal repos)
```

View File

@@ -0,0 +1,237 @@
# WiFi / CAPsMAN Troubleshooting Guide
**Last Updated:** 2026-03-12
**Purpose:** Document known pitfalls, root causes, and diagnostic procedures for WiFi and CAPsMAN issues on the MikroTik HAP ax³ + cAP XL ac setup.
---
## Pitfall 1: Empty Inline Security Overrides = Open Network
**Severity:** CRITICAL
**Problem:** Setting `security.authentication-types=""` on a WiFi interface does NOT mean "inherit from security profile." RouterOS interprets an empty string as **no authentication (open network)**.
**Symptoms:**
- Devices connect but show empty AUTH-TYPE in registration table
- IoT devices that try WPA/WPA2 handshake silently fail — router logs show ZERO connection attempts
- Other devices connect fine (they accept open)
**Root Cause:** Attempting to "clear inline overrides" to inherit from the security profile by setting empty values. RouterOS treats empty string as explicit "no auth."
**Fix:**
```routeros
# Always set explicit values on the interface
/interface wifi set wifi2 \
security.authentication-types=wpa-psk,wpa2-psk \
security.encryption=tkip,ccmp
```
**Rule:** NEVER set `security.authentication-types=""` or `security.encryption=""`. Always use explicit values matching the security profile.
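For periodic checks from the Unraid host, the value printed by `:put [/interface wifi get wifi2 security.authentication-types]` can be classified with a small helper. This is a sketch: the function name is hypothetical, and how the value is fetched over SSH is left to the caller.

```bash
#!/usr/bin/env bash
# Classify the authentication-types value read from a WiFi interface.
# An empty value means the interface is running as an open network.
auth_state() {
  local value
  # Strip whitespace and CR/LF that remote command output may carry.
  value="$(printf '%s' "$1" | tr -d '[:space:]')"
  if [ -z "$value" ]; then
    echo "OPEN"
  else
    echo "OK:$value"
  fi
}
```

Any `OPEN` result should be treated as the failure described in this pitfall and fixed by setting explicit values on the interface.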
---
## Pitfall 2: CAPsMAN Re-Provisioning Wipes Interface Config
**Severity:** HIGH
**Problem:** Running `/interface wifi capsman remote-cap provision` clears all configuration from cap-wifi interfaces — security, channel, datapath, and SSID are all removed. Interfaces show "SSID not set" and remain inactive.
**Fix:** After re-provisioning, manually re-apply full config:
```routeros
# 2.4GHz (cap-wifi1 = MAC :BE = 2.4GHz radio)
/interface wifi set cap-wifi1 \
configuration=cfg-xtrm2 security=sec-xtrm2 datapath=dp-cap \
channel.frequency=2472 channel.band=2ghz-g channel.width=20mhz
# 5GHz (cap-wifi2 = MAC :BF = 5GHz radio)
/interface wifi set cap-wifi2 \
configuration=cfg-xtrm security=sec-xtrm datapath=dp-cap \
channel.frequency=5180 channel.band=5ghz-ac \
channel.width=20/40/80mhz channel.skip-dfs-channels=all
# Re-enable both
/interface wifi enable cap-wifi1
/interface wifi enable cap-wifi2
```
---
## Pitfall 3: Interface IDs Change After Re-Provisioning
**Severity:** HIGH
**Problem:** After CAPsMAN re-provisioning, cap-wifi interface internal IDs change (e.g., `*20`/`*21` become `*22`/`*23`). Access-list rules referencing old IDs stop matching.
**Symptom:** `client was disconnected because could not assign vlan` error on CAP interfaces.
**Fix:**
```routeros
# Check current IDs
:foreach i in=[/interface wifi find where name~"cap"] do={
:put ([/interface wifi get $i name] . " = " . $i)
}
# Update access-list rules
/interface wifi access-list set [find where interface=*OLD] interface=*NEW
```
**Best practice:** Don't use CAP-specific access-list rules. Let all clients (HAP and CAP) use the same MAC-based access list. The HAP handles VLAN assignment uniformly via CAPsMAN.
---
## Pitfall 4: CAP Radio-to-Interface Mapping Swap
**Severity:** MEDIUM
**Problem:** After re-provisioning, `cap-wifi1` and `cap-wifi2` may swap which physical radio they map to. Assigning the 5GHz config to the 2.4GHz radio (or vice versa) causes a "no available channels" error.
**Identification:** Check MAC addresses:
| MAC suffix | Radio | Must receive |
|------------|-------|--------------|
| :BE | 2.4GHz | 2.4GHz config (XTRM2) |
| :BF | 5GHz | 5GHz config (XTRM) |
**Fix:** Match config to the correct radio MAC, not the interface name.
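The mapping table can be encoded so that a script picks the configuration from the radio MAC rather than the interface name. A minimal sketch; the function name is hypothetical, while the config names `cfg-xtrm2` and `cfg-xtrm` are the ones used in the re-provisioning commands in this guide:

```bash
#!/usr/bin/env bash
# Choose the CAPsMAN configuration from the radio's MAC suffix.
config_for_radio_mac() {
  local mac
  mac="$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')"
  case "$mac" in
    *:BE) echo "cfg-xtrm2" ;;          # 2.4GHz radio, SSID XTRM2
    *:BF) echo "cfg-xtrm" ;;           # 5GHz radio, SSID XTRM
    *)    echo "unknown"; return 1 ;;  # not a known CAP radio
  esac
}
```

The non-zero return for unknown suffixes lets a caller stop before applying any configuration to the wrong radio.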
---
## Pitfall 5: CAP Band Must Be AC, Not AX
**Severity:** MEDIUM
**Problem:** The cAP XL ac only supports 802.11ac. Setting band to `5ghz-ax` results in the radio not starting.
**Fix:** Always use `5ghz-ac` for the CAP 5GHz channel configuration.
---
## Pitfall 6: IoT Devices Need Legacy WiFi Settings
**Severity:** HIGH
**Problem:** Many IoT devices (vacuums, smart gateways, ovens) require legacy WiFi settings to connect. Using 802.11n-only or WPA2-only silently prevents connections — the router sees zero attempts.
**Required XTRM2 (2.4GHz) settings:**
| Setting | Value | Reason |
|---------|-------|--------|
| Band | `2ghz-g` | NOT `2ghz-n` — IoT devices may only support 802.11g |
| Auth | `wpa-psk,wpa2-psk` | Some devices need WPA1 available |
| Encryption | `tkip,ccmp` | Some devices need TKIP |
| Channel width | `20mhz` | Maximum compatibility |
| FT (802.11r) | Disabled | Causes issues with IoT |
**Known devices requiring legacy support:**
- Roborock S7 Vacuum (B0:4A:39:3F:9A:14)
- Tuya Smart Gateway JMWZG1 (38:1F:8D:04:6F:E4)
- Bosch Oven (94:27:70:1E:0C:EE)
- Various other IoT appliances
---
## 5GHz Channel Separation
HAP and CAP must use different 5GHz channels to avoid co-channel interference:
| Device | Channel | Frequency | Band |
|--------|---------|-----------|------|
| HAP wifi1 | 149 | 5745 MHz | 5ghz-ax |
| CAP cap-wifi2 | 36 | 5180 MHz | 5ghz-ac |
Both use `skip-dfs-channels=all` to avoid radar detection disconnects.
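The channel and frequency pairs in this guide follow the standard mapping: 2.4GHz channels 1 to 13 sit at 2407 + 5 × channel MHz, and 5GHz channels at 5000 + 5 × channel MHz. A small sketch for converting between the two when writing `channel.frequency` values; the function name is hypothetical:

```bash
#!/usr/bin/env bash
# Convert a WiFi channel number to its center frequency in MHz.
# Covers 2.4GHz channels 1-13 and 5GHz channels 36-177 only.
channel_to_mhz() {
  local ch="$1"
  if [ "$ch" -ge 1 ] && [ "$ch" -le 13 ]; then
    echo $((2407 + 5 * ch))   # 2.4GHz band
  elif [ "$ch" -ge 36 ] && [ "$ch" -le 177 ]; then
    echo $((5000 + 5 * ch))   # 5GHz band
  else
    return 1                  # unsupported channel number
  fi
}
```

For example, `channel_to_mhz 149` gives 5745 and `channel_to_mhz 36` gives 5180, matching the HAP and CAP rows in the table above.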
---
## Diagnostic Checklist
When devices can't connect to WiFi, check in this order:
### Step 1: Check Security (Most Common Issue)
```routeros
# Check if AUTH-TYPE is empty in registration table (= open network!)
/interface wifi registration-table print
# Check inline security overrides
:put [/interface wifi get wifi2 security.authentication-types]
:put [/interface wifi get wifi2 security.encryption]
# If empty → security is broken, set explicit values
```
### Step 2: Check Band Compatibility
```routeros
/interface wifi monitor wifi2 once
# If channel shows /n → change to /g for IoT compatibility
```
### Step 3: Enable Debug Logging
```routeros
/system logging add topics=wireless,debug action=memory
# Then check: /log print where topics~"wireless"
```
### Step 4: Check CAP Interface IDs
```routeros
# Verify access-list rules reference current IDs
:foreach i in=[/interface wifi find where name~"cap"] do={
:put ([/interface wifi get $i name] . " = " . $i)
}
/interface wifi access-list print where interface~"\\*"
```
### Step 5: Check Radio-MAC Mapping
```routeros
# Verify cap interfaces are assigned to correct radios
/interface wifi print where name~"cap" proplist=name,mac-address,channel.band
```
### Step 6: If Router Sees ZERO Attempts
This means:
- **Security mismatch** — device won't even try (most likely empty auth = open)
- **Band incompatibility** — 802.11n-only blocks 802.11g devices
- **Device-side issue** — power cycle device, re-do WiFi setup from scratch
---
## Quick Recovery Commands
### Restore XTRM2 (2.4GHz) to known working state
```routeros
/interface wifi security set sec-xtrm2 \
authentication-types=wpa-psk,wpa2-psk encryption=tkip,ccmp
/interface wifi set wifi2 \
security.authentication-types=wpa-psk,wpa2-psk \
security.encryption=tkip,ccmp \
security.ft=no security.ft-over-ds=no
/interface wifi channel set ch-2g-hap \
frequency=2412 band=2ghz-g width=20mhz
```
### Restore XTRM (5GHz) to known working state
```routeros
/interface wifi security set sec-xtrm \
authentication-types=wpa2-psk,wpa3-psk encryption=ccmp
/interface wifi set wifi1 \
security.authentication-types=wpa2-psk,wpa3-psk \
security.encryption=ccmp \
security.ft=no security.ft-over-ds=no
/interface wifi channel set ch-5g-hap \
frequency=5745 band=5ghz-ax width=20/40/80mhz skip-dfs-channels=all
```
### Restore CAP interfaces after re-provisioning
```routeros
/interface wifi set cap-wifi1 \
configuration=cfg-xtrm2 security=sec-xtrm2 datapath=dp-cap \
channel.frequency=2472 channel.band=2ghz-g channel.width=20mhz
/interface wifi set cap-wifi2 \
configuration=cfg-xtrm security=sec-xtrm datapath=dp-cap \
channel.frequency=5180 channel.band=5ghz-ac \
channel.width=20/40/80mhz channel.skip-dfs-channels=all
/interface wifi enable cap-wifi1
/interface wifi enable cap-wifi2
```
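The recovery commands above give frequencies in MHz, while the changelog refers to channel numbers. As a cross-check, the standard mapping (channel = (f - 2407) / 5 for 2.4 GHz channels 1 to 13, channel = (f - 5000) / 5 for 5 GHz) can be sketched in shell. The function name `freq_to_channel` is hypothetical, and the script is meant to run on any Linux host as a sanity-check aid, not on the router itself.

```shell
# Convert a WiFi centre frequency in MHz to its IEEE 802.11 channel number.
# Covers 2.4 GHz channels 1-13 and the 5 GHz band only.
freq_to_channel() {
    f=$1
    if [ "$f" -ge 2412 ] && [ "$f" -le 2472 ]; then
        echo $(( (f - 2407) / 5 ))
    elif [ "$f" -ge 5160 ] && [ "$f" -le 5885 ]; then
        echo $(( (f - 5000) / 5 ))
    else
        echo "unsupported frequency: $f" >&2
        return 1
    fi
}

# Frequencies taken from the recovery commands in this guide.
for f in 2412 2472 5180 5745; do
    echo "$f MHz = channel $(freq_to_channel "$f")"
done
```

For the four frequencies used above this prints channels 1, 13, 36, and 149.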


@@ -2,6 +2,138 @@
**Purpose:** Major infrastructure events only. Minor changes are in git commit messages.
---
## 2026-03-12
### WiFi Optimization & Troubleshooting
- **[WIFI]** Moved HAP 5GHz from ch 36 (5180) to ch 149 (5745), skip-dfs-channels=all
- **[WIFI]** Moved CAP 5GHz from ch 52 (5260) to ch 36 (5180), band corrected from ax to ac
- **[WIFI]** Separated HAP/CAP 5GHz channels to avoid co-channel interference
- **[WIFI]** Fixed sec-xtrm2 security: WPA+WPA2 with TKIP+CCMP for IoT compatibility
- **[WIFI]** Fixed critical bug: empty inline security.authentication-types="" caused wifi2 to run as open network — IoT devices silently failed to connect
- **[WIFI]** Set explicit encryption on all interfaces and security profiles (never leave empty)
- **[WIFI]** Removed CAP-specific access-list catch-all rules — all clients now use unified MAC-based access list
- **[WIFI]** Fixed CAP interface IDs in access-list after re-provisioning (*20/*21 → *22/*23)
- **[WIFI]** Restored 2.4GHz band to 2ghz-g (was changed to 2ghz-n, breaking IoT devices)
- **[WIFI]** Disabled FT (802.11r) on wifi1 (5GHz) for stability
- **[DOCS]** Added 12-WIFI-TROUBLESHOOTING.md with diagnostic checklist and recovery commands
---
## 2026-02-28
### Docker Container Audit & Migration to Dockge
- **[DOCKER]** Removed 4 orphan images: nextcloud/all-in-one, olprog/unraid-docker-webui, ghcr.io/ich777/doh-server, ghcr.io/idmedia/hass-unraid
- **[DOCKER]** Removed ancient pgAdmin4 v2.1 (status=Created) and fenglc/pgadmin4 image
- **[DOCKER]** Removed spaceinvaderone/ha_inabox image (replaced by Home-Assistant-Container)
- **[TRAEFIK]** Removed Docker provider constraint (`traefik.constraint=valid`) — Docker labels now auto-discovered
- **[TRAEFIK]** Cleaned up dynamic.yml: removed 14 stale/migrated router+service pairs (pangolin, pihole, doh, netbox, and services now using Docker labels)
- **[TRAEFIK]** Added dockge-secure router to dynamic.yml
- **[DOCKER]** Created 6 new Dockge stacks: docker-socket-proxy, tuyagateway, firefly, seekandwatch, ha-time-machine, homeassistant (replaced inabox with Container)
- **[DOCKER]** Migrated ALL 53 containers from dockerman to Dockge compose stacks (100% coverage)
- **[DOCKER]** Fixed Nextcloud Traefik rule: empty Host() → Host(`cloud.xtrm-lab.org`)
- **[DOCKER]** Fixed UptimeKuma Traefik rule: empty Host() → Host(`uptime.xtrm-lab.org`)
- **[DOCKER]** Fixed Homarr domain: `homarr.xtrm-lab.org` → `xtrm-lab.org` (root domain)
- **[DOCKER]** Fixed Netdisco entrypoint: `websecure` → `https`
- **[DOCKER]** Removed stale `traefik.constraint=valid` from Dockhand
- **[DOCKER]** Fixed Transmission middleware: removed non-existent `transmission-headers@file`
- **[DOCKER]** Added Authentik forward auth middleware to: n8n, homarr, transmission, speedtest-tracker, uptime-kuma, firefly, seekandwatch, open-webui, traefik dashboard, dockge, netalertx, urbackup, unimus
- **[DOCKER]** Added Traefik labels to: vaultwarden, open-webui (ai.xtrm-lab.org), firefly, seekandwatch
- **[DOCKER]** Added missing Unraid labels (icon, managed, webui) to: ntfy, timemachine, ollama, docker-socket-proxy, tuyagateway, all new stacks
- **[DOCKER]** Moved ollama + open-webui from bridge to dockerproxy network
- **[DOCKER]** Moved fireflyiii + firefly-data-importer from none to dockerproxy network
- **[DOCKER]** Moved SeekAndWatch from bridge to dockerproxy network
- **[DOCKER]** Removed traefik labels from host-network containers (plex, netalertx) — routed via dynamic.yml only
- **[DOCKER]** Fixed NetAlertX: added read_only, proper capabilities (NET_RAW/NET_ADMIN), and UID 20211
- **[DOCKER]** Removed empty netbox stack directory
## 2026-03-09
### Claude Code Tooling Completion
- **[SERVICE]** Installed Cooperator CLI v3.36.1 on Unraid (`npm install -g @ampeco/cooperator`)
- **[SERVICE]** Ran `cooperator install --non-interactive` — symlinked commands, agents, 12 skills to `~/.claude/`
- **[SERVICE]** Created `~/.cooperator/.env` with Shortcut API token, Confluence token, git config
- **[SERVICE]** Installed glab CLI v1.89.0 on Unraid (`/usr/local/bin/glab`) — authenticated as kaloyan.danchev
- **[SERVICE]** Installed uv package manager + Python 3.12.13 on Unraid
- **[SERVICE]** Created Python venvs for mikrotik-mcp and unraid-mcp projects
- **[SERVICE]** Copied MikroTik SSH key from Mac to Unraid — SSH to HAP ax3 verified working
- **[SERVICE]** Synced 6 custom Claude skills to `/mnt/user/appdata/claude-code/custom-skills/` (ev-compliance-story, ev-protocol-expert, frontend-designer, mikrotik-admin, prd-generator, unraid-admin)
- **[SERVICE]** Built shortcut MCP server at `/mnt/user/appdata/claude-code/mcp-server-shortcut/`
- **[SERVICE]** Enabled Claude plugins: ralph-loop, claude-md-management, playground
- **[DOCS]** Updated 12-DEVELOPMENT-ENVIRONMENT.md with Cooperator, glab, Python, skills, MCP sections
#### TODO — MCP Server Registration
The following MCP servers are built/ready but need `claude mcp add` registration (requires interactive Claude session on Unraid):
- shortcut, mikrotik, unraid, playwright, smartbear
## 2026-03-08
### Development Environment Setup
- **[SERVICE]** Installed OpenVSCode Server as host-native process (port 3100, not a container) — accessible at https://code.xtrm-lab.org
- **[SERVICE]** Traefik route added in dynamic.yml with Authentik forward auth
- **[SERVICE]** Boot auto-start via `/boot/config/go` → `/mnt/user/appdata/openvscode/start.sh`
- **[SERVICE]** Claude Code updated to v2.1.71, persistent at `/mnt/user/appdata/claude-code/.npm-global/`
- **[SERVICE]** Cooperator CLI v3.36.1 installed globally (`npm install -g @ampeco/cooperator`)
- **[SERVICE]** Created `/mnt/user/projects/` workspace with 12 personal repos (Gitea) + 18 AMPECO work projects (GitLab)
- **[DOCS]** Added `12-DEVELOPMENT-ENVIRONMENT.md` documenting full dev environment setup
### Docker Maintenance
- **[DOCKER]** Created Unraid Docker Manager XML templates for 11 containers missing them (adguardhome, gitea, minecraft, ntfy, ollama, open-webui, etc.)
- **[DOCKER]** Pulled new images for all 30 active Dockge stacks, 14 containers received updates
- **[DOCKER]** Cleaned up dangling images: 10.95 GB reclaimed
- **[DOCKER]** Organized all 42 containers into Docker Folders (12 folders: Infrastructure, Security, Monitoring, DevOps, Media, etc.)
- **[DOCKER]** Pushed 6 local-only projects to Gitea (claude-skills, mikrotik-mcp, unraid-mcp, nanobot-mcp, nanobot-hkuds, openclaw)
### Service Fixes
- **[FIX]** Gitea DB connection: fixed hardcoded PostgreSQL IP (172.18.0.13) → hostname `postgresql17` in compose and app.ini
- **[FIX]** Traefik: removed stale stopped container blocking restart
- **[FIX]** Redis: removed stale stopped container blocking recreate
## 2026-02-26
### WiFi & CAP VLAN Fixes
- **[WIFI]** Fixed 5GHz channel overlap: HAP wifi1 reduced from 80MHz to 40MHz at 5180MHz, CAP cap-wifi1 at 5220MHz (no overlap)
- **[WIFI]** Restored all 29 WiFi access-list MAC→VLAN entries (were missing/lost)
- **[WIFI]** Fixed cap-wifi2 band mismatch: was `band=2ghz-n` with frequency=5220 (5GHz), corrected to frequency=2412
- **[CAPSMAN]** Enabled bridge VLAN filtering on CAP (cAP XL ac) — all VLANs now properly tagged through CAP
- **[CAPSMAN]** CAP bridgeLocal config: vlan-filtering=yes, pvid=10, VLANs 10/20/25/30/35/40 with proper tagged/untagged members
- **[CAPSMAN]** Set `capdp` datapath vlan-id=40 for default PVID on dynamic wifi bridge ports
- **[CAPSMAN]** VLAN assignment through CAP now working — access-list vlan-id entries propagate correctly
- **[NETWORK]** Fixed AdGuard Home IP conflict: container was at 192.168.10.2 (CAP's IP), now static at 192.168.10.10
- **[NETWORK]** Fixed adguardhome-sync IP conflict: was at 192.168.10.3 (CSS326's IP), now static at 192.168.10.11
- **[WIFI]** Added Xiaomi Air Purifier 2 (C8:5C:CC:40:B4:AA) to access-list as VLAN 30 (IoT)
### WiFi Quality Optimization
- **[WIFI]** Fixed 2.4GHz co-channel interference: HAP on ch 1 (2412), CAP moved from ch 1 to ch 6 (2437)
- **[WIFI]** Fixed 5GHz overlap: HAP stays ch 36 (5180, 40MHz), CAP moved from ch 44 (5220) to ch 52 (5260, DFS)
- **[WIFI]** Fixed CAP 2.4GHz width from 40MHz to 20MHz for IoT compatibility
- **[WIFI]** TX power kept at defaults (17/16 dBm) — reduction caused kitchen coverage loss through concrete walls
## 2026-02-24
### Motherboard Replacement & NVMe Cache Pool
- **[HARDWARE]** Replaced XTRM-U motherboard — new MAC `38:05:25:35:8E:7A`, DHCP lease updated on MikroTik
- **[HARDWARE]** Confirmed disk1 (10TB HGST HUH721010ALE601, serial 2TKK3K1D) mechanically dead — clicking heads, fails on multiple SATA ports and new motherboard
- **[STORAGE]** Created new Unraid-managed cache pool: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1 (~1.8TB usable)
- **[STORAGE]** Pool settings: autotrim=on, compression=on
- **[DOCKER]** Migrated Docker from btrfs loopback image (disk1 HDD) to ZFS on NVMe cache pool
- **[DOCKER]** Docker now uses ZFS storage driver directly on `cache/system/docker` dataset
- **[DOCKER]** Recreated `dockerproxy` bridge network, rebuilt all 39 container templates
- **[DOCKER]** Restarted Dockge and critical stacks (adguardhome, ntfy, gitea, woodpecker, etc.)
- **[STORAGE]** Deleted old `docker.img` (200GB) from disk1
- **[INCIDENT]** disk1 still running in parity-emulated mode — replacement drive needed
### Post-Migration Container Cleanup
- **[NETWORK]** Fixed Traefik unreachable: removed stale Docker bridge (duplicate 172.18.0.0/16 subnet) + 7 orphaned bridges
- **[DOCKER]** Removed deprecated containers: DoH-Server, binhex-plexpass (duplicate of Plex)
- **[DOCKER]** Removed obsolete containers: HomeAssistant_inabox, Docker-WebUI, hass-unraid
- **[DOCKER]** Removed nextcloud-aio-mastercontainer (replaced by Nextcloud container)
- **[SERVICE]** Fixed adguardhome-sync: recreated config file (was directory from migration), switched to br0 network for macvlan reachability
- **[SERVICE]** Fixed diode stack: recreated .env, nginx.conf, OAuth2 client config; ran Hydra DB migration and client bootstrap
- **[SERVICE]** Fixed diode-agent: corrected YAML format, secrets, and Hydra authentication
- **[SERVICE]** Started unmarr (Homarr fork, 172.18.0.81) and rustfs (S3-compatible storage)
- **[DOCKER]** Final state: 53 containers running, pgAdmin4 stopped (utility)
- **[DOCS]** Updated 03-SERVICES-OTHER.md with removed containers
---
## 2026-02-14


@@ -0,0 +1,91 @@
# Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)
**Date:** 2026-02-20
**Severity:** P2 - Degraded (no redundancy)
**Status:** Open — awaiting replacement drive (motherboard replaced, NVMe cache pool added Feb 24)
**Affected:** XTRM-U (Unraid NAS) — disk1 (data drive)
---
## Summary
disk1 (10TB HGST Ultrastar HUH721010ALE601, serial `2TKK3K1D`) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in **degraded/emulated mode**, reconstructing disk1 data from parity on the fly. All data is intact but there is **zero redundancy**.
---
## Timeline
| When | What |
|------|------|
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
| Feb 24 | Motherboard replaced. Dead drive confirmed still dead on new hardware. New SATA port assignment. Drive is mechanically failed (clicking heads) |
| Feb 24 | New cache pool created: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1. Docker migrated from HDD loopback to NVMe ZFS |
## Drive Details
| Field | Value |
|-------|-------|
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 KiB) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on md1p1) |
| Last known device | sdc |
| Accumulated md errors | 213 |
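
The capacity figure in the table can be sanity-checked with shell arithmetic. This sketch assumes the number counts 1 KiB blocks (an assumption; read as 512-byte sectors it would come to only about 5 TB, which does not match a 10TB drive).

```shell
# Block count from the Drive Details table.
blocks=9766436812

# Assumption: each block is 1 KiB (1024 bytes).
bytes=$(( blocks * 1024 ))

echo "$bytes bytes"
echo "$(( bytes / 1000000000000 )) TB (decimal, rounded down)"
```

Under that assumption the result is 10000831295488 bytes, which is the expected size for a 10TB drive.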
## Current State
- **Array**: STARTED, degraded — disk1 emulated from parity (`sdb`)
- **ZFS pool `disk1`**: ONLINE, 0 errors, mounted on `md1p1` (parity reconstruction)
- **Parity drive** (`sdb`, serial `7PHBNYZC`): DISK_OK, 0 errors
- **All services**: Running normally (Docker containers, VMs)
- **Risk**: If parity drive fails, data is **unrecoverable**
## Diagnosis
- Drive fails on multiple SATA ports → not a port/cable issue
- Clicking noise on boot → mechanical head failure
- dmesg shows link responds but device never becomes ready → drive electronics partially functional, platters/heads dead
- Drive is beyond DIY repair
## Root Cause
Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to cache drive migration that happened around the same time — confirmed by syslog showing clean SATA link failure.
---
## Recovery Plan
### Step 1: Get Replacement Drive
- Must be 10TB or larger
- Check WD warranty: serial `HUH721010ALE601_2TKK3K1D` at https://support-en.wd.com/app/warrantycheck
- Any 3.5" SATA drive works (doesn't need to match model)
### Step 2: Install & Rebuild
1. Power off the server
2. Remove dead drive, install replacement in any SATA port
3. Boot Unraid
4. Go to **Main** → click on **Disk 1** (will show as "Not installed" or unmapped)
5. Stop the array
6. Assign the new drive to the **Disk 1** slot
7. Start the array — Unraid will prompt to **rebuild** from parity
8. Rebuild will take many hours for 10TB — do NOT interrupt
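
To put a rough number on "many hours", the rebuild time can be estimated from drive size and average write speed. The 150 MB/s rate below is an assumed figure for illustration, not a measurement from this server, and `estimate_rebuild` is a hypothetical helper name.

```shell
# Estimate rebuild duration from size (bytes) and average speed (MB/s).
# Uses integer arithmetic, so the result is rounded down to the minute.
estimate_rebuild() {
    size_bytes=$1
    rate_mb_s=$2
    secs=$(( size_bytes / (rate_mb_s * 1000000) ))
    echo "$(( secs / 3600 ))h $(( (secs % 3600) / 60 ))m"
}

# 10 TB at an assumed 150 MB/s average.
estimate_rebuild 10000000000000 150
```

At the assumed rate this gives about 18.5 hours. Real throughput drops toward the inner tracks of the disk and under concurrent load, so the actual rebuild may take longer.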
### Step 3: Post-Rebuild
1. Verify ZFS pool `disk1` is healthy: `zpool status disk1`
2. Run parity check from Unraid UI
3. Run SMART extended test on new drive: `smartctl -t long /dev/sdX`
4. Verify all ZFS datasets are intact
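
The checks above can be collected into a small script. This is a minimal sketch: the function names are hypothetical, `/dev/sdX` must be replaced with the new drive's device, and the health check relies on `zpool status -x` printing a line containing "is healthy" for a healthy pool.

```shell
# Return 0 when `zpool status -x` reports the given pool as healthy.
pool_is_healthy() {
    zpool status -x "$1" | grep -q "is healthy"
}

# Run the post-rebuild checks for one pool and one drive.
post_rebuild_check() {
    pool=$1
    dev=$2
    if pool_is_healthy "$pool"; then
        echo "$pool: healthy"
    else
        echo "$pool: needs attention"
        zpool status "$pool"
        return 1
    fi
    # List all datasets so missing ones are easy to spot.
    zfs list -r -o name,used,avail,mountpoint "$pool"
    # Show the SMART self-test log (run after the extended test finishes).
    smartctl -l selftest "$dev"
}
```

Call it as `post_rebuild_check disk1 /dev/sdX` once the extended SMART test has completed. The parity check is still started from the Unraid UI.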
---
## Notes
- Server can keep running in degraded mode until the replacement drive arrives, but with no parity protection (a second drive failure means data loss)
- Avoid heavy writes if possible to reduce risk to parity drive
- New cache pool (3x Samsung 990 EVO Plus 1TB, ZFS RAIDZ1) now hosts all Docker containers
- Old docker.img loopback deleted from disk1 (200GB freed)
- Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair