Compare commits: 0119c4d4d8...877aa71d3e

2 commits: 877aa71d3e, bf6a62a275
```diff
@@ -1,6 +1,6 @@
 # Hardware Inventory
 
-**Last Updated:** 2026-02-14
+**Last Updated:** 2026-02-24
 
 ---
 
```
```diff
@@ -109,18 +109,27 @@
 | **IP** | 192.168.10.20 |
 | **OS** | Unraid 6.x |
 
+**Motherboard:** Replaced 2026-02-24 (new board, details TBD)
+
 **Network:**
 | Interface | MAC | Speed |
 |-----------|-----|-------|
-| eth1 | A8:B8:E0:02:B6:15 | 2.5G |
-| eth2 | A8:B8:E0:02:B6:16 | 2.5G |
-| eth3 | A8:B8:E0:02:B6:17 | 2.5G |
-| eth4 | A8:B8:E0:02:B6:18 | 2.5G |
-| **bond0** | (virtual) | 5G aggregate |
+| br0 | 38:05:25:35:8E:7A | 2.5G |
 
 **Storage:**
-- Cache: (current NVMe)
-- Array: 3.5" HDDs
+| Device | Model | Size | Role | Status |
+|--------|-------|------|------|--------|
+| sdb | HUH721010ALE601 (serial 7PHBNYZC) | 10TB | Parity | OK |
+| disk1 | HUH721010ALE601 (serial 2TKK3K1D) | 10TB | Data (ZFS) | **FAILED** — clicking/head crash, emulated from parity |
+| nvme0n1 | Samsung 990 EVO Plus 1TB | 1TB | Cache pool (RAIDZ1) | OK |
+| nvme1n1 | Samsung 990 EVO Plus 1TB | 1TB | Cache pool (RAIDZ1) | OK |
+| nvme2n1 | Samsung 990 EVO Plus 1TB | 1TB | Cache pool (RAIDZ1) | OK |
 
+**ZFS Pools:**
+| Pool | Devices | Profile | Usable | Purpose |
+|------|---------|---------|--------|---------|
+| disk1 | md1p1 (parity-emulated) | single | 9.1TB | Main data (roms, media, appdata, backups) |
+| cache | 3x Samsung 990 EVO Plus 1TB NVMe | RAIDZ1 | ~1.8TB | Docker, containers |
+
 **Virtual IPs:**
 | IP | Purpose |
```
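The "~1.8TB" usable figure for the cache pool follows from RAIDZ1 arithmetic: with three 1TB drives, one drive's worth of space goes to parity, and the remaining ~2TB (decimal) is roughly 1.8TiB once reported in binary units. A quick sanity check, as a standalone sketch (it does not read any real pool):

```python
# RAIDZ1 usable-capacity sanity check for the 3x 1TB cache pool.
# With n drives in RAIDZ1, roughly (n - 1) drives' worth of space is usable.
n_drives = 3
drive_tb = 1.0                           # marketed size, decimal terabytes

usable_tb = (n_drives - 1) * drive_tb    # ~2.0 TB decimal
usable_tib = usable_tb * 1e12 / 2**40    # same space in binary TiB

print(f"usable: {usable_tb:.1f} TB decimal = {usable_tib:.2f} TiB")  # ~1.82 TiB
```

ZFS additionally reserves a small amount of space for metadata and slop, so the real number lands slightly below this estimate.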
```diff
@@ -223,6 +232,7 @@ See: `wip/UPGRADE-2026-HARDWARE.md`
 |--------|------|--------|
 | XTRM-N5 (Minisforum N5 Air) | Production server | Planned |
 | XTRM-N1 (N100 ITX) | Survival node | Planned |
-| 3x Samsung 990 EVO Plus 1TB | XTRM-N5 NVMe pool | Planned |
+| 3x Samsung 990 EVO Plus 1TB | XTRM-U cache pool (RAIDZ1) | **Installed** 2026-02-24 |
 | 2x Fikwot FX501Pro 512GB | XTRM-N1 mirror | Planned |
+| 1x 10TB+ HDD | Replace failed disk1 | **Needed** |
 | MikroTik CRS310-8G+2S+IN | Replace ZX1 | Future |
```
```diff
@@ -4,6 +4,22 @@
 
 ---
 
+## 2026-02-24
+
+### Motherboard Replacement & NVMe Cache Pool
+- **[HARDWARE]** Replaced XTRM-U motherboard — new MAC `38:05:25:35:8E:7A`, DHCP lease updated on MikroTik
+- **[HARDWARE]** Confirmed disk1 (10TB HGST HUH721010ALE601, serial 2TKK3K1D) mechanically dead — clicking heads, fails on multiple SATA ports and new motherboard
+- **[STORAGE]** Created new Unraid-managed cache pool: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1 (~1.8TB usable)
+- **[STORAGE]** Pool settings: autotrim=on, compression=on
+- **[DOCKER]** Migrated Docker from btrfs loopback image (disk1 HDD) to ZFS on NVMe cache pool
+- **[DOCKER]** Docker now uses ZFS storage driver directly on `cache/system/docker` dataset
+- **[DOCKER]** Recreated `dockerproxy` bridge network, rebuilt all 39 container templates
+- **[DOCKER]** Restarted Dockge and critical stacks (adguardhome, ntfy, gitea, woodpecker, etc.)
+- **[STORAGE]** Deleted old `docker.img` (200GB) from disk1
+- **[INCIDENT]** disk1 still running in parity-emulated mode — replacement drive needed
+
+---
+
 ## 2026-02-14
 
 ### CAP XL ac Recovery
```
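The cache pool in the [STORAGE] entries above was created through the Unraid GUI. As a rough command-line equivalent, here is a sketch only: the device paths are assumptions for illustration, not taken from the system, and this should not be run against live disks.

```shell
# Sketch of the GUI-created pool: 3-device RAIDZ1 with the listed settings.
# autotrim is a pool property (-o); compression is a dataset property (-O)
# inherited by every dataset in the pool. Device paths are illustrative.
zpool create -o autotrim=on -O compression=on \
  cache raidz1 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1

# Dataset Docker was migrated onto, per the [DOCKER] entries
# (-p creates the intermediate cache/system dataset as needed)
zfs create -p cache/system/docker
```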
**docs/incidents/2026-02-20-disk1-hardware-failure.md** (new file, 91 lines)
# Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)

**Date:** 2026-02-20
**Severity:** P2 - Degraded (no redundancy)
**Status:** Open — awaiting replacement drive (motherboard replaced, NVMe cache pool added Feb 24)
**Affected:** XTRM-U (Unraid NAS) — disk1 (data drive)

---

## Summary

disk1 (10TB HGST Ultrastar HUH721010ALE601, serial `2TKK3K1D`) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in **degraded/emulated mode**, reconstructing disk1 data from parity on the fly. All data is intact, but there is **zero redundancy**.

---

## Timeline

| When | What |
|------|------|
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
| Feb 24 | Motherboard replaced. Dead drive confirmed still dead on new hardware. New SATA port assignment. Drive is mechanically failed (clicking heads) |
| Feb 24 | New cache pool created: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1. Docker migrated from HDD loopback to NVMe ZFS |
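The libata failure sequence in the timeline (qc timeout, then reset failed, then disable device) is easy to pull out of a syslog capture. A small self-contained sketch against sample lines (the log lines below are made up to mirror the incident's dmesg pattern, not the actual XTRM-U syslog):

```shell
# Write an illustrative syslog excerpt, then grep it for the three
# libata failure signatures described in the timeline above.
cat > /tmp/syslog-sample.log <<'EOF'
Feb 18 19:15:02 xtrm-u kernel: ata5.00: qc timeout (cmd 0xec)
Feb 18 19:15:40 xtrm-u kernel: ata5: hard resetting link
Feb 18 19:16:55 xtrm-u kernel: ata5: reset failed, giving up
Feb 18 19:16:55 xtrm-u kernel: ata5.00: disable device
EOF

# Matches the qc timeout, reset failed, and disable device lines
grep -E 'ata[0-9]+(\.[0-9]+)?: (qc timeout|reset failed|disable device)' \
  /tmp/syslog-sample.log
```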
## Drive Details

| Field | Value |
|-------|-------|
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 sectors) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on md1p1) |
| Last known device | sdc |
| Accumulated md errors | 213 |
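One detail worth flagging in the table above: 9766436812 is too small to be a count of 512-byte sectors for a 10TB drive (it would imply ~5TB). The figure is consistent with sizes reported in 1KiB blocks, which is likely where it came from (an assumption, since the table only says "sectors"). A quick check of both interpretations:

```python
# Check which unit makes the reported figure add up to a 10TB drive.
blocks = 9_766_436_812

as_512b_sectors = blocks * 512 / 1e12   # ~5.0 TB  -> too small for this drive
as_1kib_blocks = blocks * 1024 / 1e12   # ~10.0 TB -> matches the 10TB capacity

print(f"as 512-byte sectors: {as_512b_sectors:.1f} TB")
print(f"as 1 KiB blocks:     {as_1kib_blocks:.1f} TB")
```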
## Current State

- **Array**: STARTED, degraded — disk1 emulated from parity (`sdb`)
- **ZFS pool `disk1`**: ONLINE, 0 errors, mounted on `md1p1` (parity reconstruction)
- **Parity drive** (`sdb`, serial `7PHBNYZC`): DISK_OK, 0 errors
- **All services**: Running normally (Docker containers, VMs)
- **Risk**: If the parity drive fails, data is **unrecoverable**

## Diagnosis

- Drive fails on multiple SATA ports → not a port/cable issue
- Clicking noise on boot → mechanical head failure
- dmesg shows the link responds but the device never becomes ready → drive electronics partially functional, platters/heads dead
- Drive is beyond DIY repair

## Root Cause

Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to the cache drive migration that happened around the same time — confirmed by syslog showing a clean SATA link failure.

---

## Recovery Plan

### Step 1: Get a Replacement Drive

- Must be 10TB or larger
- Check WD warranty: serial `HUH721010ALE601_2TKK3K1D` at https://support-en.wd.com/app/warrantycheck
- Any 3.5" SATA drive works (it doesn't need to match the model)
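Before assigning the replacement to the Disk 1 slot, it is worth confirming its identity and health from the console (assuming smartmontools is available, as it is on Unraid; `/dev/sdX` is a placeholder for the new drive's device node):

```shell
# Print model, serial, firmware, and capacity of the candidate drive,
# to confirm it is the one you think it is before the rebuild.
smartctl -i /dev/sdX

# Quick overall SMART health self-assessment before trusting it
smartctl -H /dev/sdX
```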
### Step 2: Install & Rebuild

1. Power off the server
2. Remove the dead drive, install the replacement in any SATA port
3. Boot Unraid
4. Go to **Main** → click on **Disk 1** (it will show as "Not installed" or unmapped)
5. Stop the array
6. Assign the new drive to the **Disk 1** slot
7. Start the array — Unraid will prompt to **rebuild** from parity
8. The rebuild will take many hours for 10TB — do NOT interrupt it

### Step 3: Post-Rebuild

1. Verify ZFS pool `disk1` is healthy: `zpool status disk1`
2. Run a parity check from the Unraid UI
3. Run a SMART extended test on the new drive: `smartctl -t long /dev/sdX`
4. Verify all ZFS datasets are intact
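The post-rebuild checks above can be collected into one console pass. This is a sketch: the zpool/zfs/smartctl commands are standard, the device name is a placeholder, and the scrub is an optional extra on top of the listed steps.

```shell
# 1. Pool health: -x prints "all pools are healthy" or only the sick ones
zpool status -x

# 2. Optional insurance: a scrub re-reads and checksums every allocated
#    block, verifying the parity rebuild end to end at the ZFS layer
zpool scrub disk1
zpool status disk1        # watch scrub progress and final error counts

# 3. SMART extended self-test on the new drive (placeholder device name)
smartctl -t long /dev/sdX
smartctl -a /dev/sdX      # check the self-test log once it completes

# 4. Confirm all datasets are present and mounted
zfs list -r disk1
```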
---

## Notes

- Server is safe to run in degraded mode indefinitely, just without parity protection
- Avoid heavy writes if possible to reduce risk to the parity drive
- New cache pool (3x Samsung 990 EVO Plus 1TB, ZFS RAIDZ1) now hosts all Docker containers
- Old docker.img loopback deleted from disk1 (200GB freed)
- Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair
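For reference on the Docker setup mentioned in the notes: Unraid wires this up through its own Docker settings page, but on a stock Linux Docker host the equivalent of running the `zfs` storage driver on a dedicated dataset would look roughly like this (a sketch under those assumptions, not the Unraid mechanism itself):

```shell
# Stop Docker first, then put /var/lib/docker on a ZFS dataset; the zfs
# storage driver requires it. Dataset name mirrors the one in these notes.
zfs create -p -o mountpoint=/var/lib/docker cache/system/docker

# Select the driver explicitly in /etc/docker/daemon.json
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "zfs"
}
EOF

systemctl restart docker
docker info --format '{{.Driver}}'   # should print: zfs
```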