infrastructure/docs/incidents/2026-02-20-disk1-hardware-failure.md
Kaloyan Danchev 877aa71d3e
Update docs: motherboard swap, NVMe cache pool, Docker migration
- New motherboard installed, MAC/DHCP updated
- 3x Samsung 990 EVO Plus 1TB NVMe cache pool (ZFS RAIDZ1)
- Docker migrated from HDD loopback to NVMe ZFS storage driver
- disk1 confirmed dead (clicking heads), still on parity emulation
- Hardware inventory, changelog, and incident report updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 14:47:07 +02:00


Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)

Date: 2026-02-20
Severity: P2 - Degraded (no redundancy)
Status: Open — awaiting replacement drive (motherboard replaced, NVMe cache pool added Feb 24)
Affected: XTRM-U (Unraid NAS) — disk1 (data drive)


Summary

disk1 (10TB HGST Ultrastar HUH721010ALE601, serial 2TKK3K1D) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in degraded/emulated mode, reconstructing disk1 data from parity on the fly. All data is intact but there is zero redundancy.
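Unraid's single parity is a bytewise XOR across the data disks, which is what lets the array keep serving disk1's contents with the drive physically gone. A toy sketch of the mechanism (byte values are made up for illustration, not real array data):

```python
# Toy model of single-parity (XOR) emulation, the mechanism behind Unraid's
# "emulated disk". Byte values are hypothetical.
disk1 = bytes([0xDE, 0xAD, 0xBE])  # the drive that will fail
disk2 = bytes([0x10, 0x20, 0x30])
disk3 = bytes([0x0F, 0xF0, 0xAA])

# Parity written while all disks were healthy: XOR across every data disk.
parity = bytes(a ^ b ^ c for a, b, c in zip(disk1, disk2, disk3))

# disk1 drops off the bus; its bytes are recomputed from parity + survivors.
emulated_disk1 = bytes(p ^ b ^ c for p, b, c in zip(parity, disk2, disk3))
assert emulated_disk1 == disk1
print("disk1 emulated from parity:", emulated_disk1.hex())
```

This is also why the "zero redundancy" warning holds: with disk1 already being reconstructed from parity, losing any second drive leaves the XOR equation unsolvable.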


Timeline

| When | What |
| --- | --- |
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
| Feb 24 | Motherboard replaced. Dead drive confirmed still dead on new hardware, new SATA port assignment. Drive is mechanically failed (clicking heads) |
| Feb 24 | New cache pool created: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1. Docker migrated from HDD loopback to NVMe ZFS |
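The failure signature above can be pulled out of the kernel log with a single grep. The sample lines below are illustrative stand-ins matching the messages seen in this incident; on the live server you would pipe `dmesg` (or read the syslog) instead of the sample file:

```shell
# Filter kernel log for the ATA failure sequence seen in this incident.
# Sample lines are illustrative; on the server, use `dmesg` as the input.
cat > /tmp/dmesg-sample.txt <<'EOF'
[  512.001] ata5.00: qc timeout (cmd 0xec)
[  515.120] ata5: hard resetting link
[  530.330] ata5: reset failed, giving up
[  530.331] ata5.00: disable device
EOF
grep -E 'ata[0-9]+(\.[0-9]+)?: (qc timeout|reset failed|disable device)' /tmp/dmesg-sample.txt
```

Seeing `disable device` after repeated failed resets is the kernel permanently giving up on the drive, which matches the `DISK_DSBL` state md recorded two minutes later.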

Drive Details

| Field | Value |
| --- | --- |
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 sectors) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on `md1p1`) |
| Last known device | `sdc` |
| Accumulated md errors | 213 |

Current State

  • Array: STARTED, degraded — disk1 emulated from parity (sdb)
  • ZFS pool disk1: ONLINE, 0 errors, mounted on md1p1 (parity reconstruction)
  • Parity drive (sdb, serial 7PHBNYZC): DISK_OK, 0 errors
  • All services: Running normally (Docker containers, VMs)
  • Risk: If parity drive fails, data is unrecoverable

Diagnosis

  • Drive fails on multiple SATA ports → not a port/cable issue
  • Clicking noise on boot → mechanical head failure
  • dmesg shows link responds but device never becomes ready → drive electronics partially functional, platters/heads dead
  • Drive is beyond DIY repair

Root Cause

Mechanical failure of the hard drive (clicking indicates a head crash or seized actuator). Not related to the cache drive migration that happened around the same time, as confirmed by the syslog showing a clean SATA link failure.


Recovery Plan

Step 1: Get Replacement Drive

Step 2: Install & Rebuild

  1. Power off the server
  2. Remove dead drive, install replacement in any SATA port
  3. Boot Unraid
  4. Go to Main → click on Disk 1 (will show as "Not installed" or unmapped)
  5. Stop the array
  6. Assign the new drive to the Disk 1 slot
  7. Start the array — Unraid will prompt to rebuild from parity
  8. Rebuild will take many hours for 10TB — do NOT interrupt
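To set expectations for step 8, a back-of-the-envelope estimate. The ~150 MB/s sustained speed is an assumption for a 10TB 7200 rpm drive; actual rebuild speed varies with concurrent array activity and slows toward the inner disk zones:

```python
# Rough rebuild-time estimate for a 10 TB drive.
# 150 MB/s is an assumed average, not a measured figure.
capacity_bytes = 10e12
speed_bytes_per_s = 150e6
hours = capacity_bytes / speed_bytes_per_s / 3600
print(f"estimated rebuild time: {hours:.1f} h")  # roughly 18.5 h
```

So plan for most of a day of continuous disk activity, during which the array remains usable but slow.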

Step 3: Post-Rebuild

  1. Verify ZFS pool disk1 is healthy: zpool status disk1
  2. Run parity check from Unraid UI
  3. Run SMART extended test on new drive: smartctl -t long /dev/sdX
  4. Verify all ZFS datasets are intact
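The SMART check in step 3 can be verified mechanically once the extended test finishes. The heredoc below is a hypothetical excerpt of `smartctl -l selftest` output used only to show the check; on the server, read the real self-test log for the new drive:

```shell
# Confirm the extended self-test completed cleanly. Replace the sample heredoc
# with `smartctl -l selftest /dev/sdX` output on the real drive; the hour
# count and slot number here are hypothetical.
cat > /tmp/selftest.txt <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1337         -
EOF
grep -q 'Completed without error' /tmp/selftest.txt && echo "SMART self-test: PASS"
```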

Notes

  • Server is safe to run in degraded mode indefinitely, but it has no parity protection until disk1 is rebuilt
  • Avoid heavy writes if possible to reduce risk to parity drive
  • New cache pool (3x Samsung 990 EVO Plus 1TB, ZFS RAIDZ1) now hosts all Docker containers
  • Old docker.img loopback deleted from disk1 (200GB freed)
  • Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair