infrastructure/docs/incidents/2026-02-20-disk1-hardware-failure.md
Kaloyan Danchev bf6a62a275 Add incident report: disk1 hardware failure (clicking/head crash)
HGST Ultrastar 10TB drive (serial 2TKK3K1D) failed on Feb 18.
Array running degraded on parity emulation. Recovery plan documented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 17:54:23 +02:00


Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)

Date: 2026-02-20
Severity: P2 - Degraded (no redundancy)
Status: Open - awaiting replacement drive
Affected: XTRM-U (Unraid NAS) - disk1 (data drive)


Summary

disk1 (10TB HGST Ultrastar HUH721010ALE601, serial 2TKK3K1D) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in degraded/emulated mode, reconstructing disk1 data from parity on the fly. All data is intact but there is zero redundancy.
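On the NAS itself, the disabled/emulated state can be confirmed from the md driver's status file. A minimal sketch, assuming Unraid's custom `/proc/mdstat` format (field names like `rdevStatus.N` are Unraid-specific, not stock mdraid, so the check is guarded and degrades gracefully on other hosts):

```shell
#!/bin/sh
# Sketch: confirm disk1 is disabled/emulated on an Unraid host.
# Assumption: Unraid's md driver exposes per-disk fields (rdevStatus.N,
# rdevNumErrors.N) in /proc/mdstat; stock Linux mdstat looks different.
if grep -q 'rdevStatus' /proc/mdstat 2>/dev/null; then
  grep -E 'rdevStatus\.1|rdevNumErrors\.1' /proc/mdstat
else
  echo "not an Unraid host; skipping md status check"
fi
MD_CHECK_DONE=1
echo "md status check done"
```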


Timeline

| When | What |
| --- | --- |
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `ata5: reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated - md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
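The kernel messages in the timeline can be pulled out of a saved syslog with one grep. The sample below replays the exact messages quoted above against a temp file, so the pattern is easy to adapt to the real log path (`/var/log/syslog` on the server):

```shell
#!/bin/sh
# Extract the ata failure sequence from a syslog. The sample file reuses
# only the messages quoted in the timeline above; point LOG at the real
# syslog when running this on the server.
LOG=/tmp/sample-syslog.txt
cat > "$LOG" <<'EOF'
kernel: ata5: qc timeout
kernel: ata5: reset failed, giving up
kernel: ata5.00: disable device
EOF
MATCHES=$(grep -cE 'ata[0-9]+(\.[0-9]+)?: (qc timeout|reset failed|disable device)' "$LOG")
echo "matched $MATCHES failure lines"
```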

Drive Details

| Field | Value |
| --- | --- |
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 sectors) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on `md1p1`) |
| Last known device | `sdc` |
| Accumulated md errors | 213 |
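A quick sanity check on the capacity figure. Treating the reported count as 1 KiB blocks (an assumption, but it is what makes the number line up with a 10TB drive; 512-byte sectors would give only ~5TB):

```shell
#!/bin/sh
# Sanity-check the reported size from the table above.
# Assumption: Unraid reports the count in 1 KiB blocks, not 512-byte sectors.
BLOCKS=9766436812
BYTES=$((BLOCKS * 1024))
TB_DEC=$((BYTES / 1000000000000))
echo "$BYTES bytes = ~$TB_DEC TB"   # ~10 TB, consistent with the drive model
```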

Current State

  • Array: STARTED, degraded — disk1 emulated from parity (sdb)
  • ZFS pool disk1: ONLINE, 0 errors, mounted on md1p1 (parity reconstruction)
  • Parity drive (sdb, serial 7PHBNYZC): DISK_OK, 0 errors
  • All services: Running normally (Docker containers, VMs)
  • Risk: If the parity drive (sdb) also fails before the rebuild completes, disk1's data is unrecoverable
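The state above can be re-verified at any time with two commands. A sketch, guarded so it is safe to paste on any host (`zpool`/`smartctl` only exist on the NAS, and `sdb` is the parity device per this report):

```shell
#!/bin/sh
# Sketch: re-verify the degraded-but-healthy state described above.
if command -v zpool >/dev/null 2>&1; then
  zpool status disk1      # expect: state ONLINE, "No known data errors"
else
  echo "zpool not available on this host"
fi
if command -v smartctl >/dev/null 2>&1; then
  smartctl -H /dev/sdb    # parity drive overall health
else
  echo "smartctl not available on this host"
fi
CHECKS_DONE=1
echo "degraded-state checks finished"
```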

Diagnosis

  • Drive fails on multiple SATA ports → not a port/cable issue
  • Clicking noise on boot → mechanical head failure
  • dmesg shows link responds but device never becomes ready → drive electronics partially functional, platters/heads dead
  • Drive is beyond DIY repair

Root Cause

Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to the cache drive migration that happened around the same time; the syslog confirms a clean SATA link failure at the hardware level, with no filesystem involvement.


Recovery Plan

Step 1: Get Replacement Drive

Any healthy drive of at least 10TB works. In Unraid the replacement data drive must be at least as large as the failed drive and no larger than the parity drive.

Step 2: Install & Rebuild

  1. Power off the server
  2. Remove dead drive, install replacement in any SATA port
  3. Boot Unraid
  4. Go to Main → click on Disk 1 (will show as "Not installed" or unmapped)
  5. Stop the array
  6. Assign the new drive to the Disk 1 slot
  7. Start the array — Unraid will prompt to rebuild from parity
  8. Rebuild will take many hours for 10TB — do NOT interrupt
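"Many hours" can be estimated up front. A back-of-the-envelope sketch, assuming the rebuild sustains an average of 100 MB/s (spinning-disk parity rebuilds typically land in the 100-200 MB/s range, so treat this as an upper-bound guess):

```shell
#!/bin/sh
# Rough rebuild-time estimate for the 10TB drive.
# Assumption: 100 MB/s sustained average rebuild rate.
SIZE_BYTES=$((9766436812 * 1024))   # size from the drive details table
RATE=100000000                      # assumed bytes per second
HOURS=$((SIZE_BYTES / RATE / 3600))
echo "estimated rebuild time: ~$HOURS hours at 100 MB/s"
```

At 100 MB/s this works out to roughly a day of continuous rebuilding, which is why the "do NOT interrupt" warning matters.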

Step 3: Post-Rebuild

  1. Verify ZFS pool disk1 is healthy: zpool status disk1
  2. Run parity check from Unraid UI
  3. Run SMART extended test on new drive: smartctl -t long /dev/sdX
  4. Verify all ZFS datasets are intact
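The post-rebuild steps can be kicked off as one script. A sketch to run on the NAS itself; `/dev/sdX` is the placeholder from step 3 and must be replaced with the new drive's real device name, and each tool is guarded so the script is harmless elsewhere:

```shell
#!/bin/sh
# Sketch of the post-rebuild checks. NEW_DEV is a placeholder; substitute
# the replacement drive's actual device node before running.
NEW_DEV=/dev/sdX
if command -v zpool >/dev/null 2>&1; then
  zpool status disk1            # pool health after the rebuild
else
  echo "zpool not available on this host"
fi
if command -v smartctl >/dev/null 2>&1; then
  smartctl -t long "$NEW_DEV"   # extended self-test; poll results later
                                # with: smartctl -l selftest "$NEW_DEV"
else
  echo "smartctl not available on this host"
fi
POST_CHECKS=1
echo "post-rebuild checks issued"
```

The parity check (step 2) still has to be started from the Unraid UI; it is not covered by this sketch.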

Notes

  • Server is safe to run in degraded mode indefinitely, just without parity protection
  • Avoid heavy writes if possible to reduce risk to parity drive
  • The two NVMe SSDs (cache pool, ZFS mirror) are unaffected
  • Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair