infrastructure/docs/incidents/2026-02-20-disk1-hardware-failure.md
Kaloyan Danchev bf6a62a275 Add incident report: disk1 hardware failure (clicking/head crash)
HGST Ultrastar 10TB drive (serial 2TKK3K1D) failed on Feb 18.
Array running degraded on parity emulation. Recovery plan documented.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 17:54:23 +02:00


Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)

Date: 2026-02-20
Severity: P2 - Degraded (no redundancy)
Status: Open - awaiting replacement drive
Affected: XTRM-U (Unraid NAS) - disk1 (data drive)


Summary

disk1 (10TB HGST Ultrastar HUH721010ALE601, serial 2TKK3K1D) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in degraded/emulated mode, reconstructing disk1 data from parity on the fly. All data is intact but there is zero redundancy.
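On the NAS itself, the disabled/emulated state can be confirmed from the md driver's status file. A minimal sketch, assuming Unraid's custom `/proc/mdstat` format (field names like `rdevStatus.N` are Unraid-specific, not stock mdraid, so the check is guarded and degrades gracefully on other hosts):

```shell
#!/bin/sh
# Sketch: confirm disk1 is disabled/emulated on an Unraid host.
# Assumption: Unraid's md driver exposes per-disk fields (rdevStatus.N,
# rdevNumErrors.N) in /proc/mdstat; stock Linux mdstat looks different.
if grep -q 'rdevStatus' /proc/mdstat 2>/dev/null; then
  grep -E 'rdevStatus\.1|rdevNumErrors\.1' /proc/mdstat
else
  echo "not an Unraid host; skipping md status check"
fi
MD_CHECK_DONE=1
echo "md status check done"
```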


Timeline

| When | What |
| --- | --- |
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `ata5: reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated - md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
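The kernel messages in the timeline can be pulled out of a saved syslog with one grep. The sample below replays the exact messages quoted above against a temp file, so the pattern is easy to adapt to the real log path (`/var/log/syslog` on the server):

```shell
#!/bin/sh
# Extract the ata failure sequence from a syslog. The sample file reuses
# only the messages quoted in the timeline above; point LOG at the real
# syslog when running this on the server.
LOG=/tmp/sample-syslog.txt
cat > "$LOG" <<'EOF'
kernel: ata5: qc timeout
kernel: ata5: reset failed, giving up
kernel: ata5.00: disable device
EOF
MATCHES=$(grep -cE 'ata[0-9]+(\.[0-9]+)?: (qc timeout|reset failed|disable device)' "$LOG")
echo "matched $MATCHES failure lines"
```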

Drive Details

| Field | Value |
| --- | --- |
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 sectors) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on `md1p1`) |
| Last known device | `sdc` |
| Accumulated md errors | 213 |
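A quick sanity check on the capacity figure. Treating the reported count as 1 KiB blocks (an assumption, but it is what makes the number line up with a 10TB drive; 512-byte sectors would give only ~5TB):

```shell
#!/bin/sh
# Sanity-check the reported size from the table above.
# Assumption: Unraid reports the count in 1 KiB blocks, not 512-byte sectors.
BLOCKS=9766436812
BYTES=$((BLOCKS * 1024))
TB_DEC=$((BYTES / 1000000000000))
echo "$BYTES bytes = ~$TB_DEC TB"   # ~10 TB, consistent with the drive model
```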

Current State

  • Array: STARTED, degraded — disk1 emulated from parity (sdb)
  • ZFS pool disk1: ONLINE, 0 errors, mounted on md1p1 (parity reconstruction)
  • Parity drive (sdb, serial 7PHBNYZC): DISK_OK, 0 errors
  • All services: Running normally (Docker containers, VMs)
  • Risk: If the parity drive (sdb) also fails before the rebuild completes, disk1's data is unrecoverable
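The state above can be re-verified at any time with two commands. A sketch, guarded so it is safe to paste on any host (`zpool`/`smartctl` only exist on the NAS, and `sdb` is the parity device per this report):

```shell
#!/bin/sh
# Sketch: re-verify the degraded-but-healthy state described above.
if command -v zpool >/dev/null 2>&1; then
  zpool status disk1      # expect: state ONLINE, "No known data errors"
else
  echo "zpool not available on this host"
fi
if command -v smartctl >/dev/null 2>&1; then
  smartctl -H /dev/sdb    # parity drive overall health
else
  echo "smartctl not available on this host"
fi
CHECKS_DONE=1
echo "degraded-state checks finished"
```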

Diagnosis

  • Drive fails on multiple SATA ports → not a port/cable issue
  • Clicking noise on boot → mechanical head failure
  • dmesg shows link responds but device never becomes ready → drive electronics partially functional, platters/heads dead
  • Drive is beyond DIY repair

Root Cause

Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to the cache drive migration that happened around the same time; the syslog confirms a clean SATA link failure at the hardware level, with no filesystem involvement.


Recovery Plan

Step 1: Get Replacement Drive

Any healthy drive of at least 10TB works. In Unraid the replacement data drive must be at least as large as the failed drive and no larger than the parity drive.

Step 2: Install & Rebuild

  1. Power off the server
  2. Remove dead drive, install replacement in any SATA port
  3. Boot Unraid
  4. Go to Main → click on Disk 1 (will show as "Not installed" or unmapped)
  5. Stop the array
  6. Assign the new drive to the Disk 1 slot
  7. Start the array — Unraid will prompt to rebuild from parity
  8. Rebuild will take many hours for 10TB — do NOT interrupt
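"Many hours" can be estimated up front. A back-of-the-envelope sketch, assuming the rebuild sustains an average of 100 MB/s (spinning-disk parity rebuilds typically land in the 100-200 MB/s range, so treat this as an upper-bound guess):

```shell
#!/bin/sh
# Rough rebuild-time estimate for the 10TB drive.
# Assumption: 100 MB/s sustained average rebuild rate.
SIZE_BYTES=$((9766436812 * 1024))   # size from the drive details table
RATE=100000000                      # assumed bytes per second
HOURS=$((SIZE_BYTES / RATE / 3600))
echo "estimated rebuild time: ~$HOURS hours at 100 MB/s"
```

At 100 MB/s this works out to roughly a day of continuous rebuilding, which is why the "do NOT interrupt" warning matters.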

Step 3: Post-Rebuild

  1. Verify ZFS pool disk1 is healthy: zpool status disk1
  2. Run parity check from Unraid UI
  3. Run SMART extended test on new drive: smartctl -t long /dev/sdX
  4. Verify all ZFS datasets are intact
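The post-rebuild steps can be kicked off as one script. A sketch to run on the NAS itself; `/dev/sdX` is the placeholder from step 3 and must be replaced with the new drive's real device name, and each tool is guarded so the script is harmless elsewhere:

```shell
#!/bin/sh
# Sketch of the post-rebuild checks. NEW_DEV is a placeholder; substitute
# the replacement drive's actual device node before running.
NEW_DEV=/dev/sdX
if command -v zpool >/dev/null 2>&1; then
  zpool status disk1            # pool health after the rebuild
else
  echo "zpool not available on this host"
fi
if command -v smartctl >/dev/null 2>&1; then
  smartctl -t long "$NEW_DEV"   # extended self-test; poll results later
                                # with: smartctl -l selftest "$NEW_DEV"
else
  echo "smartctl not available on this host"
fi
POST_CHECKS=1
echo "post-rebuild checks issued"
```

The parity check (step 2) still has to be started from the Unraid UI; it is not covered by this sketch.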

Notes

  • Server is safe to run in degraded mode indefinitely, just without parity protection
  • Avoid heavy writes if possible to reduce risk to parity drive
  • The two NVMe SSDs (cache pool, ZFS mirror) are unaffected
  • Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair