infrastructure/docs/incidents/2026-02-20-disk1-hardware-failure.md
Kaloyan Danchev 877aa71d3e
Update docs: motherboard swap, NVMe cache pool, Docker migration
- New motherboard installed, MAC/DHCP updated
- 3x Samsung 990 EVO Plus 1TB NVMe cache pool (ZFS RAIDZ1)
- Docker migrated from HDD loopback to NVMe ZFS storage driver
- disk1 confirmed dead (clicking heads), still on parity emulation
- Hardware inventory, changelog, and incident report updated

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 14:47:07 +02:00


Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)

Date: 2026-02-20
Severity: P2 - Degraded (no redundancy)
Status: Open — awaiting replacement drive (motherboard replaced, NVMe cache pool added Feb 24)
Affected: XTRM-U (Unraid NAS) — disk1 (data drive)


Summary

disk1 (10TB HGST Ultrastar HUH721010ALE601, serial 2TKK3K1D) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in degraded/emulated mode, reconstructing disk1 data from parity on the fly. All data is intact but there is zero redundancy.
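Unraid's single parity is a bytewise XOR across the data disks, which is what lets the array keep serving disk1's contents with the drive physically gone. A toy sketch of the mechanism (byte values are made up for illustration, not real array data):

```python
# Toy model of single-parity (XOR) emulation, the mechanism behind Unraid's
# "emulated disk". Byte values are hypothetical.
disk1 = bytes([0xDE, 0xAD, 0xBE])  # the drive that will fail
disk2 = bytes([0x10, 0x20, 0x30])
disk3 = bytes([0x0F, 0xF0, 0xAA])

# Parity written while all disks were healthy: XOR across every data disk.
parity = bytes(a ^ b ^ c for a, b, c in zip(disk1, disk2, disk3))

# disk1 drops off the bus; its bytes are recomputed from parity + survivors.
emulated_disk1 = bytes(p ^ b ^ c for p, b, c in zip(parity, disk2, disk3))
assert emulated_disk1 == disk1
print("disk1 emulated from parity:", emulated_disk1.hex())
```

This is also why the "zero redundancy" warning holds: with disk1 already being reconstructed from parity, losing any second drive leaves the XOR equation unsolvable.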


Timeline

| When | What |
| --- | --- |
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
| Feb 24 | Motherboard replaced. Dead drive confirmed still dead on new hardware, new SATA port assignment. Drive is mechanically failed (clicking heads) |
| Feb 24 | New cache pool created: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1. Docker migrated from HDD loopback to NVMe ZFS |
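The failure signature above can be pulled out of the kernel log with a single grep. The sample lines below are illustrative stand-ins matching the messages seen in this incident; on the live server you would pipe `dmesg` (or read the syslog) instead of the sample file:

```shell
# Filter kernel log for the ATA failure sequence seen in this incident.
# Sample lines are illustrative; on the server, use `dmesg` as the input.
cat > /tmp/dmesg-sample.txt <<'EOF'
[  512.001] ata5.00: qc timeout (cmd 0xec)
[  515.120] ata5: hard resetting link
[  530.330] ata5: reset failed, giving up
[  530.331] ata5.00: disable device
EOF
grep -E 'ata[0-9]+(\.[0-9]+)?: (qc timeout|reset failed|disable device)' /tmp/dmesg-sample.txt
```

Seeing `disable device` after repeated failed resets is the kernel permanently giving up on the drive, which matches the `DISK_DSBL` state md recorded two minutes later.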

Drive Details

| Field | Value |
| --- | --- |
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 sectors) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on `md1p1`) |
| Last known device | `sdc` |
| Accumulated md errors | 213 |

Current State

  • Array: STARTED, degraded — disk1 emulated from parity (sdb)
  • ZFS pool disk1: ONLINE, 0 errors, mounted on md1p1 (parity reconstruction)
  • Parity drive (sdb, serial 7PHBNYZC): DISK_OK, 0 errors
  • All services: Running normally (Docker containers, VMs)
  • Risk: If parity drive fails, data is unrecoverable

Diagnosis

  • Drive fails on multiple SATA ports → not a port/cable issue
  • Clicking noise on boot → mechanical head failure
  • dmesg shows link responds but device never becomes ready → drive electronics partially functional, platters/heads dead
  • Drive is beyond DIY repair

Root Cause

Mechanical failure of the hard drive (clicking indicates a head crash or seized actuator). Not related to the cache drive migration that happened around the same time, as confirmed by the syslog showing a clean SATA link failure.


Recovery Plan

Step 1: Get Replacement Drive

Step 2: Install & Rebuild

  1. Power off the server
  2. Remove dead drive, install replacement in any SATA port
  3. Boot Unraid
  4. Go to Main → click on Disk 1 (will show as "Not installed" or unmapped)
  5. Stop the array
  6. Assign the new drive to the Disk 1 slot
  7. Start the array — Unraid will prompt to rebuild from parity
  8. Rebuild will take many hours for 10TB — do NOT interrupt
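To set expectations for step 8, a back-of-the-envelope estimate. The ~150 MB/s sustained speed is an assumption for a 10TB 7200 rpm drive; actual rebuild speed varies with concurrent array activity and slows toward the inner disk zones:

```python
# Rough rebuild-time estimate for a 10 TB drive.
# 150 MB/s is an assumed average, not a measured figure.
capacity_bytes = 10e12
speed_bytes_per_s = 150e6
hours = capacity_bytes / speed_bytes_per_s / 3600
print(f"estimated rebuild time: {hours:.1f} h")  # roughly 18.5 h
```

So plan for most of a day of continuous disk activity, during which the array remains usable but slow.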

Step 3: Post-Rebuild

  1. Verify ZFS pool disk1 is healthy: zpool status disk1
  2. Run parity check from Unraid UI
  3. Run SMART extended test on new drive: smartctl -t long /dev/sdX
  4. Verify all ZFS datasets are intact
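The SMART check in step 3 can be verified mechanically once the extended test finishes. The heredoc below is a hypothetical excerpt of `smartctl -l selftest` output used only to show the check; on the server, read the real self-test log for the new drive:

```shell
# Confirm the extended self-test completed cleanly. Replace the sample heredoc
# with `smartctl -l selftest /dev/sdX` output on the real drive; the hour
# count and slot number here are hypothetical.
cat > /tmp/selftest.txt <<'EOF'
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      1337         -
EOF
grep -q 'Completed without error' /tmp/selftest.txt && echo "SMART self-test: PASS"
```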

Notes

  • Server is safe to run in degraded mode indefinitely, but it has no parity protection until disk1 is rebuilt
  • Avoid heavy writes if possible to reduce risk to parity drive
  • New cache pool (3x Samsung 990 EVO Plus 1TB, ZFS RAIDZ1) now hosts all Docker containers
  • Old docker.img loopback deleted from disk1 (200GB freed)
  • Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair