From bf6a62a275364814fad604cab0aba649cda89fa6 Mon Sep 17 00:00:00 2001
From: Kaloyan Danchev
Date: Sun, 22 Feb 2026 17:54:23 +0200
Subject: [PATCH] Add incident report: disk1 hardware failure (clicking/head crash)

HGST Ultrastar 10TB drive (serial 2TKK3K1D) failed on Feb 18. Array
running degraded on parity emulation. Recovery plan documented.

Co-Authored-By: Claude Opus 4.6
---
 .../2026-02-20-disk1-hardware-failure.md | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 docs/incidents/2026-02-20-disk1-hardware-failure.md

diff --git a/docs/incidents/2026-02-20-disk1-hardware-failure.md b/docs/incidents/2026-02-20-disk1-hardware-failure.md
new file mode 100644
index 0000000..6bb8c0f
--- /dev/null
+++ b/docs/incidents/2026-02-20-disk1-hardware-failure.md
@@ -0,0 +1,88 @@
+# Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)
+
+**Date:** 2026-02-20
+**Severity:** P2 - Degraded (no redundancy)
+**Status:** Open — awaiting replacement drive
+**Affected:** XTRM-U (Unraid NAS) — disk1 (data drive)
+
+---
+
+## Summary
+
+disk1 (10TB HGST Ultrastar HUH721010ALE601, serial `2TKK3K1D`) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in **degraded/emulated mode**, reconstructing disk1 data from parity on the fly. All data is intact, but there is **zero redundancy**.
+
+---
+
+## Timeline
+
+| When | What |
+|------|------|
+| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
+| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
+| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
+| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
+
+## Drive Details
+
+| Field | Value |
+|-------|-------|
+| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
+| Serial | 2TKK3K1D |
+| Capacity | 10TB (9766436812 KiB — Unraid reports sizes in 1K blocks, not 512-byte sectors) |
+| Array slot | disk1 (slot 1) |
+| Filesystem | ZFS (on md1p1) |
+| Last known device | sdc |
+| Accumulated md errors | 213 |
+
+## Current State
+
+- **Array**: STARTED, degraded — disk1 emulated from parity (`sdb`)
+- **ZFS pool `disk1`**: ONLINE, 0 errors, mounted on `md1p1` (parity reconstruction)
+- **Parity drive** (`sdb`, serial `7PHBNYZC`): DISK_OK, 0 errors
+- **All services**: Running normally (Docker containers, VMs)
+- **Risk**: If the parity drive fails, the data is **unrecoverable**
+
+## Diagnosis
+
+- Drive fails on multiple SATA ports → not a port or cable issue
+- Clicking noise on boot → mechanical head failure
+- dmesg shows the link responds but the device never becomes ready → drive electronics partially functional, platters/heads dead
+- Drive is beyond DIY repair
+
+## Root Cause
+
+Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to the cache drive migration that happened around the same time — the syslog shows a clean SATA link failure, independent of that work.
+
+---
+
+## Recovery Plan
+
+### Step 1: Get Replacement Drive
+- Must be 10TB or larger, but no larger than the parity drive (Unraid requires parity to be at least as large as every data drive)
+- Check WD warranty for serial `2TKK3K1D` (model HUH721010ALE601) at https://support-en.wd.com/app/warrantycheck
+- Any 3.5" SATA drive works (it doesn't need to match the failed model)
+
+### Step 2: Install & Rebuild
+1. Power off the server
+2. Remove the dead drive, install the replacement in any SATA port
+3. Boot Unraid
+4. Go to **Main** → click on **Disk 1** (it will show as "Not installed" or unmapped)
+5. Stop the array
+6. Assign the new drive to the **Disk 1** slot
+7. Start the array — Unraid will prompt to **rebuild** from parity
+8. The rebuild will take many hours for 10TB — do NOT interrupt it
+
+### Step 3: Post-Rebuild
+1. Verify ZFS pool `disk1` is healthy: `zpool status disk1`
+2. Run a parity check from the Unraid UI
+3. Run a SMART extended test on the new drive: `smartctl -t long /dev/sdX`
+4. Verify all ZFS datasets are intact
+
+---
+
+## Notes
+
+- The server is safe to run in degraded mode indefinitely, just without parity protection
+- Avoid heavy writes if possible to reduce risk to the parity drive
+- The two NVMe SSDs (cache pool, ZFS mirror) are unaffected
+- Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair
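
The "many hours" warning in Step 2 can be sanity-checked with simple arithmetic. This is a rough sketch, not a measurement: the ~150 MB/s average throughput is an assumption for a 7200 rpm helium drive (real speed varies across the platter, and the parity check in Step 3 will need a similar window since it reads the full array again).

```shell
# Back-of-envelope rebuild-time estimate for the 10TB replacement.
# speed_mb_s=150 is an ASSUMED average throughput, not measured on this system.
capacity_mb=$((10 * 1000 * 1000))   # 10 TB expressed in decimal MB
speed_mb_s=150
hours=$(( capacity_mb / speed_mb_s / 3600 ))
echo "Estimated rebuild time: ~${hours} hours"
# prints: Estimated rebuild time: ~18 hours
```

Halve the assumed speed and the estimate doubles, so plan for an overnight-to-full-day run either way.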