From bf6a62a275364814fad604cab0aba649cda89fa6 Mon Sep 17 00:00:00 2001
From: Kaloyan Danchev
Date: Sun, 22 Feb 2026 17:54:23 +0200
Subject: [PATCH] Add incident report: disk1 hardware failure (clicking/head crash)

HGST Ultrastar 10TB drive (serial 2TKK3K1D) failed on Feb 18. Array
running degraded on parity emulation. Recovery plan documented.

Co-Authored-By: Claude Opus 4.6
---
 .../2026-02-20-disk1-hardware-failure.md | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100644 docs/incidents/2026-02-20-disk1-hardware-failure.md

diff --git a/docs/incidents/2026-02-20-disk1-hardware-failure.md b/docs/incidents/2026-02-20-disk1-hardware-failure.md
new file mode 100644
index 0000000..6bb8c0f
--- /dev/null
+++ b/docs/incidents/2026-02-20-disk1-hardware-failure.md
@@ -0,0 +1,88 @@
+# Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)
+
+**Date:** 2026-02-20
+**Severity:** P2 - Degraded (no redundancy)
+**Status:** Open — awaiting replacement drive
+**Affected:** XTRM-U (Unraid NAS) — disk1 (data drive)
+
+---
+
+## Summary
+
+disk1 (10TB HGST Ultrastar HUH721010ALE601, serial `2TKK3K1D`) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in **degraded/emulated mode**, reconstructing disk1 data from parity on the fly. All data is intact, but there is **zero redundancy**.
+
+---
+
+## Timeline
+
+| When | What |
+|------|------|
+| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
+| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
+| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
+| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
+
+## Drive Details
+
+| Field | Value |
+|-------|-------|
+| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
+| Serial | 2TKK3K1D |
+| Capacity | 10TB (9766436812 KiB — Unraid reports sizes in 1K blocks, not 512-byte sectors) |
+| Array slot | disk1 (slot 1) |
+| Filesystem | ZFS (on md1p1) |
+| Last known device | sdc |
+| Accumulated md errors | 213 |
+
+## Current State
+
+- **Array**: STARTED, degraded — disk1 emulated from parity (`sdb`)
+- **ZFS pool `disk1`**: ONLINE, 0 errors, mounted on `md1p1` (parity reconstruction)
+- **Parity drive** (`sdb`, serial `7PHBNYZC`): DISK_OK, 0 errors
+- **All services**: Running normally (Docker containers, VMs)
+- **Risk**: If the parity drive fails, the data is **unrecoverable**
+
+## Diagnosis
+
+- Drive fails on multiple SATA ports → not a port or cable issue
+- Clicking noise on boot → mechanical head failure
+- dmesg shows the link responds but the device never becomes ready → drive electronics partially functional, platters/heads dead
+- Drive is beyond DIY repair
+
+## Root Cause
+
+Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to the cache drive migration that happened around the same time — the syslog shows a clean SATA link failure, independent of that work.
+
+---
+
+## Recovery Plan
+
+### Step 1: Get Replacement Drive
+- Must be 10TB or larger, but no larger than the parity drive (Unraid requires parity to be at least as large as every data drive)
+- Check WD warranty for serial `2TKK3K1D` (model HUH721010ALE601) at https://support-en.wd.com/app/warrantycheck
+- Any 3.5" SATA drive works (it doesn't need to match the failed model)
+
+### Step 2: Install & Rebuild
+1. Power off the server
+2. Remove the dead drive, install the replacement in any SATA port
+3. Boot Unraid
+4. Go to **Main** → click on **Disk 1** (it will show as "Not installed" or unmapped)
+5. Stop the array
+6. Assign the new drive to the **Disk 1** slot
+7. Start the array — Unraid will prompt to **rebuild** from parity
+8. The rebuild will take many hours for 10TB — do NOT interrupt it
+
+### Step 3: Post-Rebuild
+1. Verify ZFS pool `disk1` is healthy: `zpool status disk1`
+2. Run a parity check from the Unraid UI
+3. Run a SMART extended test on the new drive: `smartctl -t long /dev/sdX`
+4. Verify all ZFS datasets are intact
+
+---
+
+## Notes
+
+- The server is safe to run in degraded mode indefinitely, just without parity protection
+- Avoid heavy writes if possible to reduce risk to the parity drive
+- The two NVMe SSDs (cache pool, ZFS mirror) are unaffected
+- Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair
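
The "many hours" warning in Step 2 can be sanity-checked with simple arithmetic. This is a rough sketch, not a measurement: the ~150 MB/s average throughput is an assumption for a 7200 rpm helium drive (real speed varies across the platter, and the parity check in Step 3 will need a similar window since it reads the full array again).

```shell
# Back-of-envelope rebuild-time estimate for the 10TB replacement.
# speed_mb_s=150 is an ASSUMED average throughput, not measured on this system.
capacity_mb=$((10 * 1000 * 1000))   # 10 TB expressed in decimal MB
speed_mb_s=150
hours=$(( capacity_mb / speed_mb_s / 3600 ))
echo "Estimated rebuild time: ~${hours} hours"
# prints: Estimated rebuild time: ~18 hours
```

Halve the assumed speed and the estimate doubles, so plan for an overnight-to-full-day run either way.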