# Incident: Disk1 Hardware Failure (Clicking / SATA Link Failure)

**Date:** 2026-02-20
**Severity:** P2 - Degraded (no redundancy)
**Status:** Open — awaiting replacement drive (motherboard replaced, NVMe cache pool added Feb 24)
**Affected:** XTRM-U (Unraid NAS) — disk1 (data drive)

---

## Summary

disk1 (10TB HGST Ultrastar HUH721010ALE601, serial `2TKK3K1D`) has physically failed. The drive dropped off the SATA bus on Feb 18 at 19:15 and is now exhibiting clicking (head failure). The Unraid md array is running in **degraded/emulated mode**, reconstructing disk1's data from parity on the fly. All data is intact, but there is **zero redundancy**.

---

## Timeline

| When | What |
|------|------|
| Feb 18 ~19:15 | `ata5: qc timeout` → multiple hard/soft resets → `reset failed, giving up` → `ata5.00: disable device` |
| Feb 18 19:17 | `super.dat` updated — md array marked disk1 as `DISK_DSBL` (213 errors) |
| Feb 20 13:14 | Investigation started. `sdc` completely absent from `/dev/`. ZFS pool `disk1` running on emulated `md1p1` with 0 errors |
| Feb 20 ~13:30 | Server rebooted, disk moved to a new SATA port (ata5 → ata6). Same failure: `ata6: reset failed, giving up`. Clicking noise confirmed |
| Feb 24 | Motherboard replaced. Dead drive confirmed still dead on new hardware (new SATA port assignment). Drive is mechanically failed (clicking heads) |
| Feb 24 | New cache pool created: 3x Samsung 990 EVO Plus 1TB NVMe, ZFS RAIDZ1 |
| Feb 24 | Docker migrated from the HDD loopback image to the NVMe ZFS pool |

## Drive Details

| Field | Value |
|-------|-------|
| Model | HUH721010ALE601 (HGST/WD Ultrastar He10) |
| Serial | 2TKK3K1D |
| Capacity | 10TB (9766436812 sectors) |
| Array slot | disk1 (slot 1) |
| Filesystem | ZFS (on md1p1) |
| Last known device | sdc |
| Accumulated md errors | 213 |

## Current State

- **Array**: STARTED, degraded — disk1 emulated from parity (`sdb`)
- **ZFS pool `disk1`**: ONLINE, 0 errors, mounted on `md1p1` (parity reconstruction)
- **Parity drive** (`sdb`, serial `7PHBNYZC`): DISK_OK, 0 errors
- **All services**: Running normally (Docker containers, VMs)
- **Risk**: If the parity drive fails, the data is **unrecoverable**

## Diagnosis

- Drive fails on multiple SATA ports → not a port/cable issue
- Clicking noise on boot → mechanical head failure
- dmesg shows the link responds but the device never becomes ready → drive electronics partially functional, platters/heads dead
- Drive is beyond DIY repair

## Root Cause

Mechanical failure of the hard drive (clicking = head crash or seized actuator). Not related to the cache drive migration that happened around the same time — confirmed by syslog showing a clean SATA link failure.

---

## Recovery Plan

### Step 1: Get Replacement Drive

- Must be 10TB or larger
- Check WD warranty: serial `HUH721010ALE601_2TKK3K1D` at https://support-en.wd.com/app/warrantycheck
- Any 3.5" SATA drive works (it doesn't need to match the model)

### Step 2: Install & Rebuild

1. Power off the server
2. Remove the dead drive, install the replacement in any SATA port
3. Boot Unraid
4. Go to **Main** → click on **Disk 1** (will show as "Not installed" or unmapped)
5. Stop the array
6. Assign the new drive to the **Disk 1** slot
7. Start the array — Unraid will prompt to **rebuild** from parity
8. The rebuild will take many hours for 10TB — do NOT interrupt it

### Step 3: Post-Rebuild

1. Verify ZFS pool `disk1` is healthy: `zpool status disk1`
2. Run a parity check from the Unraid UI
3. Run a SMART extended test on the new drive: `smartctl -t long /dev/sdX`
4. Verify all ZFS datasets are intact

---

## Notes

- The server is safe to run in degraded mode indefinitely, just without parity protection
- Avoid heavy writes if possible to reduce risk to the parity drive
- The new cache pool (3x Samsung 990 EVO Plus 1TB, ZFS RAIDZ1) now hosts all Docker containers
- The old docker.img loopback was deleted from disk1 (200GB freed)
- Since disk1 uses ZFS on md, the rebuild reconstructs the raw block device — ZFS doesn't need any separate repair
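
The failure signature in the timeline (`qc timeout` → `reset failed, giving up` → `disable device`) is distinctive enough to detect automatically in syslog. A minimal sketch, assuming kernel ATA messages in their usual `ataN:` / `ataN.00:` form (the sample lines are the messages quoted in this incident; the function name is mine):

```python
import re

# Fatal stages of a SATA link failure, as seen in this incident's dmesg.
FATAL = ("reset failed, giving up", "disable device")

def failed_ata_ports(log_lines):
    """Return the set of ATA ports (e.g. 'ata5') that hit a fatal failure."""
    ports = set()
    for line in log_lines:
        # Match "ata5: ..." and "ata5.00: ..." alike; keep only the port.
        m = re.search(r"\b(ata\d+)(?:\.\d+)?:\s*(.+)", line)
        if m and any(s in m.group(2) for s in FATAL):
            ports.add(m.group(1))
    return ports

# Messages quoted from this incident's timeline:
sample = [
    "ata5: qc timeout (cmd 0xec)",
    "ata5: reset failed, giving up",
    "ata5.00: disable device",
    "ata6: reset failed, giving up",
]
print(sorted(failed_ata_ports(sample)))  # ['ata5', 'ata6']
```

Seeing the same port-independent failure on both `ata5` and `ata6` is exactly what ruled out a cable/port problem in the diagnosis.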
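
The "many hours" estimate in Step 2 can be ballparked from drive size and sustained rebuild speed. A rough calculation, assuming a sustained rate somewhere between 100 and 200 MB/s (typical for a 7200 rpm 10TB drive; the real rate varies with controller load and outer-vs-inner track position):

```python
def rebuild_hours(capacity_tb, mb_per_s):
    """Hours to read/write the full drive once at a sustained rate."""
    total_mb = capacity_tb * 1_000_000  # drives are sold in decimal TB: 1 TB = 1e6 MB
    return total_mb / mb_per_s / 3600

# 10TB drive, optimistic vs. conservative sustained rates:
print(f"{rebuild_hours(10, 200):.1f} h")  # 13.9 h
print(f"{rebuild_hours(10, 100):.1f} h")  # 27.8 h
```

So plan for roughly half a day to a full day of rebuild, during which the array has no protection at all.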
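
The Step 3 `zpool status disk1` check can also be scripted for post-rebuild monitoring. A sketch that parses the `state:` field from the command's output; the sample text below is the standard output shape, not a capture from this server:

```python
import re

def pool_state(zpool_status_output):
    """Extract the 'state:' field from `zpool status <pool>` output."""
    m = re.search(r"^\s*state:\s*(\S+)", zpool_status_output, re.MULTILINE)
    return m.group(1) if m else None

# Abbreviated `zpool status disk1` output (illustrative):
sample = """\
  pool: disk1
 state: ONLINE
errors: No known data errors
"""
print(pool_state(sample))  # ONLINE
```

Anything other than `ONLINE` (e.g. `DEGRADED` or `SUSPENDED`) after the rebuild would mean the pool needs attention before declaring the incident closed.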