From f56a43741d5cdd455e13a96e310fd417be4deded Mon Sep 17 00:00:00 2001
From: Kaloyan Danchev
Date: Sat, 31 Jan 2026 20:52:49 +0200
Subject: [PATCH] Update DNS failover with dual health check

- Added DNS resolution Netwatch monitor (type=dns) alongside ping
- Ping check: Fast container crash detection (10s interval)
- DNS check: Actual DNS functionality verification (30s interval)
- Either monitor failing triggers failover to Unraid
- Documented /32 routing fix for multi-container ECMP issue
- Updated troubleshooting section with routing checks

Co-Authored-By: Claude Opus 4.5
---
 docs/17-DNS-ADGUARD-FAILOVER.md | 93 ++++++++++++++++++++++++++-------
 1 file changed, 75 insertions(+), 18 deletions(-)

diff --git a/docs/17-DNS-ADGUARD-FAILOVER.md b/docs/17-DNS-ADGUARD-FAILOVER.md
index e760a46..e6b3f21 100644
--- a/docs/17-DNS-ADGUARD-FAILOVER.md
+++ b/docs/17-DNS-ADGUARD-FAILOVER.md
@@ -1,8 +1,9 @@
 # DNS Architecture with AdGuard Failover
 
 **Created:** 2026-01-31
+**Updated:** 2026-01-31
 **Status:** Implemented
-**Backup:** `adguard-failover-complete-2026-01-31.backup`
+**Backup:** `dns-dual-failover-2026-01-31.backup`
 
 ---
 
@@ -115,31 +116,46 @@ add chain=srcnat action=masquerade protocol=udp src-address=192.168.10.0/24 dst-
 
 ## Automatic Failover
 
-### How It Works
+### How It Works (Dual Health Check)
 
-1. **Netwatch** monitors 172.17.0.2 (container IP) every 10 seconds
-2. If ping fails for 3 seconds → status changes to "down"
-3. **dns-failover-down** script runs → NAT rules switch to Unraid
-4. When ping succeeds again → status changes to "up"
-5. **dns-failover-up** script runs → NAT rules switch back to MikroTik
+Two independent Netwatch monitors trigger failover:
+
+| Monitor | Type | What It Checks | Interval | Timeout |
+|---------|------|----------------|----------|---------|
+| Ping | simple | Container reachable | 10s | 3s |
+| DNS | dns | DNS queries work | 30s | 10s |
+
+**Either monitor failing triggers failover to Unraid.**
+
+### Failure Scenarios Covered
+
+| Scenario | Ping Check | DNS Check | Failover? |
+|----------|------------|-----------|-----------|
+| Container crashed | ✗ Fail | ✗ Fail | ✅ Yes |
+| Container stopped | ✗ Fail | ✗ Fail | ✅ Yes |
+| Network/routing issue | ✗ Fail | ✗ Fail | ✅ Yes |
+| Upstream DNS unreachable | ✓ Pass | ✗ Fail | ✅ Yes |
+| AdGuard overloaded | ✓ Pass | ✗ Fail | ✅ Yes |
+| Everything working | ✓ Pass | ✓ Pass | ❌ No |
 
 ### Failover Timeline
 
 | Event | Detection Time | Total Switchover |
 |-------|----------------|------------------|
-| Container stops | ~10-13 seconds | ~13-16 seconds |
-| Container recovers | ~10-13 seconds | ~13-16 seconds |
+| Container crash (ping) | ~10-13 seconds | ~13-16 seconds |
+| DNS failure (resolution) | ~30-40 seconds | ~33-43 seconds |
+| Recovery | ~10-30 seconds | Automatic |
 
 ### Failover Scripts
 
 ```routeros
-# dns-failover-down (runs when container is unreachable)
+# dns-failover-down (runs when either check fails)
 /system script add name=dns-failover-down dont-require-permissions=yes source={
     :log warning "DNS Failover: Switching to Unraid"
     /ip firewall nat set [find where comment~"VLAN" and comment~"redirect"] to-addresses=192.168.10.10 to-ports=3000
 }
 
-# dns-failover-up (runs when container is back)
+# dns-failover-up (runs when check recovers)
 /system script add name=dns-failover-up dont-require-permissions=yes source={
     :log info "DNS Failover: Switching back to MikroTik"
     /ip firewall nat set [find where comment~"VLAN" and comment~"redirect"] to-addresses=172.17.0.2 to-ports=53
@@ -149,10 +165,15 @@ add chain=srcnat action=masquerade protocol=udp src-address=192.168.10.0/24 dst-
 ### Netwatch Configuration
 
 ```routeros
-/tool netwatch add host=172.17.0.2 interval=10s timeout=3s \
-    up-script=dns-failover-up \
-    down-script=dns-failover-down \
+# Monitor 1: Ping check (fast crash detection)
+/tool netwatch add type=simple host=172.17.0.2 interval=10s timeout=3s \
+    up-script=dns-failover-up down-script=dns-failover-down \
     comment="AdGuard failover monitor"
+
+# Monitor 2: DNS resolution check (functional verification)
+/tool netwatch add type=dns host=google.com interval=30s timeout=10s \
+    up-script=dns-failover-up down-script=dns-failover-down \
+    comment="AdGuard DNS resolution check"
 ```
 
 ---
@@ -268,7 +289,9 @@ Both AdGuard instances use the same upstream:
 
 ```routeros
 /tool netwatch print
-# STATUS should be "up" normally
+# Both monitors should show STATUS=up normally
+# Monitor 0: Ping check
+# Monitor 1: DNS resolution check
 ```
 
 ### Check Current DNS Target
@@ -304,6 +327,21 @@ Both AdGuard instances use the same upstream:
 2. Check netwatch status: `/tool netwatch print`
 3. Test DNS directly: `:resolve google.com server=172.17.0.2`
 4. Check NAT rules: `/ip firewall nat print where comment~"DNS"`
+5. **Check /32 routes exist:** `/ip route print where dst-address~"172.17.0.[23]"`
+6. **Ping container:** `/ping 172.17.0.2 count=3`
+
+### Container Reachable but DNS Fails
+
+If ping works but DNS queries timeout:
+
+1. Check container can reach upstream: Look for timeout errors in logs
+2. Verify /32 routes: Missing routes cause ECMP issues
+3. Check NAT masquerade: `/ip firewall nat print where comment~"Container"`
+4. Verify routes:
+```routeros
+/ip route print where dst-address~"172.17"
+# Should show /32 routes for each container IP
+```
 
 ### Sync Not Working
 
@@ -318,6 +356,22 @@ docker exec adguardhome-sync ping -c 2 192.168.10.1
 
 ---
 
+## Container Network Routing
+
+### Important: /32 Host Routes Required
+
+When running multiple containers on the same subnet (172.17.0.0/24), specific host routes are required to prevent ECMP routing issues:
+
+```routeros
+# Without these routes, return traffic may go to wrong container
+/ip route add dst-address=172.17.0.2/32 gateway=veth-adguard comment="AdGuard container - specific route"
+/ip route add dst-address=172.17.0.3/32 gateway=veth-tailscale comment="Tailscale container - specific route"
+```
+
+**Why this matters:** Each veth interface creates a /24 route. With multiple veth interfaces on the same subnet, RouterOS enables ECMP load balancing, sending return traffic to random interfaces.
+
+---
+
 ## Backups
 
 | Backup | Description |
@@ -325,12 +379,14 @@ docker exec adguardhome-sync ping -c 2 192.168.10.1
 | `pre-adguard-2026-01-31` | Before AdGuard setup |
 | `adguard-container-running-2026-01-31` | Container working, before NAT |
 | `adguard-synced-2026-01-31` | After sync configured |
-| `adguard-failover-complete-2026-01-31` | Final with failover |
+| `adguard-failover-complete-2026-01-31` | Single ping failover |
+| `routing-fix-complete-2026-01-31` | After /32 routing fix |
+| `dns-dual-failover-2026-01-31` | Dual health check (current) |
 
 ### Restore Command
 
 ```routeros
-/system backup load name=adguard-failover-complete-2026-01-31
+/system backup load name=dns-dual-failover-2026-01-31
 ```
 
 ---
@@ -354,5 +410,6 @@ docker exec adguardhome-sync ping -c 2 192.168.10.1
 
 ---
 
-**Document Version:** 1.0
+**Document Version:** 1.1
 **Last Updated:** 2026-01-31
+**Changes:** Added dual health check (ping + DNS), documented /32 routing fix
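
Review note: the detection and switchover figures in the Failover Timeline table appear to follow from the Netwatch `interval` and `timeout` values in this patch. The sketch below reproduces that arithmetic and the "either monitor failing" rule in Python. It is a model only, not RouterOS code: the `Monitor` class, the function names, and the 3-second switchover overhead (inferred from the gap between the table's two columns) are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Monitor:
    """Hypothetical model of one Netwatch entry (not a RouterOS API)."""
    name: str
    interval_s: int  # seconds between probes
    timeout_s: int   # seconds before a probe is counted as failed


# Values taken from the Netwatch configuration in the patch.
PING = Monitor("ping", interval_s=10, timeout_s=3)
DNS = Monitor("dns", interval_s=30, timeout_s=10)

# Assumption: inferred from the table (13-16 s minus 10-13 s), not measured.
SWITCH_OVERHEAD_S = 3


def detection_window(m: Monitor) -> tuple[int, int]:
    """Range quoted in the table: one interval up to interval plus timeout."""
    return (m.interval_s, m.interval_s + m.timeout_s)


def switchover_window(m: Monitor) -> tuple[int, int]:
    """Detection window shifted by the assumed switchover overhead."""
    low, high = detection_window(m)
    return (low + SWITCH_OVERHEAD_S, high + SWITCH_OVERHEAD_S)


def should_fail_over(status: dict[str, bool]) -> bool:
    """Fail over when any monitor reports down (True means up)."""
    return not all(status.values())


if __name__ == "__main__":
    for monitor in (PING, DNS):
        print(monitor.name, detection_window(monitor), switchover_window(monitor))
    print(should_fail_over({"ping": True, "dns": False}))
```

With the patch's values this yields the same ranges as the table for both the ping and the DNS monitor, which suggests the documented timings are derived rather than measured.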