Update DNS failover with dual health check
- Added DNS resolution Netwatch monitor (type=dns) alongside ping
- Ping check: Fast container crash detection (10s interval)
- DNS check: Actual DNS functionality verification (30s interval)
- Either monitor failing triggers failover to Unraid
- Documented /32 routing fix for multi-container ECMP issue
- Updated troubleshooting section with routing checks

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@@ -1,8 +1,9 @@
 # DNS Architecture with AdGuard Failover
 
 **Created:** 2026-01-31
+**Updated:** 2026-01-31
 **Status:** Implemented
-**Backup:** `adguard-failover-complete-2026-01-31.backup`
+**Backup:** `dns-dual-failover-2026-01-31.backup`
 
 ---
 
@@ -115,31 +116,46 @@ add chain=srcnat action=masquerade protocol=udp src-address=192.168.10.0/24 dst-
 
 ## Automatic Failover
 
-### How It Works
+### How It Works (Dual Health Check)
 
-1. **Netwatch** monitors 172.17.0.2 (container IP) every 10 seconds
-2. If ping fails for 3 seconds → status changes to "down"
-3. **dns-failover-down** script runs → NAT rules switch to Unraid
-4. When ping succeeds again → status changes to "up"
-5. **dns-failover-up** script runs → NAT rules switch back to MikroTik
+Two independent Netwatch monitors trigger failover:
+
+| Monitor | Type | What It Checks | Interval | Timeout |
+|---------|------|----------------|----------|---------|
+| Ping | simple | Container reachable | 10s | 3s |
+| DNS | dns | DNS queries work | 30s | 10s |
+
+**Either monitor failing triggers failover to Unraid.**
+
+### Failure Scenarios Covered
+
+| Scenario | Ping Check | DNS Check | Failover? |
+|----------|------------|-----------|-----------|
+| Container crashed | ✗ Fail | ✗ Fail | ✅ Yes |
+| Container stopped | ✗ Fail | ✗ Fail | ✅ Yes |
+| Network/routing issue | ✗ Fail | ✗ Fail | ✅ Yes |
+| Upstream DNS unreachable | ✓ Pass | ✗ Fail | ✅ Yes |
+| AdGuard overloaded | ✓ Pass | ✗ Fail | ✅ Yes |
+| Everything working | ✓ Pass | ✓ Pass | ❌ No |
 
 ### Failover Timeline
 
 | Event | Detection Time | Total Switchover |
 |-------|----------------|------------------|
-| Container stops | ~10-13 seconds | ~13-16 seconds |
-| Container recovers | ~10-13 seconds | ~13-16 seconds |
+| Container crash (ping) | ~10-13 seconds | ~13-16 seconds |
+| DNS failure (resolution) | ~30-40 seconds | ~33-43 seconds |
+| Recovery | ~10-30 seconds | Automatic |
 
 ### Failover Scripts
 
 ```routeros
-# dns-failover-down (runs when container is unreachable)
+# dns-failover-down (runs when either check fails)
 /system script add name=dns-failover-down dont-require-permissions=yes source={
 :log warning "DNS Failover: Switching to Unraid"
 /ip firewall nat set [find where comment~"VLAN" and comment~"redirect"] to-addresses=192.168.10.10 to-ports=3000
 }
 
-# dns-failover-up (runs when container is back)
+# dns-failover-up (runs when check recovers)
 /system script add name=dns-failover-up dont-require-permissions=yes source={
 :log info "DNS Failover: Switching back to MikroTik"
 /ip firewall nat set [find where comment~"VLAN" and comment~"redirect"] to-addresses=172.17.0.2 to-ports=53
@@ -149,10 +165,15 @@ add chain=srcnat action=masquerade protocol=udp src-address=192.168.10.0/24 dst-
 ### Netwatch Configuration
 
 ```routeros
-/tool netwatch add host=172.17.0.2 interval=10s timeout=3s \
-up-script=dns-failover-up \
-down-script=dns-failover-down \
+# Monitor 1: Ping check (fast crash detection)
+/tool netwatch add type=simple host=172.17.0.2 interval=10s timeout=3s \
+up-script=dns-failover-up down-script=dns-failover-down \
 comment="AdGuard failover monitor"
+
+# Monitor 2: DNS resolution check (functional verification)
+/tool netwatch add type=dns host=google.com interval=30s timeout=10s \
+up-script=dns-failover-up down-script=dns-failover-down \
+comment="AdGuard DNS resolution check"
 ```
 
 ---
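The intervals and timeouts of the two Netwatch monitors account for the upper ends of the ranges in the Failover Timeline table. The following is a minimal worked sketch in Python, not part of the commit. It assumes that a single failed probe is enough to mark a monitor as down, and that the roughly 3-second gap between detection and total switchover in the table is a fixed overhead. Both are assumptions read off the table rather than measured, and `worst_case_detection`, `timeline` and `SWITCH_OVERHEAD_S` are illustrative names.

```python
# Worst-case time for a Netwatch-style monitor to notice a failure.
# Assumption: one failed probe marks the host down, so the worst case is
# a full interval (failure right after a probe) plus the probe timeout.

SWITCH_OVERHEAD_S = 3  # assumed: inferred from the timeline table, not measured

# Interval and timeout values taken from the Netwatch configuration.
MONITORS = {
    "ping": {"interval_s": 10, "timeout_s": 3},   # type=simple
    "dns": {"interval_s": 30, "timeout_s": 10},   # type=dns
}


def worst_case_detection(interval_s: int, timeout_s: int) -> int:
    """Seconds from failure until the monitor reports 'down' (worst case)."""
    return interval_s + timeout_s


def timeline() -> dict:
    """Map monitor name to (worst-case detection, worst-case switchover)."""
    rows = {}
    for name, cfg in MONITORS.items():
        detect = worst_case_detection(cfg["interval_s"], cfg["timeout_s"])
        rows[name] = (detect, detect + SWITCH_OVERHEAD_S)
    return rows


if __name__ == "__main__":
    for name, (detect, total) in timeline().items():
        print(f"{name}: detect <= {detect}s, switchover <= {total}s")
```

Under these assumptions the ping monitor gives 13 s and 16 s, and the DNS monitor gives 40 s and 43 s, which line up with the upper bounds quoted in the table.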
@@ -268,7 +289,9 @@ Both AdGuard instances use the same upstream:
 
 ```routeros
 /tool netwatch print
-# STATUS should be "up" normally
+# Both monitors should show STATUS=up normally
+# Monitor 0: Ping check
+# Monitor 1: DNS resolution check
 ```
 
 ### Check Current DNS Target
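When reading the two monitor statuses, it helps to know which DNS target each combination should produce. Both monitors call the same `dns-failover-up` and `dns-failover-down` scripts, and each script sets the NAT target unconditionally, so the resulting state depends on which monitor changed last. The Python sketch below is a model only, not RouterOS code. It assumes the intended rule is "primary only while both checks pass", which matches the Failure Scenarios table but is not enforced by the scripts shown in this commit. `FailoverState` and `on_event` are hypothetical names.

```python
# Model of the intended dual-check rule: use the MikroTik container only
# while BOTH monitors are up; otherwise use the Unraid instance.
# Targets are taken from the failover scripts in this commit.

PRIMARY = "172.17.0.2:53"        # AdGuard container on MikroTik
BACKUP = "192.168.10.10:3000"    # AdGuard on Unraid


class FailoverState:
    """Tracks both monitors and derives the expected NAT target."""

    def __init__(self) -> None:
        self.status = {"ping": True, "dns": True}
        self.target = PRIMARY

    def on_event(self, monitor: str, is_up: bool) -> str:
        """Record a monitor transition and return the expected target."""
        if monitor not in self.status:
            raise KeyError(f"unknown monitor: {monitor}")
        self.status[monitor] = is_up
        # Guard (assumed behaviour): switch back only when every check passes.
        self.target = PRIMARY if all(self.status.values()) else BACKUP
        return self.target


if __name__ == "__main__":
    state = FailoverState()
    print(state.on_event("dns", False))   # DNS check fails
    print(state.on_event("ping", False))  # container also unreachable
    print(state.on_event("ping", True))   # ping recovers, DNS still failing
    print(state.on_event("dns", True))    # both healthy again
```

If the real NAT target differs from what this model predicts for the statuses shown by `/tool netwatch print`, one monitor's recovery may have switched DNS back while the other check was still failing.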
@@ -304,6 +327,21 @@ Both AdGuard instances use the same upstream:
 2. Check netwatch status: `/tool netwatch print`
 3. Test DNS directly: `:resolve google.com server=172.17.0.2`
 4. Check NAT rules: `/ip firewall nat print where comment~"DNS"`
+5. **Check /32 routes exist:** `/ip route print where dst-address~"172.17.0.[23]"`
+6. **Ping container:** `/ping 172.17.0.2 count=3`
+
+### Container Reachable but DNS Fails
+
+If ping works but DNS queries timeout:
+
+1. Check container can reach upstream: Look for timeout errors in logs
+2. Verify /32 routes: Missing routes cause ECMP issues
+3. Check NAT masquerade: `/ip firewall nat print where comment~"Container"`
+4. Verify routes:
+```routeros
+/ip route print where dst-address~"172.17"
+# Should show /32 routes for each container IP
+```
 
 ### Sync Not Working
 
@@ -318,6 +356,22 @@ docker exec adguardhome-sync ping -c 2 192.168.10.1
 
 ---
 
+## Container Network Routing
+
+### Important: /32 Host Routes Required
+
+When running multiple containers on the same subnet (172.17.0.0/24), specific host routes are required to prevent ECMP routing issues:
+
+```routeros
+# Without these routes, return traffic may go to wrong container
+/ip route add dst-address=172.17.0.2/32 gateway=veth-adguard comment="AdGuard container - specific route"
+/ip route add dst-address=172.17.0.3/32 gateway=veth-tailscale comment="Tailscale container - specific route"
+```
+
+**Why this matters:** Each veth interface creates a /24 route. With multiple veth interfaces on the same subnet, RouterOS enables ECMP load balancing, sending return traffic to random interfaces.
+
+---
+
 ## Backups
 
 | Backup | Description |
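The /32 fix relies on longest-prefix matching: a more specific route wins over the two equal /24 routes that would otherwise be balanced. The Python sketch below illustrates this with the standard `ipaddress` module. It is a simplified model that ignores route distance and other RouterOS selection rules, and `candidate_gateways` is a hypothetical helper. The interface names and addresses are the ones used in the routes added by this commit.

```python
import ipaddress

# Each veth interface contributes a connected /24 route for the same subnet.
CONNECTED = [
    ("172.17.0.0/24", "veth-adguard"),
    ("172.17.0.0/24", "veth-tailscale"),
]

# The /32 host routes documented in this commit.
HOST_ROUTES = [
    ("172.17.0.2/32", "veth-adguard"),
    ("172.17.0.3/32", "veth-tailscale"),
]


def candidate_gateways(routes, dst):
    """Return the gateways of the most specific routes that match dst.

    More than one gateway in the result means the routes tie, which is
    the situation where ECMP can pick either interface.
    """
    addr = ipaddress.ip_address(dst)
    matches = []
    for prefix, gateway in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, gateway))
    if not matches:
        return []
    longest = max(length for length, _ in matches)
    return sorted(gw for length, gw in matches if length == longest)


if __name__ == "__main__":
    print(candidate_gateways(CONNECTED, "172.17.0.2"))
    print(candidate_gateways(CONNECTED + HOST_ROUTES, "172.17.0.2"))
    print(candidate_gateways(CONNECTED + HOST_ROUTES, "172.17.0.3"))
```

With only the connected routes, both interfaces are candidates for 172.17.0.2. Once the /32 routes are present, each container IP resolves to a single interface.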
@@ -325,12 +379,14 @@ docker exec adguardhome-sync ping -c 2 192.168.10.1
 | `pre-adguard-2026-01-31` | Before AdGuard setup |
 | `adguard-container-running-2026-01-31` | Container working, before NAT |
 | `adguard-synced-2026-01-31` | After sync configured |
-| `adguard-failover-complete-2026-01-31` | Final with failover |
+| `adguard-failover-complete-2026-01-31` | Single ping failover |
+| `routing-fix-complete-2026-01-31` | After /32 routing fix |
+| `dns-dual-failover-2026-01-31` | Dual health check (current) |
 
 ### Restore Command
 
 ```routeros
-/system backup load name=adguard-failover-complete-2026-01-31
+/system backup load name=dns-dual-failover-2026-01-31
 ```
 
 ---
@@ -354,5 +410,6 @@ docker exec adguardhome-sync ping -c 2 192.168.10.1
 
 ---
 
-**Document Version:** 1.0
+**Document Version:** 1.1
 **Last Updated:** 2026-01-31
+**Changes:** Added dual health check (ping + DNS), documented /32 routing fix