infrastructure/docs/CHANGELOG.md
Kaloyan Danchev ecbce1ca94
Add VRRP failover infrastructure documentation (Nobara)
Deployed automatic failover for critical services (Traefik, Vaultwarden,
Authentik, AdGuard) from Unraid to Nobara workstation via Keepalived VRRP
with VIP 192.168.10.250. ~4 second failover time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 18:03:26 +02:00

Infrastructure Changelog

Purpose: Major infrastructure events only. Minor changes are in git commit messages.


2026-02-13

Failover Infrastructure Deployed

  • [SERVICE] Deployed Docker failover stack on XTRM-Nobara (Traefik, Vaultwarden, Authentik, AdGuard Home)
  • [SERVICE] Installed Docker CE 29.2.1 + Docker Compose 5.0.2 on Nobara
  • [SERVICE] Deployed Keepalived VRRP for automatic failover (VIP: 192.168.10.250)
  • [SERVICE] Unraid: Keepalived as Docker container (local/keepalived, MASTER priority 150)
  • [SERVICE] Nobara: Keepalived as systemd service (BACKUP priority 100)
  • [SERVICE] Replicated data: Vaultwarden DB, Authentik PostgreSQL dump (864MB), AdGuard config, Traefik certs
  • [NETWORK] Added VRRP protocol to Nobara firewall (firewalld)
  • [NETWORK] Configured SSH key auth to Nobara (id_ed25519_nobara, passwordless)
  • [NETWORK] Added SSH config alias: ssh nobara
  • [DOCS] Created 10-FAILOVER-NOBARA.md with full failover documentation
  • [DOCS] Updated 02-SERVICES-CRITICAL.md with failover section
  • [DOCS] Updated 04-HARDWARE-INVENTORY.md with XTRM-Nobara specs
  • [DOCS] Updated README.md and CLAUDE.md with Nobara references
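
The priority values above (Unraid 150, Nobara 100) decide which host answers on the VIP. The following is a minimal Python sketch of that selection, not the Keepalived implementation: the `VrrpNode` class and both function names are hypothetical, preemption is assumed to be left at its default, and the 1-second advertisement interval is an assumption that the entries above do not state. The timeout formula is the VRRPv2 one from RFC 3768.

```python
from dataclasses import dataclass


@dataclass
class VrrpNode:
    name: str
    priority: int
    alive: bool = True


def vip_holder(nodes):
    """Return the name of the live node with the highest priority, or None."""
    live = [node for node in nodes if node.alive]
    if not live:
        return None
    return max(live, key=lambda node: node.priority).name


def master_down_interval(priority, advert_interval=1.0):
    """Seconds a backup waits without advertisements before taking over.

    VRRPv2 (RFC 3768): 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256.
    """
    skew_time = (256 - priority) / 256
    return 3 * advert_interval + skew_time


# Priorities taken from the entries above; everything else is illustrative.
unraid = VrrpNode("Unraid", priority=150)
nobara = VrrpNode("Nobara", priority=100)
nodes = [unraid, nobara]

print(vip_holder(nodes))                      # Unraid
unraid.alive = False                          # simulate Unraid going down
print(vip_holder(nodes))                      # Nobara
print(master_down_interval(nobara.priority))  # 3.609375
```

Under these assumptions the backup would wait roughly 3.6 seconds before claiming 192.168.10.250, and the VIP would return to Unraid once it comes back with the higher priority.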

2026-02-06

Unraid Flash Drive Failure

  • [INCIDENT] Unraid flash drive crashing - migration procedure created
  • [DOCS] Created incident report with full flash drive replacement procedure

Documentation Restructure

  • [DOCS] Restructured docs/ from 23 files to clean 9-doc structure
  • [DOCS] Archived 12 completed VLAN migration project docs to archive/vlan-migration/
  • [DOCS] Archived 5 completed or superseded WIP docs (VLAN proposals, AI stack, Fossorial, DNS backup)
  • [DOCS] Created standing reference docs: 08-DNS-ARCHITECTURE.md, 09-TAILSCALE-VPN.md
  • [DOCS] Renamed docs to clean numbering (05-PORT-UTILIZATION, 06-VLAN-DEVICE-ASSIGNMENT, 07-WIFI-CAPSMAN-CONFIG)
  • [DOCS] Merged 00-CHANGELOG.md + 06-CHANGELOG.md → CHANGELOG.md
  • [DOCS] Updated all core docs with current VLAN IPs (192.168.31.x → 192.168.10.x)
  • [DOCS] Fixed CSS1 IP: 192.168.10.9 → 192.168.10.3, ZX1 IP: 192.168.10.7 → 192.168.10.4
  • [DOCS] Cleaned 06-VLAN-DEVICE-ASSIGNMENT.md: removed migration-era columns and sections, fixed VLAN 25 subnet
  • [DOCS] Updated README.md, CLAUDE.md, archive/README.md, wip/README.md

2026-02-01

WIP Documentation

  • [DOCS] Added KVM-SWITCH-MAC-NOBARA.md - Software KVM for Mac/Nobara switching
    • DDC/CI monitor control (Dell U3821DW) + HID++ Logitech peripheral switching
    • Scripts created on Mac at ~/scripts/

2026-01-31

Docker Cleanup

  • [DOCKER] Removed 18 unused images (~4.9 GB reclaimed)
  • [DOCKER] Removed 12 dangling images (old builds, untagged)
  • [DOCKER] Removed Slurpit stack images (warehouse, portal, scanner, scraper)
  • [DOCKER] Removed unused MongoDB 8 and MariaDB 11 images
  • [DOCKER] Removed 35 orphaned volumes (~1.15 GB reclaimed)
  • [DOCKER] Removed 28 anonymous dangling volumes
  • [DOCKER] Removed 6 nextcloud_aio_* volumes (from old AIO install)
  • [DOCKER] Removed orphaned redis-data volume
  • [DOCKER] Total reclaimed: ~6 GB

Kept (Stopped Containers)

  • open-webui, ollama (AI stack - for future use)
  • pgAdmin4 (database management)
  • diode-hydra-migrate, diode-auth-bootstrap (one-time migration jobs)

2026-01-27

VLAN Filtering Rolled Back

  • [VLAN] Enabled VLAN filtering - caused connectivity issues
  • [VLAN] ZX1 switch unreachable after activation (no management IP responding)
  • [VLAN] CSS326 traffic routing through ZX1 (not direct eth3 link)
  • [VLAN] Rolled back - VLAN filtering disabled
  • [CONFIG] Added eth4 (ZX1) to all VLAN tagged lists for future use
  • [STATUS] Network back to Legacy mode (192.168.31.0/24)
  • [TODO] Need physical access to ZX1 to configure VLAN trunking

Issues Identified

  • ZX1 switch not responding on documented IP 192.168.31.22
  • ZX1 may need VLAN trunk configuration before re-enabling filtering
  • All CSS326 traffic goes via ZX1→HAP1 instead of the direct CSS326→HAP1 link (possibly due to STP, unconfirmed)

2026-01-26

VLAN Filtering Activated

  • [VLAN] VLAN filtering enabled on MikroTik bridge - SUCCESSFUL
  • [VLAN] Internet connectivity verified (ping 1.1.1.1, google.com)
  • [VLAN] DNS resolution working through AdGuard
  • [VLAN] All previous fixes (DHCP DNS, firewall, NAT masquerade) working correctly
  • [STATUS] Network segmentation now ACTIVE

Local AI Stack Deployed

  • [AI] Deployed Ollama container with Intel GPU passthrough
  • [AI] Deployed Open WebUI at http://192.168.31.2:3080
  • [AI] Installed qwen2.5-coder:7b base model
  • [AI] Created custom unraid-assistant model with infrastructure knowledge
  • [AI] Created /usr/local/bin/ai terminal helper command
  • [AI] Stopped non-critical containers for RAM: karakeep, unimus, homarr, netdisco-*

VLAN Activation Attempt & Fixes

  • [VLAN] Configured CSS326 switch VLANs via SwOS web interface
  • [VLAN] Enabled VLAN filtering on MikroTik - caused internet outage
  • [VLAN] Rolled back VLAN filtering to restore connectivity
  • [VLAN] ROOT CAUSE IDENTIFIED: Multiple configuration issues

Issues Fixed

  • [FIX] DHCP DNS now points to each VLAN gateway instead of legacy 192.168.31.1
  • [FIX] Added DNS redirect rules for all VLANs (src-address-list=all-vlans)
  • [FIX] Added all VLAN interfaces to LAN firewall interface list
  • [FIX] Added NAT masquerade rules for VLAN traffic to AdGuard container
  • [BACKUP] MikroTik config saved before activation attempt

2026-01-25

VLAN Phase 1 Complete

  • [VLAN] Added VLAN 25 (Kids) - interface, IP, DHCP server, pool, bridge entry
  • [VLAN] Fixed VLAN 10 (Management) leases - correct IPs per device assignment doc
  • [VLAN] Fixed VLAN 30 (IoT) leases - all 14 devices with correct IPs
  • [VLAN] Added VLAN 25 (Kids) leases - 6 devices including XTRM-Ally
  • [VLAN] Added VLAN 50 (Guest) leases - 7 unknown devices
  • [VLAN] Added firewall rules for VLAN 25 (Kids → IoT, Legacy, DNS)
  • [VLAN] Total devices configured: 44

VLAN Implementation (Prepared)

  • [VLAN] Created 6 VLANs on MikroTik bridge (10, 20, 30, 35, 40, 50)
  • [VLAN] Configured IP addresses for all VLAN interfaces
  • [VLAN] Created DHCP servers and pools for each VLAN
  • [VLAN] Added static DHCP leases mapping MACs to VLAN IPs
  • [VLAN] Configured bridge VLAN table with tagged/untagged ports
  • [VLAN] Set WiFi ports PVID=20 (Trusted VLAN default)
  • [VLAN] Added inter-VLAN firewall rules (active)
  • [VLAN] VLAN filtering NOT YET ENABLED (pending CSS326 switch config)
  • [DOCS] Added docs/11-VLAN-IMPLEMENTATION.md
  • [SCRIPTS] Added scripts/mikrotik-vlan-setup.rsc and mikrotik-vlan-enable.rsc

DNS Configuration

  • [DNS] Updated both AdGuard instances to use Quad9 DoH
  • [DNS] Bootstrap DNS: 9.9.9.9, 149.112.112.112

MikroTik Containers

  • [CONTAINER] AdGuard Home container running on MikroTik (172.17.0.2)
  • [CONTAINER] Tailscale container configured (172.17.0.3)
  • [CONTAINER] Fixed Tailscale container authentication
  • [CONTAINER] Container bridge (containers-br) with NAT

Network

  • [NETWORK] Enabled CSS326 SFP1 port - 10G backbone link to ZX1 now active

Documentation

  • [DOCS] Created 02-PORT-UTILIZATION.md with ASCII port diagrams
  • [DOCS] Fixed ZX1 switch IP: 192.168.31.22 (was incorrectly documented as .7)

Incident

  • [INCIDENT] DNS outage after MikroTik restart - multiple root causes fixed:
    • NAT rules blocking AdGuard outbound DNS (added exception rules)
    • DHCP pushing wrong DNS (8.8.8.8 → 192.168.31.1)
    • NAT redirect pointing to wrong IP/port (172.17.0.5:5355 → 192.168.31.4:53)
    • Asymmetric routing (added srcnat masquerade for DNS redirect)
  • [SERVICE] Removed MikroTik AdGuard Home container (storage/overlay errors)
  • [SERVICE] Removed MikroTik Tailscale container (root directory missing)
  • [SERVICE] Removed Pi-hole/Unbound leftovers from MikroTik (veth, mounts, envs)
  • [NETWORK] Consolidated DNS architecture: MikroTik → Unraid AdGuard (192.168.31.4) only
  • [DOCS] Created incident reports in docs/incidents/
  • [DOCS] Restructured documentation - consolidated into 5 core docs + archive
  • [NETBOX] Added shelf devices for rack organization (U9, U7, U3)

2026-01-24

  • [NETBOX] Standardized device names to NetBox convention (HAP1, CSS1, ZX1)
  • [DOCS] Created NETWORK-PHYSICAL-MAP.md with complete port maps

2026-01-23

  • [SERVICE] Deployed Diode network discovery stack
  • [SERVICE] Removed Slurp'it (replaced by Diode + NetDisco)
  • [SERVICE] Consolidated NetBox Redis to shared instance
  • [SERVICE] Removed redundant DNS services (Unbound, DoH-Server, stunnel-dot)

2026-01-22

  • [SERVICE] Migrated NetBox to shared PostgreSQL 17
  • [SERVICE] Deployed AdGuard Home on MikroTik (primary DNS)
  • [SERVICE] Deployed AdGuard Home on Unraid (secondary DNS)
  • [SERVICE] Removed Pi-hole (replaced by AdGuard Home)
  • [DOCS] Created INFRASTRUCTURE-DIAGRAM.md

2026-01-21

  • [BACKUP] Configured Rclone sync to Google Drive

2026-01-19

  • [SERVICE] Deployed NetBox IPAM/DCIM
  • [SERVICE] Deployed NetDisco network discovery
  • [NETWORK] Enabled SNMP on all MikroTik devices

2026-01-18

  • [SERVICE] Deployed Gitea git server
  • [SERVICE] Deployed Woodpecker CI
  • [NETWORK] Configured CAPsMAN on HAP1
  • [WIRELESS] Added CAP to CAPsMAN management

2026-01-17

  • [SERVICE] Deployed Portainer CE

Previous History

For detailed history before 2026-01-17, see archived changelogs in archive/.


Format Guide

### YYYY-MM-DD
- **[CATEGORY]** Brief description

Categories:
- [DEVICE] - Hardware added/removed/changed
- [SERVICE] - Container/service deployed/removed
- [NETWORK] - Network topology/config changes
- [WIRELESS] - WiFi/CAPsMAN changes
- [BACKUP] - Backup configuration
- [DOCS] - Major documentation changes
- [INCIDENT] - Outages and fixes
- [VLAN] - VLAN configuration changes
- [DOCKER] - Docker maintenance
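
A format like this can be checked mechanically. The sketch below is one possible approach in Python, not tooling that exists in this repository: `check_changelog` and the two regular expressions are hypothetical names, and the check covers only the shape of headings and entries, not whether a date is valid. The category set is copied from the list above, so tags used elsewhere in this changelog but not listed there (for example [AI] or [FIX]) are reported as unknown.

```python
import re

# Categories copied from the Format Guide list above.
KNOWN_CATEGORIES = {
    "DEVICE", "SERVICE", "NETWORK", "WIRELESS", "BACKUP",
    "DOCS", "INCIDENT", "VLAN", "DOCKER",
}

DATE_RE = re.compile(r"^### (\d{4})-(\d{2})-(\d{2})$")
ENTRY_RE = re.compile(r"^- \*\*\[([A-Z]+)\]\*\* (\S.*)$")


def check_changelog(lines):
    """Return a list of (line_number, message) for lines that break the format."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if line.startswith("### "):
            if not DATE_RE.match(line):
                problems.append((number, "heading is not ### YYYY-MM-DD"))
        elif line.startswith("- "):
            match = ENTRY_RE.match(line)
            if match is None:
                problems.append((number, "entry is not - **[CATEGORY]** description"))
            elif match.group(1) not in KNOWN_CATEGORIES:
                problems.append((number, f"unknown category [{match.group(1)}]"))
    return problems


sample = [
    "### 2026-02-13",
    "- **[SERVICE]** Deployed Keepalived VRRP for automatic failover",
    "- **[AI]** Deployed Ollama container with Intel GPU passthrough",
    "- Scripts created on Mac at ~/scripts/",
]
for number, message in check_changelog(sample):
    print(f"line {number}: {message}")
```

Run against the sample, this reports line 3 for the unlisted [AI] tag and line 4 for the missing category marker.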