# Local AI Stack on Unraid
**Status:** ✅ Deployed
**Last Updated:** 2026-01-26
---
## Current Deployment
| Component | Status | URL/Port |
|-----------|--------|----------|
| Ollama | ✅ Running | http://192.168.31.2:11434 |
| Open WebUI | ✅ Running | http://192.168.31.2:3080 |
| Intel GPU | ✅ Enabled | /dev/dri passthrough |
### Model Installed
| Model | Size | Type |
|-------|------|------|
| qwen2.5-coder:7b | 4.7 GB | Code-focused LLM |
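To confirm what the instance is actually serving, installed models can be listed either inside the container or from the `/api/tags` endpoint. The `sed` extraction is a rough fallback for hosts without `jq`; it assumes the container name `ollama` and the IP from the table above.

```shell
# List installed models inside the container...
docker exec ollama ollama list

# ...or pull the model names out of the /api/tags JSON
# (rough extraction; use jq if it is installed)
curl -s http://192.168.31.2:11434/api/tags \
  | tr ',' '\n' | sed -n 's/.*"name":"\([^"]*\)".*/\1/p'
```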
---
## Hardware
| Component | Spec |
|-----------|------|
| CPU | Intel N100 |
| RAM | 16 GB (shared with Docker) |
| GPU | Intel UHD (iGPU via /dev/dri) |
| Storage | 1.7TB free on array |
---
## Containers Stopped for RAM
To free ~4.8 GB for AI workloads, these non-critical containers were stopped:
| Container | RAM Freed | Purpose |
|-----------|-----------|---------|
| karakeep | 1.68 GB | Bookmark manager |
| unimus | 1.62 GB | Network backup |
| homarr | 686 MB | Dashboard |
| netdisco-web | 531 MB | Network discovery UI |
| netdisco-backend | 291 MB | Network discovery |
To restart if needed:
```bash
docker start karakeep unimus homarr netdisco-web netdisco-backend
```
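As a quick sanity check, the per-container figures in the table do add up to the quoted ~4.8 GB (treating the MB entries as GB/1000):

```shell
# Sum the freed-RAM values from the table (MB entries converted to GB)
awk 'BEGIN { printf "Freed ~%.2f GB\n", 1.68 + 1.62 + 0.686 + 0.531 + 0.291 }'
# prints "Freed ~4.81 GB"
```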
---
## Docker Configuration
### Ollama
```bash
docker run -d \
  --name ollama \
  --restart unless-stopped \
  --device /dev/dri \
  -v /mnt/user/appdata/ollama:/root/.ollama \
  -p 11434:11434 \
  ollama/ollama
```
### Open WebUI
```bash
docker run -d \
  --name open-webui \
  --restart unless-stopped \
  -p 3080:8080 \
  -e OLLAMA_BASE_URL=http://192.168.31.2:11434 \
  -v /mnt/user/appdata/open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```
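Once both containers are up, a quick reachability probe (IP and ports taken from the deployment table; each URL should report an HTTP status, typically 200) can be sketched as:

```shell
# Probe both services; prints the HTTP status code for each URL
for url in http://192.168.31.2:11434/api/tags http://192.168.31.2:3080; do
  printf '%s -> HTTP %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
done
```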
---
## Usage
### Web Interface
1. Open http://192.168.31.2:3080
2. Create admin account on first visit
3. Select `qwen2.5-coder:7b` model
4. Start chatting
### API Access
```bash
# List models
curl http://192.168.31.2:11434/api/tags

# Generate response (example)
curl http://192.168.31.2:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "Hello"}'
```
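Note that `/api/generate` streams its output as one JSON object per chunk by default; passing `"stream": false` returns a single JSON object whose `response` field holds the full completion, which is easier to script against. The `sed` extraction below is a rough sketch (it breaks on escaped quotes in the response; prefer `jq` if installed):

```shell
# Non-streaming generate: one JSON object back instead of NDJSON chunks
curl -s http://192.168.31.2:11434/api/generate \
  -d '{"model": "qwen2.5-coder:7b", "prompt": "Hello", "stream": false}' \
  | sed -n 's/.*"response":"\([^"]*\)".*/\1/p'
```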
---
## Future Considerations
- **More RAM:** With 32 GB+ RAM, could run larger models (14b, 32b)
- **Dedicated GPU:** Would significantly improve inference speed
- **Additional models:** Can pull more models as needed with `docker exec ollama ollama pull <model>`