portfolio/agentlab/homelab-build-plan.md
AgentLab d5ef629a54 feat: initial AgentLab portfolio content
Architecture, overview, homelab build plan, agent handbook, ADRs,
and agent operating rules. All sensitive operational details sanitized
(real IPs, hostnames, client names replaced with generic placeholders).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 04:52:42 +00:00

# Home Lab Build Plan — HP Z640
## Hardware
| Component | Detail |
|---|---|
| **System** | HP Z640 Workstation |
| **CPU** | Intel Xeon (workstation class) |
| **RAM** | 64 GB ECC |
| **OS Storage** | 2× Samsung 850 EVO 500 GB — ZFS mirror (`rpool`) |
| **Data Storage** | 4× Seagate 2 TB — ZFS RAIDZ2 encrypted (data pool) |
| **GPU 1** | Intel Arc A310 (Sparkle) 4 GB — Jellyfin VA-API transcoding |
| **GPU 2** | EVGA GeForce RTX 3060 XC 12 GB GDDR6 — Ollama local LLM inference |
| **Current state** | Proxmox VE installed, organic/messy config — scheduled for clean rebuild |
---
## Phase Overview
```mermaid
gantt
title HP Z640 Rebuild — Phase Sequence
dateFormat YYYY-MM-DD
axisFormat Phase
section Prerequisite
Phase 0a — Pre-audit (SSH) :crit, p0a, 2026-04-01, 1d
Phase 0b — USB backup :crit, p0b, after p0a, 1d
section Core Build
Phase 1 — Proxmox clean install :crit, p1, after p0b, 2d
Phase 2 — Core infrastructure LXCs :p2, after p1, 2d
section Services
Phase 3 — Media stack :p3, after p2, 2d
Phase 4a — Networking + security :p4a, after p3, 1d
Phase 4b — Agent stack (trillian) :p4b, after p4a, 2d
section Automation
Phase 5 — IaC + automation :p5, after p4b, 3d
```
---
## Phase 0a: Pre-Audit
> **GATE — Nothing proceeds until this is complete.**
Capture the current state of the Z640 before any destructive action. The rebuild will wipe LXC and VM configuration.
**Scope:**
- [ ] ZFS pool layout (`rpool` mirror + data pool RAIDZ2) — names, health, encryption status
- [ ] VM and LXC inventory — all IDs, names, disk sizes, network config
- [ ] Arr stack config and data paths (Sonarr, Radarr, Prowlarr, etc.)
- [ ] Jellyfin config path and media library paths
- [ ] PBS datastore paths and retention config
- [ ] Network config — bridges, VLANs, IP assignments
- [ ] Cron jobs — all scheduled tasks
- [ ] Running services summary
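
One way to capture the scope above is a small script that writes each command's output to its own file, so the results can be copied off the host before the wipe. This is a sketch, not a complete audit: `capture`, `run_audit`, and the `AUDIT_DIR` default are hypothetical names, and the Arr, Jellyfin, and PBS paths still need to be recorded by hand because they depend on how each service was installed.

```bash
#!/usr/bin/env bash
# Sketch of a pre-audit capture. AUDIT_DIR and the helper names are assumptions.
AUDIT_DIR="${AUDIT_DIR:-/root/pre-audit}"

# capture NAME CMD [ARGS...]: save stdout and stderr of CMD to $AUDIT_DIR/NAME.txt.
# A failing or missing command is reported but does not abort the audit.
capture() {
  local name="$1"; shift
  mkdir -p "$AUDIT_DIR"
  if ! "$@" > "$AUDIT_DIR/$name.txt" 2>&1; then
    echo "WARN: '$*' failed; see $AUDIT_DIR/$name.txt" >&2
  fi
}

run_audit() {
  capture zpool-status   zpool status -v
  capture zfs-list       zfs list -o name,used,avail,mountpoint
  capture zfs-encryption zfs get -t filesystem,volume encryption,keystatus,keyformat
  capture vm-list        qm list
  capture lxc-list       pct list
  # Per-guest config files hold disk sizes and network settings
  capture guest-configs  sh -c 'for f in /etc/pve/lxc/*.conf /etc/pve/qemu-server/*.conf; do echo "== $f"; cat "$f"; done'
  capture network        cat /etc/network/interfaces
  capture ip-addr        ip -br addr
  capture cron-root      crontab -l
  capture cron-system    ls -l /etc/cron.d /etc/cron.daily /etc/cron.weekly
  capture services       systemctl list-units --type=service --state=running
}
```

Running `run_audit` as root on the Z640 and then copying `$AUDIT_DIR` to another machine keeps the audit available after the reinstall.
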
---
## Phase 0b: USB Backup
> **GATE — USB backup must complete before Phase 1. No exceptions.**
Full backup of the ZFS data pool to external USB before any rebuild touches storage.
- [ ] Attach external USB drive to Z640
- [ ] Verify USB drive capacity (must exceed used space on data pool)
- [ ] Snapshot the data pool and send the snapshot stream to USB
```bash
# Capture used space first
zpool list
zfs list
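# Snapshot recursively so child datasets are included in the -R stream
# (adjust pool/dataset names from audit output)
zfs snapshot -r datapool@pre-rebuild
# Send to USB; -w (raw) keeps encrypted datasets encrypted in the stream
set -o pipefail  # otherwise $? reports pv's status, not zfs send's
zfs send -R -w datapool@pre-rebuild | pv > /mnt/usb/datapool-pre-rebuild.zfs
# Verify send completed without error
echo "Exit code: $?"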
```
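
The capacity check in the list above can be scripted rather than read off by eye. The sketch below assumes the pool is named `datapool` and the USB drive is mounted at `/mnt/usb`, as in the commands above; the function names and the 5% headroom are illustrative choices, not requirements from the plan.

```bash
#!/usr/bin/env bash
# Sketch of the capacity check. Pool name and mount point are placeholders.

# fits_on_usb USED_BYTES AVAIL_BYTES: succeed only if used + 5% headroom fits.
fits_on_usb() {
  local used="$1" avail="$2"
  # Integer arithmetic: used + used/20 approximates used * 1.05
  [ $(( used + used / 20 )) -le "$avail" ]
}

check_usb_capacity() {
  local pool="${1:-datapool}" mount="${2:-/mnt/usb}" used avail
  # -H drops the header, -p prints exact byte counts
  used="$(zfs list -Hp -o used "$pool")"
  # GNU df: available bytes on the filesystem holding the mount point
  avail="$(df -B1 --output=avail "$mount" | tail -n 1 | tr -d ' ')"
  if fits_on_usb "$used" "$avail"; then
    echo "OK: $used bytes used, $avail bytes available on $mount"
  else
    echo "FAIL: $used bytes used will not fit in $avail bytes on $mount" >&2
    return 1
  fi
}
```

The `used` property is only an approximation of the stream size. A dry run such as `zfs send -R -n -v datapool@pre-rebuild` prints an estimate for the actual stream and is worth comparing against the same free-space figure.
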
---
## Phase 1: Proxmox Clean Install
> **GATE — Phase 0a audit complete. Phase 0b USB backup verified.**
Fresh Proxmox VE install. Import existing ZFS pools. Establish baseline network config.
- [ ] Download latest stable Proxmox VE ISO
- [ ] Write ISO to USB installer
- [ ] Boot Z640 from installer USB
- [ ] Install Proxmox VE — **do not touch the data pool disks**
- [ ] Import data pool:
```bash
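# -f forces the import if the pool was not exported by the old install
zpool import -f datapool
# Load the encryption key (prompts for a passphrase if one is used)
zfs load-key datapool
zfs mount -a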
```
- [ ] Verify pool health: `zpool status && zfs list`
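
The health check can also be made pass/fail so it is usable as a gate for Phase 2. This is a minimal sketch; `all_pools_online` is a hypothetical helper that reads the tab-separated output of `zpool list -H -o name,health`.

```bash
#!/usr/bin/env bash
# all_pools_online: read `zpool list -H -o name,health` output on stdin and
# succeed only if at least one pool is listed and every pool reports ONLINE.
all_pools_online() {
  awk '
    { seen = 1 }
    $2 != "ONLINE" { print "pool " $1 " is " $2 > "/dev/stderr"; bad = 1 }
    END { exit (bad || !seen) ? 1 : 0 }
  '
}
```

Usage on the host would look like `zpool list -H -o name,health | all_pools_online && echo "pools healthy"`.
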
### Network Config
VLAN scheme: `10.42.0.0/16` supernet. VLAN ID = third octet of the subnet.
| VLAN ID | Subnet | Purpose |
|---|---|---|
| 10 | 10.42.10.0/24 | Management |
| 20 | 10.42.20.0/24 | LAN / trusted devices |
| 60 | 10.42.60.0/24 | AI-Agents |
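
Because the VLAN ID doubles as the third octet, the subnet for any VLAN can be derived mechanically. The helper below is an illustrative sketch (the name `vlan_subnet` is not from the plan); it also shows a limit of the scheme, since only IDs that fit in one octet can be mapped this way.

```bash
#!/usr/bin/env bash
# vlan_subnet VLAN_ID: print the /24 for a VLAN under the 10.42.0.0/16 scheme.
vlan_subnet() {
  local id="$1"
  # Accept only plain decimal numbers from 1 to 255 (no leading zeros)
  if [[ ! "$id" =~ ^[1-9][0-9]{0,2}$ ]] || (( id > 255 )); then
    echo "VLAN ID must be 1-255 to fit in one octet: '$id'" >&2
    return 1
  fi
  echo "10.42.${id}.0/24"
}
```

For example, `vlan_subnet 60` prints `10.42.60.0/24`, matching the AI-Agents row in the table.
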
---
## Phase 2: Core Infrastructure LXCs
> **GATE — Proxmox clean install complete. ZFS pools healthy.**
### 2a — PBS LXC (Proxmox Backup Server)
- [ ] Create LXC for PBS (unprivileged, Debian base)
- [ ] Assign a datastore path on the data pool
- [ ] Configure PBS retention policy
- [ ] Register PBS in Proxmox
- [ ] Test backup of a throwaway LXC
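
The container creation step can be sketched with `pct create`. Everything specific here is an assumption for illustration: the VMID, the template filename (check `pveam available` for the current Debian 12 template), the `local-zfs` storage name, the `vmbr0` bridge, the sizing, and the choice of VLAN. The function only prints the command so it can be reviewed before it is run.

```bash
#!/usr/bin/env bash
# pct_create_cmd VMID HOSTNAME VLAN_TAG: print a `pct create` command for review.
# Template, storage, bridge, and sizing are placeholders, not values from the plan.
pct_create_cmd() {
  local vmid="$1" hostname="$2" vlan="$3"
  local template="${TEMPLATE:-local:vztmpl/debian-12-standard_12.7-1_amd64.tar.zst}"
  local storage="${STORAGE:-local-zfs}"
  echo "pct create ${vmid} ${template}" \
    "--hostname ${hostname}" \
    "--unprivileged 1" \
    "--cores 2 --memory 2048" \
    "--rootfs ${storage}:8" \
    "--net0 name=eth0,bridge=vmbr0,tag=${vlan},ip=dhcp" \
    "--onboot 1"
}
```

For example, `pct_create_cmd 101 pbs 10` prints a command for an unprivileged container tagged into VLAN 10. The same helper applies to the other Phase 2 containers with a different VMID, hostname, and VLAN tag.
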
### 2b — WireGuard LXC
- [ ] Create LXC for WireGuard
- [ ] Install WireGuard
- [ ] Configure as spoke to CHR01
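
A spoke configuration in `wg-quick` format can be rendered from a few inputs. This is a sketch only: `render_wg_spoke` is a hypothetical helper, and the keys, hub endpoint, tunnel address, and allowed networks are all supplied by the caller because the plan does not state them.

```bash
#!/usr/bin/env bash
# render_wg_spoke PRIVATE_KEY HUB_PUBLIC_KEY HUB_ENDPOINT TUNNEL_ADDRESS ALLOWED_IPS
# Prints a wg-quick style config for a spoke. No value here comes from the plan.
render_wg_spoke() {
  local private_key="$1" hub_public_key="$2" hub_endpoint="$3"
  local tunnel_address="$4" allowed_ips="$5"
  cat <<EOF
[Interface]
PrivateKey = ${private_key}
Address = ${tunnel_address}

[Peer]
# Hub (CHR01)
PublicKey = ${hub_public_key}
Endpoint = ${hub_endpoint}
AllowedIPs = ${allowed_ips}
# Keep NAT mappings alive, since the spoke initiates the tunnel
PersistentKeepalive = 25
EOF
}
```

A key pair can be generated with `wg genkey | tee private.key | wg pubkey > public.key`. After writing the rendered output to `/etc/wireguard/wg0.conf`, `systemctl enable --now wg-quick@wg0` brings the tunnel up and `wg show` reports the latest handshake.
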
### 2c — Monitoring LXC
- [ ] Create LXC for monitoring stack
- [ ] Install Prometheus + Grafana
- [ ] Add Proxmox node as scrape target
- [ ] Basic dashboard: CPU, RAM, ZFS pool health, network
---
## Phase 3: Media Stack
> **GATE — Phase 2 complete. ZFS data pool mounted and healthy.**
### 3a — Jellyfin LXC with Intel Arc A310
- [ ] Create LXC (privileged, which simplifies GPU device access)
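- [ ] Expose the Intel Arc A310 to the LXC by bind-mounting the host's `/dev/dri` device nodes (an LXC shares the host kernel, so IOMMU/VFIO passthrough applies only to VMs)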
- [ ] Install Jellyfin
- [ ] Bind-mount media library paths from ZFS data pool
- [ ] Configure VA-API hardware transcoding
```bash
# Verify VA-API inside LXC
vainfo
# Expected: shows Intel iHD driver, H264/HEVC encode/decode profiles
```
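
For `vainfo` to find the card, the container needs access to the host's DRM device nodes. A common way to do this is to add two raw LXC lines to the container's config file under `/etc/pve/lxc/`. The helper below is a sketch that prints those lines; the function name is illustrative and the container's VMID is not specified in this plan.

```bash
#!/usr/bin/env bash
# dri_passthrough_conf: print the raw LXC lines that expose the host's /dev/dri
# (DRM devices, character major 226) to a container.
dri_passthrough_conf() {
  cat <<'EOF'
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
EOF
}
```

The output would be appended to the container's `.conf` file and the container restarted. With two GPUs in the host, `/dev/dri` may contain more than one render node, so `ls -l /dev/dri/by-path` on the host is a useful check for which `renderD*` node belongs to the Arc A310 before pointing Jellyfin at it.
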
### 3b — Arr Stack LXCs or Docker
- [ ] Determine migration target: individual LXCs or single Docker LXC
- [ ] Restore arr config from paths captured in audit
- [ ] Verify indexer connectivity (Prowlarr)
- [ ] Verify download client connectivity
- [ ] Verify library scan in Sonarr/Radarr against restored media paths
---
## Phase 4a: Networking + Security
> **GATE — Media stack verified functional.**
- [ ] All LXCs assigned to correct VLANs
- [ ] Proxmox firewall rules: deny inter-VLAN by default, permit explicitly
- [ ] VLAN 60 (AI-Agents) isolated — only permitted outbound: DNS, HTTPS, WireGuard tunnel
- [ ] WireGuard tunnel to CHR01 confirmed up and passing traffic
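
The VLAN 60 egress policy can be expressed as a per-guest Proxmox firewall file. The sketch below prints a candidate `/etc/pve/firewall/<VMID>.fw`; the function name is hypothetical, and UDP port 51820 is only WireGuard's conventional default, so the real tunnel port should be substituted.

```bash
#!/usr/bin/env bash
# agent_vlan_fw [WG_PORT]: print a guest firewall file that drops everything
# except outbound DNS, HTTPS, and WireGuard. WG_PORT defaults to 51820 (assumed).
agent_vlan_fw() {
  local wg_port="${1:-51820}"
  cat <<EOF
[OPTIONS]
enable: 1
policy_in: DROP
policy_out: DROP

[RULES]
OUT ACCEPT -p udp -dport 53 # DNS
OUT ACCEPT -p tcp -dport 53 # DNS over TCP
OUT ACCEPT -p tcp -dport 443 # HTTPS
OUT ACCEPT -p udp -dport ${wg_port} # WireGuard tunnel
EOF
}
```

For the agent container this would be written as `agent_vlan_fw > /etc/pve/firewall/112.fw`. Guest rules only take effect when the firewall is also enabled at the datacenter level and on the container's network interface (`firewall=1` on `net0`).
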
---
## Phase 4b: Agent Stack — Open WebUI (LXC: `trillian`, VMID 112, VLAN 60)
> **GATE — Phase 4a network config complete. VLAN 60 operational.**
**Goal:** Deploy Open WebUI backed by Ollama on the RTX 3060.
### Architecture
```mermaid
flowchart TD
User["User (VPN connected)"]
VPS01["VPS01\nCaddy reverse proxy\ntherapon.yourdomain.com"]
WG["WireGuard tunnel\nCHR01 ↔ trillian"]
Caddy["Caddy (trillian LXC)\nInternal reverse proxy"]
WebUI["Open WebUI\nDocker container"]
Ollama["Ollama\nDocker container"]
GPU["RTX 3060 XC 12 GB\nGPU passthrough"]
User --> VPS01
VPS01 --> WG
WG --> Caddy
Caddy --> WebUI
WebUI --> Ollama
Ollama --> GPU
```
### Tasks
- [ ] Create privileged LXC `trillian` — VMID 112, VLAN 60, Debian 12
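- [ ] Expose the EVGA RTX 3060 to the LXC by bind-mounting the host's `/dev/nvidia*` device nodes (the NVIDIA kernel driver runs on the host; IOMMU/VFIO passthrough applies only to VMs)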
- [ ] Install Docker inside LXC
- [ ] Verify GPU visible inside LXC: `nvidia-smi`
- [ ] Deploy Ollama container with GPU passthrough
- [ ] Deploy Open WebUI container
- [ ] Configure Caddy reverse proxy
- [ ] Test end-to-end: VPN on, browser to internal URL, model inference working
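
The two container deployments can be sketched as plain `docker run` commands. The assumptions are labeled in the comments: the `agentnet` network name, the volume names, and the host port 3000 are illustrative, and `--gpus=all` only works once the NVIDIA Container Toolkit is installed inside the LXC. By default the script prints the commands instead of running them.

```bash
#!/usr/bin/env bash
# Sketch: start Ollama and Open WebUI with Docker. DRY_RUN=1 (the default here)
# only prints each command so it can be reviewed first.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

deploy_agent_stack() {
  # User-defined network so Open WebUI can reach Ollama by container name
  run docker network create agentnet
  # --gpus=all requires the NVIDIA Container Toolkit in the LXC
  run docker run -d --name ollama --network agentnet --gpus=all \
    -v ollama:/root/.ollama --restart unless-stopped ollama/ollama
  # Publish the UI on localhost only; Caddy proxies to it
  run docker run -d --name open-webui --network agentnet \
    -e OLLAMA_BASE_URL=http://ollama:11434 \
    -v open-webui:/app/backend/data \
    -p 127.0.0.1:3000:8080 --restart unless-stopped \
    ghcr.io/open-webui/open-webui:main
}
```

Calling `deploy_agent_stack` prints the three commands; `DRY_RUN=0 deploy_agent_stack` executes them. Running `docker exec ollama nvidia-smi` afterwards is a quick way to confirm the container can see the RTX 3060.
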
---
## Phase 5: IaC + Automation
> **GATE — Full stack deployed and verified functional.**
- [ ] Configure Terraform Proxmox provider (`bpg/proxmox`)
- [ ] Write Terraform modules for LXC and VM templates
- [ ] Import existing LXCs into Terraform state
- [ ] Write Ansible playbooks for LXC configuration
- [ ] Deploy HashiCorp Vault LXC
- [ ] Migrate secrets from manual config to Vault
---
## Future Considerations (Not in Scope)
| Item | Notes |
|------|-------|
| UPS (APC or similar) | Worthwhile — deferred beyond Phase 5 |
| Second NIC for dedicated storage network | Optional optimisation |
| GPU upgrade beyond RTX 3060 | Not needed at current model sizes |