# Home Lab Build Plan — HP Z640

## Hardware

| Component | Detail |
|---|---|
| **System** | HP Z640 Workstation |
| **CPU** | Intel Xeon (workstation class) |
| **RAM** | 64 GB ECC |
| **OS Storage** | 2× Samsung 850 EVO 500 GB — ZFS mirror (`rpool`) |
| **Data Storage** | 4× Seagate 2 TB — ZFS RAIDZ2, encrypted (data pool) |
| **GPU 1** | Intel Arc A310 (Sparkle) 4 GB — Jellyfin VA-API transcoding |
| **GPU 2** | EVGA GeForce RTX 3060 XC 12 GB GDDR6 — Ollama local LLM inference |
| **Current state** | Proxmox VE installed with an organically grown, messy config — scheduled for a clean rebuild |
---

## Phase Overview

```mermaid
gantt
    title HP Z640 Rebuild — Phase Sequence
    dateFormat YYYY-MM-DD
    axisFormat %d %b

    section Prerequisite
    Phase 0a — Pre-audit (SSH)         :crit, p0a, 2026-04-01, 1d
    Phase 0b — USB backup              :crit, p0b, after p0a, 1d

    section Core Build
    Phase 1 — Proxmox clean install    :crit, p1, after p0b, 2d
    Phase 2 — Core infrastructure LXCs :p2, after p1, 2d

    section Services
    Phase 3 — Media stack              :p3, after p2, 2d
    Phase 4a — Networking + security   :p4a, after p3, 1d
    Phase 4b — Agent stack (trillian)  :p4b, after p4a, 2d

    section Automation
    Phase 5 — IaC + automation         :p5, after p4b, 3d
```
---

## Phase 0a: Pre-Audit

> **GATE — Nothing proceeds until this is complete.**

Capture the current state of the Z640 before any destructive action. The rebuild will wipe all LXC and VM configuration.

**Scope:**

- [ ] ZFS pool layout (`rpool` mirror + data pool RAIDZ2) — names, health, encryption status
- [ ] VM and LXC inventory — all IDs, names, disk sizes, network config
- [ ] Arr stack config and data paths (Sonarr, Radarr, Prowlarr, etc.)
- [ ] Jellyfin config path and media library paths
- [ ] PBS datastore paths and retention config
- [ ] Network config — bridges, VLANs, IP assignments
- [ ] Cron jobs — all scheduled tasks
- [ ] Running services summary
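The scope above can be captured in one pass with a small script. This is a sketch, not the project's tooling: the output directory and the exact command list are assumptions to adjust against the real node, and each capture is guarded so a missing tool leaves a note instead of aborting the audit.

```shell
#!/usr/bin/env bash
# Sketch: capture pre-rebuild state into one directory (paths are assumptions).
set -u
OUT="${AUDIT_DIR:-./z640-audit}"
mkdir -p "$OUT"

# Guarded capture: record output if the tool exists, otherwise note its absence.
capture() {
  local name="$1"; shift
  if command -v "$1" >/dev/null 2>&1; then
    "$@" > "$OUT/$name.txt" 2>&1 || true
  else
    echo "missing: $1" > "$OUT/$name.txt"
  fi
}

capture zpool-status  zpool status -v
capture zfs-list      zfs list -o name,used,avail,encryption,keystatus
capture lxc-inventory pct list
capture vm-inventory  qm list
capture network       cat /etc/network/interfaces
capture crontab       crontab -l
capture services      systemctl list-units --type=service --state=running
echo "Audit written to $OUT"
```

Run it over SSH, then copy the directory off the host before Phase 0b.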
---

## Phase 0b: USB Backup

> **GATE — USB backup must complete before Phase 1. No exceptions.**

Full backup of the ZFS data pool to external USB before the rebuild touches storage.

- [ ] Attach external USB drive to Z640
- [ ] Verify USB drive capacity (must exceed used space on the data pool)
- [ ] Export pool snapshot and send to USB

```bash
# Capture used space first
zpool list
zfs list

# Send encrypted data pool to USB (adjust pool/dataset names from audit output).
# The snapshot must be recursive (-r) for `zfs send -R` to work.
zfs snapshot -r datapool@pre-rebuild
zfs send -R datapool@pre-rebuild | pv > /mnt/usb/datapool-pre-rebuild.zfs

# Verify the send completed without error. $? alone would report pv's status;
# PIPESTATUS shows the zfs send side of the pipe as well.
echo "Exit codes: ${PIPESTATUS[*]}"
```
---

## Phase 1: Proxmox Clean Install

> **GATE — Phase 0a audit complete. Phase 0b USB backup verified.**

Fresh Proxmox VE install. Import existing ZFS pools. Establish baseline network config.

- [ ] Download latest stable Proxmox VE ISO
- [ ] Write ISO to USB installer
- [ ] Boot Z640 from installer USB
- [ ] Install Proxmox VE — **do not touch the data pool disks**
- [ ] Import data pool:

```bash
zpool import -f datapool
zfs load-key datapool
zfs mount -a
```

- [ ] Verify pool health: `zpool status && zfs list`

### Network Config

VLAN scheme: `10.42.0.0/16` supernet. VLAN ID = third octet of the subnet.

| VLAN ID | Subnet | Purpose |
|---|---|---|
| 10 | 10.42.10.0/24 | Management |
| 20 | 10.42.20.0/24 | LAN / trusted devices |
| 60 | 10.42.60.0/24 | AI-Agents |
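The VLAN-to-subnet convention is mechanical, so it can be expressed as a one-liner for use in later automation. A tiny sketch (the function name is mine, the addressing scheme is from the table above):

```python
def vlan_subnet(vlan_id: int) -> str:
    """Map a VLAN ID to its /24 under the 10.42.0.0/16 supernet.

    Per the scheme above, the VLAN ID becomes the third octet.
    """
    if not 0 <= vlan_id <= 255:
        raise ValueError("VLAN ID must fit in one octet for this scheme")
    return f"10.42.{vlan_id}.0/24"


print(vlan_subnet(60))  # → 10.42.60.0/24
```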
---

## Phase 2: Core Infrastructure LXCs

> **GATE — Proxmox clean install complete. ZFS pools healthy.**

### 2a — PBS LXC (Proxmox Backup Server)

- [ ] Create LXC for PBS (unprivileged, Debian base)
- [ ] Assign a datastore path on the data pool
- [ ] Configure PBS retention policy
- [ ] Register PBS in Proxmox
- [ ] Test backup of a throwaway LXC

### 2b — WireGuard LXC

- [ ] Create LXC for WireGuard
- [ ] Install WireGuard
- [ ] Configure as spoke to CHR01
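A spoke config has this general shape. Everything below is a placeholder sketch: keys, the tunnel address, and CHR01's endpoint must come from the hub's actual peer configuration.

```ini
# /etc/wireguard/wg0.conf — spoke side (all keys, addresses, and the
# endpoint below are placeholders)
[Interface]
PrivateKey = <spoke-private-key>
Address = <tunnel-address>/24

[Peer]
# CHR01 hub
PublicKey = <chr01-public-key>
Endpoint = <chr01-public-address>:51820
AllowedIPs = 10.42.0.0/16
PersistentKeepalive = 25
```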
### 2c — Monitoring LXC

- [ ] Create LXC for monitoring stack
- [ ] Install Prometheus + Grafana
- [ ] Add Proxmox node as scrape target
- [ ] Basic dashboard: CPU, RAM, ZFS pool health, network
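One common way to scrape a Proxmox node is the community `prometheus-pve-exporter`, which uses Prometheus's multi-target relabeling pattern. The job name, node address, and exporter location below are assumptions, not this lab's actual config:

```yaml
# prometheus.yml fragment — sketch assuming prometheus-pve-exporter
# runs on the monitoring LXC (default port 9221)
scrape_configs:
  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets:
          - 10.42.10.5   # Proxmox node address (placeholder)
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 127.0.0.1:9221  # the exporter, not the node itself
```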
---

## Phase 3: Media Stack

> **GATE — Phase 2 complete. ZFS data pool mounted and healthy.**

### 3a — Jellyfin LXC with Intel Arc A310

- [ ] Create LXC (privileged — required for GPU passthrough)
- [ ] Pass through Intel Arc A310 via IOMMU / device passthrough
- [ ] Install Jellyfin
- [ ] Bind-mount media library paths from ZFS data pool
- [ ] Configure VA-API hardware transcoding

```bash
# Verify VA-API inside the LXC
vainfo
# Expected: Intel iHD driver with H264/HEVC encode/decode profiles
```
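The GPU and media bind-mounts land in the container's config file. A sketch, assuming a hypothetical VMID and dataset path; the `/dev/dri` lines are the commonly used pattern for Intel VA-API passthrough into an LXC (char-device major 226 is the DRI subsystem):

```ini
# /etc/pve/lxc/<vmid>.conf — illustrative; adjust VMID and paths to the audit
mp0: /datapool/media,mp=/mnt/media
lxc.cgroup2.devices.allow: c 226:* rwm
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
```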
### 3b — Arr Stack LXCs or Docker

- [ ] Determine migration target: individual LXCs or a single Docker LXC
- [ ] Restore arr config from the paths captured in the audit
- [ ] Verify indexer connectivity (Prowlarr)
- [ ] Verify download client connectivity
- [ ] Verify library scan in Sonarr/Radarr against restored media paths
---

## Phase 4a: Networking + Security

> **GATE — Media stack verified functional.**

- [ ] All LXCs assigned to correct VLANs
- [ ] Proxmox firewall rules: deny inter-VLAN by default, permit explicitly
- [ ] VLAN 60 (AI-Agents) isolated — only permitted outbound: DNS, HTTPS, WireGuard tunnel
- [ ] WireGuard tunnel to CHR01 confirmed up and passing traffic
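The VLAN 60 egress policy can be expressed per-guest in the Proxmox firewall's `.fw` file format. A sketch only: the WireGuard port and the decision to enforce per-guest rather than at the datacenter level are assumptions to verify against the Proxmox firewall docs.

```ini
# /etc/pve/firewall/<vmid>.fw — default-deny egress with explicit allows
[OPTIONS]
enable: 1
policy_out: DROP

[RULES]
OUT ACCEPT -p udp -dport 53     # DNS
OUT ACCEPT -p tcp -dport 443    # HTTPS
OUT ACCEPT -p udp -dport 51820  # WireGuard (placeholder port)
```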
---

## Phase 4b: Agent Stack — Open WebUI (LXC: `trillian`, VMID 112, VLAN 60)

> **GATE — Phase 4a network config complete. VLAN 60 operational.**

**Goal:** Deploy Open WebUI backed by Ollama on the RTX 3060.

### Architecture

```mermaid
flowchart TD
    User["User (VPN connected)"]
    VPS01["VPS01<br/>Caddy reverse proxy<br/>therapon.yourdomain.com"]
    WG["WireGuard tunnel<br/>CHR01 ↔ trillian"]
    Caddy["Caddy (trillian LXC)<br/>Internal reverse proxy"]
    WebUI["Open WebUI<br/>Docker container"]
    Ollama["Ollama<br/>Docker container"]
    GPU["RTX 3060 XC 12 GB<br/>GPU passthrough"]

    User --> VPS01
    VPS01 --> WG
    WG --> Caddy
    Caddy --> WebUI
    WebUI --> Ollama
    Ollama --> GPU
```

### Tasks

- [ ] Create privileged LXC `trillian` — VMID 112, VLAN 60, Debian 12
- [ ] Pass through EVGA RTX 3060 via IOMMU
- [ ] Install Docker inside the LXC
- [ ] Verify GPU visible inside the LXC: `nvidia-smi`
- [ ] Deploy Ollama container with GPU passthrough
- [ ] Deploy Open WebUI container
- [ ] Configure Caddy reverse proxy
- [ ] Test end-to-end: VPN on, browser to internal URL, model inference working
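The two containers can be deployed together with Compose. A sketch under assumptions: image tags and volume names are illustrative, and the GPU reservation requires the NVIDIA container toolkit to be installed inside `trillian` first.

```yaml
# docker-compose.yml — sketch for the trillian agent stack
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    ports:
      - "127.0.0.1:8080:8080"  # local only; Caddy terminates in front
    depends_on:
      - ollama
    volumes:
      - open-webui:/app/backend/data

volumes:
  ollama:
  open-webui:
```

Binding Open WebUI to `127.0.0.1` keeps the internal Caddy proxy as the only entry point, matching the architecture diagram.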
---

## Phase 5: IaC + Automation

> **GATE — Full stack deployed and verified functional.**

- [ ] Configure Terraform Proxmox provider (`bpg/proxmox`)
- [ ] Write Terraform modules for LXC and VM templates
- [ ] Import existing LXCs into Terraform state
- [ ] Write Ansible playbooks for LXC configuration
- [ ] Deploy HashiCorp Vault LXC
- [ ] Migrate secrets from manual config to Vault
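Provider wiring for the first two items has this rough shape. Endpoint, node name, and the import ID format are assumptions to check against the installed `bpg/proxmox` provider version before use:

```hcl
terraform {
  required_providers {
    proxmox = {
      source = "bpg/proxmox"
    }
  }
}

provider "proxmox" {
  endpoint  = "https://<proxmox-host>:8006/"  # placeholder
  api_token = var.proxmox_api_token
}

# Existing LXCs are adopted into state rather than recreated, e.g.:
#   terraform import proxmox_virtual_environment_container.trillian <node>/112
resource "proxmox_virtual_environment_container" "trillian" {
  node_name = "<node>"  # placeholder
  vm_id     = 112
}
```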
---

## Future Considerations (Not in Scope)

| Item | Notes |
|------|-------|
| UPS (APC or similar) | Worthwhile — deferred beyond Phase 5 |
| Second NIC for dedicated storage network | Optional optimisation |
| GPU upgrade beyond RTX 3060 | Not needed at current model sizes |
|