portfolio/agentlab/architecture.md
AgentLab d5ef629a54 feat: initial AgentLab portfolio content
Architecture, overview, homelab build plan, agent handbook, ADRs,
and agent operating rules. All sensitive operational details sanitized
(real IPs, hostnames, client names replaced with generic placeholders).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 04:52:42 +00:00


# AgentLab System Architecture
> **Scope:** Full system map — hardware, networking, AI stack, three-lane operating model,
> and open architectural decisions.
---
## Contents
- [Hardware Inventory](#hardware-inventory)
- [Three-Lane Operating Model](#three-lane-operating-model)
- [Network Topology](#network-topology)
- [VLAN Scheme](#vlan-scheme)
- [VPS Platform Services](#vps-platform-services)
- [AI Stack](#ai-stack)
- [AgentLab Orchestrator](#agentlab-orchestrator)
- [Chat and Voice Access](#chat-and-voice-access)
- [Backup and Recovery](#backup-and-recovery)
- [Open Architectural Decisions](#open-architectural-decisions)
- [Security Boundaries](#security-boundaries)
---
## Hardware Inventory
| Node | Type | Status | Role |
|------|------|--------|------|
| MacBook Pro M5 24 GB | Operator workstation | Active | Primary operator machine, Claude Code devcontainer, local Ollama (privacy), portable to client sites |
| HP Z640 64 GB RAM | Home server | Needs rebuild | Proxmox VE host — agent stack, media server, persistent Ollama inference (RTX 3060) |
| Intel NUC (home office) | Small server | Active | Proxmox Backup Server, VPS recovery testing lab |
| CHR01 | MikroTik Cloud Hosted Router | Active | WireGuard hub (all home lab spokes), SSTP hub for client MikroTik routers |
| CHR02 | MikroTik Cloud Hosted Router | Planned | Failover hub — design undefined |
| VPS01 | VPS | Active | Forgejo, Caddy reverse proxy, monitoring stack, WireGuard spoke |
| VPS02 | VPS | Partial | Warm standby for VPS01 — manual failover only, sync method undefined |
| Client routers | MikroTik RouterOS v6 | Active | SSTP tunnels to CHR01 ([MSP Client A] — multiple sites) |
| Android tablet | Mobile | Phase Later | Dashboard, demo device |
### MacBook Pro M5 — Workstation Baseline
| Item | Value |
|------|-------|
| Chip | Apple M5 |
| CPU cores | 10 total — 4 performance, 6 efficiency |
| Unified memory | 24 GB |
### HP Z640 — GPU Inventory
| GPU | VRAM | Role |
|-----|------|------|
| Intel Arc A310 (Sparkle) | 4 GB | Jellyfin VA-API transcoding |
| EVGA GeForce RTX 3060 XC 12 GB GDDR6 | 12 GB | Ollama local LLM inference |
### Intel NUC — Home Office Node
| Item | Value |
|------|-------|
| CPU | Intel i7 10th gen (~10 cores) |
| RAM | 8 GB (upgradeable) |
| Storage | 500 GB SSD + 2× 2 TB SSDs |
| Role | Proxmox Backup Server + VPS recovery lab |
### Proxmox Node Naming (Hitchhiker's Guide Theme)
- Existing nodes: `deep-thought`, `ford`, `marvin`, `zaphod`, `slartibartfast`, `hactar`, `magrathea`
- Planned: `trillian` (VMID 112) — Open WebUI + Ollama LXC on VLAN 60
---
## Three-Lane Operating Model
All three lanes require access to a local model. Sensitive or private data must never leave the device.
| Lane | Purpose | Primary devices | Frontier API allowed | Local model requirement |
|------|---------|----------------|----------------------|------------------------|
| **Enterprise / MSP** | ScreamSaver IT + AI-native MSP brand. Client data cleanup, CSV processing, password review, business docs | Mac + Z640 | Claude (Anthropic API) | Required — sensitive client data |
| **Personal projects** | Creator work | Mac + Z640 | Claude + Gemini | Required — privacy |
| **Private / personal** | Private inquiries, financial, health | Mac + Z640 | None | Hard requirement — no frontier API |
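**Practical implementation:** Open WebUI (self-hosted) with per-conversation model switching.
Local Ollama is the only backend for the private lane and is also available to the other two lanes.
The Claude API serves the enterprise and personal project lanes; the Gemini API serves the personal project lane only.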
```mermaid
graph TD
subgraph Lanes["Three-Lane Operating Model"]
direction TB
L1["Enterprise / MSP Lane\nClient data · CSV · business docs"]
L2["Personal Projects Lane\nCreator work · experiments"]
L3["Private / Personal Lane\nFinancial · health · personal"]
end
subgraph Models["Model Routing"]
FM1["Claude API\n(Anthropic Console)"]
FM2["Gemini API"]
LM["Local Ollama\nRTX 3060 / Mac Apple Silicon"]
end
L1 -->|"Allowed"| FM1
L2 -->|"Allowed"| FM1
L2 -->|"Allowed"| FM2
L3 -->|"Hard block — no frontier"| LM
L1 -->|"Required"| LM
L2 -->|"Required"| LM
style L3 fill:#7f1d1d,color:#fca5a5
style LM fill:#1e3a5f,color:#93c5fd
```
> **Open decision — OL-THREE-LANE-FORMALISE:** The three-lane model is a policy
> concept. It is not yet technically enforced. Routing, retention boundaries, and migration
> to permanent hardware are still undefined.
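
If the lane policy were enforced in code rather than by convention, the check could be as small as an allow-list per lane. The sketch below is illustrative only: the backend labels (`ollama-local`, `claude-api`, `gemini-api`) and the function names are hypothetical, not taken from Open WebUI or the orchestrator. It encodes the table above and rejects any route that would send private-lane traffic to a frontier API.

```python
from enum import Enum


class Lane(Enum):
    ENTERPRISE = "enterprise"
    PERSONAL_PROJECTS = "personal_projects"
    PRIVATE = "private"


# Hypothetical backend labels; they are not real configuration keys.
LOCAL = "ollama-local"
ALLOWED_BACKENDS = {
    Lane.ENTERPRISE: {LOCAL, "claude-api"},
    Lane.PERSONAL_PROJECTS: {LOCAL, "claude-api", "gemini-api"},
    Lane.PRIVATE: {LOCAL},  # hard requirement: no frontier API
}


def is_allowed(lane: Lane, backend: str) -> bool:
    """Return True if the lane policy permits this backend."""
    return backend in ALLOWED_BACKENDS[lane]


def check_route(lane: Lane, backend: str) -> None:
    """Raise PermissionError before a request leaves for a disallowed backend."""
    if not is_allowed(lane, backend):
        raise PermissionError(f"{lane.value} lane may not use {backend}")
```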
---
## Network Topology
### WireGuard Hub-and-Spoke
- **Hub:** CHR01 (MikroTik Cloud Hosted Router)
- **Spokes:** VPS01, Mac workstation, Z640 (VLAN 60), NUC, client office connections
- **SSTP tunnels:** Client MikroTik routers (RouterOS v6) connect via SSTP to CHR01. This enables Winbox/SSH access to client routers and WireGuard-forwarded connectivity for client site access (e.g. POS RDP paths).
```mermaid
graph TB
subgraph Cloud["Cloud / VPS Layer"]
CHR01["CHR01\nMikroTik Cloud Hosted Router\nWireGuard hub · SSTP hub"]
VPS01["VPS01\nvps01.yourdomain.com\nForgejo · Caddy · Monitoring\nwg0: 10.0.12.20\nwg1: 10.33.33.1"]
CHR02["CHR02\n(Planned failover hub)"]
end
subgraph HomeLab["Home Lab"]
Z640["HP Z640\nProxmox VE\nOllama · RTX 3060\nVLAN 60 / 10.42.60.x"]
NUC["Intel NUC\nProxmox Backup Server\nVPS recovery lab"]
end
subgraph Operator["Operator"]
MAC["MacBook Pro M5\nClaude Code devcontainer\nLocal Ollama (privacy)"]
end
subgraph Clients["Client Sites (SSTP)"]
CR1["[MSP Client A]\nmultiple sites\nMikroTik RouterOS v6"]
CRN["Other client routers\n(MikroTik RouterOS v6)"]
end
CHR01 <-->|"WireGuard spoke"| VPS01
CHR01 <-->|"WireGuard spoke"| Z640
CHR01 <-->|"WireGuard spoke"| NUC
CHR01 <-->|"WireGuard spoke"| MAC
CHR01 <-->|"SSTP tunnel"| CR1
CHR01 <-->|"SSTP tunnel"| CRN
CHR01 -.->|"Planned failover"| CHR02
style CHR01 fill:#1a3a2a,color:#86efac
style CHR02 fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
style VPS01 fill:#1e3a5f,color:#93c5fd
```
> **Open decision — OL-CHR-FAILOVER:** CHR01 is the single WireGuard and SSTP hub.
> Three failover options under consideration:
>
> 1. Keep CHR01 + add CHR02 failover (same cloud provider risk remains)
> 2. Move all tunnels to VPS01 (VPS01 becomes single point of failure for both services and tunnels)
> 3. **Split roles:** CHR01 handles client SSTP, VPS01 handles home lab WireGuard
### VPS Dual-Plane WireGuard Design
VPS01 runs two WireGuard interfaces:
| Interface | Subnet | Purpose |
|-----------|--------|---------|
| `wg0` | 10.0.12.x | Infrastructure lane — router reachability, SNMP polling, inter-site monitoring |
| `wg1` | 10.33.33.x | Operator/client access — private access to platform services, internal DNS |
Split DNS via Unbound on `wg1`: all `*.yourdomain.com` resolves over WireGuard.
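
The dual-plane split can be expressed as a simple address lookup, which may help when labelling monitoring targets or auditing firewall rules. This is a minimal sketch: the `/24` prefix lengths are an assumption (the table only gives `10.0.12.x` and `10.33.33.x`), and `plane_for` is a hypothetical helper name.

```python
import ipaddress
from typing import Optional

# Assumption: each plane is a /24; the table only states 10.0.12.x and 10.33.33.x.
PLANES = {
    "wg0": ipaddress.ip_network("10.0.12.0/24"),   # infrastructure lane
    "wg1": ipaddress.ip_network("10.33.33.0/24"),  # operator/client access
}


def plane_for(address: str) -> Optional[str]:
    """Return the WireGuard interface whose subnet contains the address, if any."""
    ip = ipaddress.ip_address(address)
    for name, network in PLANES.items():
        if ip in network:
            return name
    return None
```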
---
## VLAN Scheme
- **Supernet:** `10.42.0.0/16`
- **Convention:** VLAN ID = third octet. VLAN 60 = `10.42.60.0/24`.
| VLAN | Name | Subnet | Key hosts |
|------|------|--------|-----------|
| 60 | AI-Agents | 10.42.60.0/24 | `trillian` (VMID 112) — Open WebUI + Ollama LXC |
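
The "VLAN ID = third octet" convention is mechanical enough to compute in both directions. The sketch below uses Python's standard `ipaddress` module; the helper names `vlan_subnet` and `vlan_for` are hypothetical. Note that the convention only covers VLAN IDs that fit in one octet (1 to 255 here).

```python
import ipaddress

SUPERNET = ipaddress.ip_network("10.42.0.0/16")


def vlan_subnet(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a VLAN ID to its /24 under the supernet (VLAN ID = third octet)."""
    if not 1 <= vlan_id <= 255:
        raise ValueError("convention only covers VLAN IDs 1-255")
    return ipaddress.ip_network(f"10.42.{vlan_id}.0/24")


def vlan_for(address: str) -> int:
    """Recover the VLAN ID from a host address inside the supernet."""
    ip = ipaddress.ip_address(address)
    if ip not in SUPERNET:
        raise ValueError(f"{address} is outside {SUPERNET}")
    return ip.packed[2]  # third octet
```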
---
## VPS Platform Services
| Service | Status | Notes |
|---------|--------|-------|
| Caddy | Live | All vhosts deployed |
| WireGuard | Live | Dual-plane: wg0 infrastructure, wg1 operator/client |
| Forgejo | Live | `git.yourdomain.com` |
| Prometheus | Live | Scraping node + SNMP targets |
| Grafana | Live | `monitoring.yourdomain.com` |
| Alertmanager | Live | `alerts.yourdomain.com` — email pipeline confirmed (DKIM/SPF/DMARC pass) |
| snmp_exporter | Live | MikroTik hAP ax³ active target |
| node_exporter | Live | Native — VPS host metrics |
| Loki | Live | VPS01, 30-day retention |
| Grafana Alloy | Live | Shipping systemd journal + Docker logs to Loki |
| Restic | Partial | Backup targets exist — restore validation not yet complete |
| Business control plane | Direction reset | CRM, ticketing, PM selection reopened |
| FreeRADIUS captive portal | Not started | Phase 5 |
| FastAPI alert → ticket bridge | Not started | Depends on control-plane API design |
---
## AI Stack
```mermaid
graph TD
subgraph Users["Operator Access"]
MAC_UI["Mac — browser tab\nor home screen web app"]
PHONE["iPhone — WireGuard VPN\nbrowser shortcut"]
TABLET["Android tablet\n(Phase Later)"]
end
subgraph OpenWebUI["Open WebUI (trillian / VLAN 60)"]
direction TB
OW["Open WebUI\nSelf-hosted Docker container\nCaddy reverse proxy"]
end
subgraph ModelBackends["Model Backends"]
OLLAMA["Ollama\nRTX 3060 / 12 GB VRAM\nPrivate + enterprise lanes\n7B–13B class models"]
CLAUDE_API["Claude API\nAnthropic Console account\nEnterprise + personal project lanes"]
MAC_OLLAMA["Mac Ollama\nApple M5 / 24 GB unified\nPortable private inference"]
end
MAC_UI --> OW
PHONE --> OW
TABLET -.-> OW
OW --> OLLAMA
OW --> CLAUDE_API
OW -.-> MAC_OLLAMA
style OW fill:#1a3a2a,color:#86efac
style OLLAMA fill:#1e3a5f,color:#93c5fd
style CLAUDE_API fill:#3b1f6e,color:#c4b5fd
style MAC_OLLAMA fill:#3f2a10,color:#fdba74
```
### Model Routing Summary
| Use case | Model | Lane |
|----------|-------|------|
| Client data processing | Claude API | Enterprise |
| Business documentation | Claude API | Enterprise |
| Personal project work | Claude API or Gemini API | Personal projects |
| Private/personal queries | Local Ollama only | Private |
| Financial or health queries | Local Ollama only | Private — hard requirement |
---
## AgentLab Orchestrator
The orchestrator is a separate coordination layer above Open WebUI. It routes work between
Claude Code, Codex CLI, and Gemini CLI and manages the multi-agent session substrate.
> **Current status:** Orchestrator Phase 1 is substantially complete. Full deployment is shelved until the Z640 rebuild is complete.
```mermaid
graph TD
subgraph Orchestrator["AgentLab Orchestrator (Therapon)"]
SUPER["Supervisor\nClaude — agent branch\norchestrator.py"]
PLAN["Planner\nTask decomposition"]
WORK_C["Worker — Claude Code"]
WORK_X["Worker — Codex CLI\ncodex branch / worktree"]
WORK_G["Worker — Gemini CLI\ngemini branch / worktree"]
RESEARCH["Researcher\n(active)"]
VERIFY["Verifier\n(stub — not wired)"]
PRIV["Private-lane worker\n(stub — pending)"]
end
OP["Operator\niTerm2 + orchestrator profile"]
OP -->|"Prompt via orchestrator profile"| SUPER
SUPER --> PLAN
PLAN --> WORK_C
PLAN --> WORK_X
PLAN --> WORK_G
PLAN --> RESEARCH
PLAN -.->|"pending"| VERIFY
PLAN -.->|"pending"| PRIV
style SUPER fill:#1a3a2a,color:#86efac
style VERIFY fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
style PRIV fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
```
### Multi-Agent Git Model
| Branch | Owner | Purpose |
|--------|-------|---------|
| `main` | Human only | Production — never commit directly |
| `agent` | Claude (Supervisor) | Claude's working branch |
| `codex` | Codex CLI | Codex's working branch |
| `gemini` | Gemini CLI | Gemini's working branch |
Human promotes `agent` → `main` via `./tools/promote.sh <tag>` after review.
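
The ownership rules in the table could be checked by a pre-commit or CI guard. The sketch below is an assumption about how such a guard might look, not a description of `promote.sh` or any existing hook; the actor labels (`human`, `claude`, `codex`, `gemini`) and both function names are hypothetical.

```python
# Branch -> owning actor, copied from the table above (actor labels are hypothetical).
BRANCH_OWNERS = {
    "main": "human",
    "agent": "claude",
    "codex": "codex",
    "gemini": "gemini",
}


def may_commit(actor: str, branch: str) -> bool:
    """Each agent commits only to its own branch; nobody commits to main directly."""
    if branch == "main":
        return False
    return BRANCH_OWNERS.get(branch) == actor


def may_promote(actor: str, source: str, target: str) -> bool:
    """Only the human promotes, and only from agent to main."""
    return actor == "human" and source == "agent" and target == "main"
```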
---
## Chat and Voice Access
| Device | Interface | Notes |
|--------|-----------|-------|
| Mac | Open WebUI — browser tab or home screen web app | Primary |
| iPhone | Open WebUI — browser shortcut, WireGuard VPN on | VPN required |
| Android tablet | Open WebUI — Phase Later | Planned |
### Voice Input
| Device | Tool | Status |
|--------|------|--------|
| Mac | SuperWhisper — local Parakeet model via WhisperKit, Apple Neural Engine | Active — all audio on-device |
| iPhone | Needs research — Apple Dictation is insufficient | Open |
---
## Backup and Recovery
| Layer | Target | Status |
|-------|--------|--------|
| Local Restic | VPS disk — fast-restore cache | Exists — restore validation not complete |
| Offsite | Cloudflare R2 Standard | Active — intended primary DR copy |
| Third target (3-2-1 completion) | Home lab (Z640/NUC) | Not active — blocked by Z640 rebuild |
> **True 3-2-1 is not yet complete.** R2 is the current disaster-recovery copy.
> Restore path has not been validated end-to-end.
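
The 3-2-1 status above can be tracked as data rather than prose. The sketch below is a simplified model under stated assumptions: it counts active backup targets the way the table does, treats "two media" as two distinct locations, and treats "offsite" as anything outside the VPS. The class and function names are hypothetical. It does not cover restore validation, which remains a separate gap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupTarget:
    name: str
    location: str  # failure domain the copy lives in
    offsite: bool  # assumption: "offsite" means outside the VPS
    active: bool


def three_two_one_gaps(targets: list[BackupTarget]) -> list[str]:
    """Return unmet 3-2-1 conditions; an empty list means the rule holds."""
    active = [t for t in targets if t.active]
    gaps = []
    if len(active) < 3:
        gaps.append(f"need 3 active copies, have {len(active)}")
    if len({t.location for t in active}) < 2:
        gaps.append("need copies in at least 2 locations")
    if not any(t.offsite for t in active):
        gaps.append("need at least 1 offsite copy")
    return gaps


# Status values taken from the table above.
CURRENT = [
    BackupTarget("local-restic", "vps-disk", offsite=False, active=True),
    BackupTarget("cloudflare-r2", "r2", offsite=True, active=True),
    BackupTarget("home-lab", "home-lab", offsite=True, active=False),
]
```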
---
## Open Architectural Decisions
| Topic | Summary | Status |
|-------|---------|--------|
| WireGuard hub failover | CHR01 is a single hub. Three options (CHR02 same-provider, move to VPS01, split roles). No decision made. | Open |
| VPS02 warm standby | Manual failover only. Sync method undefined. | Open |
| Three-lane model enforcement | Policy concept — not technically enforced. Routing and retention boundaries undefined. | Open |
| Business control-plane architecture | CRM, ticketing, PM selection reopened. Odoo not a settled answer. GLPI in scope for ticketing. Target: dashboard over multiple best-fit tools, not a monolithic app. | Open |
| Cross-agent verification loop | One agent answers → second independently verifies → discrepancies surface before operator acts. | Not started |
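
The cross-agent verification loop has not been started, so the following is only a sketch of its shape. Agents are modelled as plain callables, the comparison is a naive normalised string match, and every name (`Verdict`, `cross_verify`, `same`) is hypothetical. The key property is that the verifier receives the question but never the first agent's answer.

```python
from typing import Callable, NamedTuple

Agent = Callable[[str], str]  # takes a question, returns an answer


class Verdict(NamedTuple):
    answer: str
    check: str
    agrees: bool


def _normalised_equal(a: str, b: str) -> bool:
    """Naive comparison; a real loop would need something more robust."""
    return a.strip().lower() == b.strip().lower()


def cross_verify(
    question: str,
    answerer: Agent,
    verifier: Agent,
    same: Callable[[str, str], bool] = _normalised_equal,
) -> Verdict:
    """Ask two agents independently and flag any discrepancy for the operator."""
    answer = answerer(question)
    check = verifier(question)  # independent: never sees the first answer
    return Verdict(answer, check, same(answer, check))
```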
---
## Security Boundaries
| Boundary | Mechanism |
|----------|-----------|
| All external service access | WireGuard VPN required — no public-facing admin interfaces |
| Secrets management | Ansible vault — never in git |
| Agent execution boundary | No agent controls production execution directly — human Terminal gate for all Ansible runs |
| Private lane data | Never routed to frontier APIs — local Ollama only |
| SSH keys | Mac host only — not mounted in container |