portfolio/agentlab/architecture.md
AgentLab d5ef629a54 feat: initial AgentLab portfolio content
Architecture, overview, homelab build plan, agent handbook, ADRs,
and agent operating rules. All sensitive operational details sanitized
(real IPs, hostnames, client names replaced with generic placeholders).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-03-31 04:52:42 +00:00


AgentLab System Architecture

Scope: Full system map — hardware, networking, AI stack, three-lane operating model, and open architectural decisions.



Hardware Inventory

| Node | Type | Status | Role |
|---|---|---|---|
| MacBook Pro M5 24 GB | Operator workstation | Active | Primary operator machine, Claude Code devcontainer, local Ollama (privacy), portable to client sites |
| HP Z640 64 GB RAM | Home server | Needs rebuild | Proxmox VE host — agent stack, media server, persistent Ollama inference (RTX 3060) |
| Intel NUC (home office) | Small server | Active | Proxmox Backup Server, VPS recovery testing lab |
| CHR01 | MikroTik Cloud Hosted Router | Active | WireGuard hub (all home lab spokes), SSTP hub for client MikroTik routers |
| CHR02 | MikroTik Cloud Hosted Router | Planned | Failover hub — design undefined |
| VPS01 | VPS | Active | Forgejo, Caddy reverse proxy, monitoring stack, WireGuard spoke |
| VPS02 | VPS | Partial | Warm standby for VPS01 — manual failover only, sync method undefined |
| Client routers | MikroTik RouterOS v6 | Active | SSTP tunnels to CHR01 ([MSP Client A] — multiple sites) |
| Android tablet | Mobile | Phase Later | Dashboard, demo device |

MacBook Pro M5 — Workstation Baseline

| Item | Value |
|---|---|
| Chip | Apple M5 |
| CPU cores | 10 total — 4 performance, 6 efficiency |
| Unified memory | 24 GB |

HP Z640 — GPU Inventory

| GPU | VRAM | Role |
|---|---|---|
| Intel Arc A310 (Sparkle) | 4 GB | Jellyfin VA-API transcoding |
| EVGA GeForce RTX 3060 XC 12 GB GDDR6 | 12 GB | Ollama local LLM inference |

Intel NUC — Home Office Node

| Item | Value |
|---|---|
| CPU | Intel i7 10th gen (~10 cores) |
| RAM | 8 GB (upgradeable) |
| Storage | 500 GB SSD + 2× 2 TB SSDs |
| Role | Proxmox Backup Server + VPS recovery lab |

Proxmox Node Naming (Hitchhiker's Guide Theme)

Existing nodes: deep-thought, ford, marvin, zaphod, slartibartfast, hactar, magrathea

Planned: trillian (VMID 112) — Open WebUI + Ollama LXC on VLAN 60


Three-Lane Operating Model

All three lanes require access to a local model. Sensitive or private data must never leave the device.

| Lane | Purpose | Primary devices | Frontier API allowed | Local model requirement |
|---|---|---|---|---|
| Enterprise / MSP | ScreamSaver IT + AI-native MSP brand. Client data cleanup, CSV processing, password review, business docs | Mac + Z640 | Claude (Anthropic API) | Required — sensitive client data |
| Personal projects | Creator work | Mac + Z640 | Claude + Gemini | Required — privacy |
| Private / personal | Private inquiries, financial, health | Mac + Z640 | None | Hard requirement — no frontier API |

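Practical implementation: Open WebUI (self-hosted) with per-conversation model switching. Local Ollama serves the private lane and stays available to the other two. The Claude API serves the enterprise and personal project lanes, and the Gemini API is additionally allowed for personal projects.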

graph TD
    subgraph Lanes["Three-Lane Operating Model"]
        direction TB
        L1["Enterprise / MSP Lane\nClient data · CSV · business docs"]
        L2["Personal Projects Lane\nCreator work · experiments"]
        L3["Private / Personal Lane\nFinancial · health · personal"]
    end

    subgraph Models["Model Routing"]
        FM1["Claude API\n(Anthropic Console)"]
        FM2["Gemini API"]
        LM["Local Ollama\nRTX 3060 / Mac Apple Silicon"]
    end

    L1 -->|"Allowed"| FM1
    L2 -->|"Allowed"| FM1
    L2 -->|"Allowed"| FM2
    L3 -->|"Hard block — no frontier"| LM
    L1 -->|"Required"| LM
    L2 -->|"Required"| LM

    style L3 fill:#7f1d1d,color:#fca5a5
    style LM fill:#1e3a5f,color:#93c5fd

Open decision — OL-THREE-LANE-FORMALISE: The three-lane model is a policy concept. It is not yet technically enforced. Routing, retention boundaries, and migration to permanent hardware are still undefined.
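Since enforcement is still an open decision, the following is only a minimal sketch of what a lane allow-list check could look like, assuming the lane table above is the policy source. The backend labels (`local-ollama`, `claude-api`, `gemini-api`) and the `check_route` helper are hypothetical names chosen for illustration; they are not Open WebUI or Ollama identifiers.

```python
from enum import Enum


class Lane(Enum):
    ENTERPRISE = "enterprise"
    PERSONAL_PROJECTS = "personal_projects"
    PRIVATE = "private"


# Hypothetical backend labels, not real Open WebUI or Ollama identifiers.
LOCAL = "local-ollama"
CLAUDE = "claude-api"
GEMINI = "gemini-api"

# Allow-list per lane, transcribed from the lane table.
ALLOWED_BACKENDS = {
    Lane.ENTERPRISE: {LOCAL, CLAUDE},
    Lane.PERSONAL_PROJECTS: {LOCAL, CLAUDE, GEMINI},
    Lane.PRIVATE: {LOCAL},  # hard requirement: no frontier API
}


def check_route(lane: Lane, backend: str) -> None:
    """Raise PermissionError if the backend is not allowed for the lane."""
    if backend not in ALLOWED_BACKENDS[lane]:
        raise PermissionError(f"{backend} is not allowed in the {lane.value} lane")


check_route(Lane.ENTERPRISE, CLAUDE)  # allowed, returns silently
try:
    check_route(Lane.PRIVATE, CLAUDE)
except PermissionError as err:
    print(err)  # claude-api is not allowed in the private lane
```

A deny-by-default allow-list keeps the private lane safe if a new backend is added later: nothing is reachable from a lane until it is listed explicitly.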


Network Topology

WireGuard Hub-and-Spoke

Hub: CHR01 (MikroTik Cloud Hosted Router)

Spokes: VPS01, Mac workstation, Z640 (VLAN 60), NUC, client office connections

SSTP tunnels: Client MikroTik routers (RouterOS v6) connect via SSTP to CHR01. This enables Winbox/SSH access to client routers and WireGuard-forwarded connectivity for client site access (e.g. POS RDP paths).

graph TB
    subgraph Cloud["Cloud / VPS Layer"]
        CHR01["CHR01\nMikroTik Cloud Hosted Router\nWireGuard hub · SSTP hub"]
        VPS01["VPS01\nvps01.yourdomain.com\nForgejo · Caddy · Monitoring\nwg0: 10.0.12.20\nwg1: 10.33.33.1"]
        CHR02["CHR02\n(Planned failover hub)"]
    end

    subgraph HomeLab["Home Lab"]
        Z640["HP Z640\nProxmox VE\nOllama · RTX 3060\nVLAN 60 / 10.42.60.x"]
        NUC["Intel NUC\nProxmox Backup Server\nVPS recovery lab"]
    end

    subgraph Operator["Operator"]
        MAC["MacBook Pro M5\nClaude Code devcontainer\nLocal Ollama (privacy)"]
    end

    subgraph Clients["Client Sites (SSTP)"]
        CR1["[MSP Client A]\nmultiple sites\nMikroTik RouterOS v6"]
        CRN["Other client routers\n(MikroTik RouterOS v6)"]
    end

    CHR01 <-->|"WireGuard spoke"| VPS01
    CHR01 <-->|"WireGuard spoke"| Z640
    CHR01 <-->|"WireGuard spoke"| NUC
    CHR01 <-->|"WireGuard spoke"| MAC
    CHR01 <-->|"SSTP tunnel"| CR1
    CHR01 <-->|"SSTP tunnel"| CRN
    CHR01 -.->|"Planned failover"| CHR02

    style CHR01 fill:#1a3a2a,color:#86efac
    style CHR02 fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
    style VPS01 fill:#1e3a5f,color:#93c5fd

Open decision — OL-CHR-FAILOVER: CHR01 is the single WireGuard and SSTP hub. Three failover options under consideration:

  1. Keep CHR01 + add CHR02 failover (same cloud provider risk remains)
  2. Move all tunnels to VPS01 (VPS01 becomes single point of failure for both services and tunnels)
  3. Split roles: CHR01 handles client SSTP, VPS01 handles home lab WireGuard

VPS Dual-Plane WireGuard Design

VPS01 runs two WireGuard interfaces:

| Interface | Subnet | Purpose |
|---|---|---|
| wg0 | 10.0.12.x | Infrastructure lane — router reachability, SNMP polling, inter-site monitoring |
| wg1 | 10.33.33.x | Operator/client access — private access to platform services, internal DNS |

Split DNS via Unbound on wg1: all *.yourdomain.com names resolve over WireGuard.
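To make the two planes concrete, here is a small sketch that classifies an address by the plane it belongs to. It assumes both subnets are /24 networks, which the source does not state (it gives only `10.0.12.x` and `10.33.33.x`). The `plane_for` helper is a hypothetical name.

```python
import ipaddress
from typing import Optional

# Assumption: both planes are /24; the source only gives "10.0.12.x" and "10.33.33.x".
PLANES = {
    "wg0": ipaddress.ip_network("10.0.12.0/24"),   # infrastructure lane
    "wg1": ipaddress.ip_network("10.33.33.0/24"),  # operator/client access
}


def plane_for(address: str) -> Optional[str]:
    """Return the interface whose subnet contains the address, else None."""
    ip = ipaddress.ip_address(address)
    for interface, network in PLANES.items():
        if ip in network:
            return interface
    return None


print(plane_for("10.0.12.20"))   # wg0
print(plane_for("10.33.33.1"))   # wg1
print(plane_for("10.42.60.10"))  # None (home lab VLAN, not a VPS01 plane)
```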


VLAN Scheme

Supernet: 10.42.0.0/16

Convention: VLAN ID = third octet. VLAN 60 = 10.42.60.0/24.

| VLAN | Name | Subnet | Key hosts |
|---|---|---|---|
| 60 | AI-Agents | 10.42.60.0/24 | trillian (VMID 112) — Open WebUI + Ollama LXC |
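The "VLAN ID = third octet" convention can be expressed as a one-line mapping. The sketch below uses the standard-library `ipaddress` module; the `vlan_subnet` helper is a hypothetical name, and the 0 to 255 limit is a consequence of the convention rather than something the source states.

```python
import ipaddress

SUPERNET = ipaddress.ip_network("10.42.0.0/16")


def vlan_subnet(vlan_id: int) -> ipaddress.IPv4Network:
    """Map a VLAN ID to its /24 under the 'VLAN ID = third octet' convention."""
    if not 0 <= vlan_id <= 255:
        # Only IDs that fit in a single octet can follow the convention.
        raise ValueError("VLAN ID must be between 0 and 255 for this scheme")
    subnet = ipaddress.ip_network(f"10.42.{vlan_id}.0/24")
    if not subnet.subnet_of(SUPERNET):
        raise ValueError(f"{subnet} is outside {SUPERNET}")
    return subnet


print(vlan_subnet(60))  # 10.42.60.0/24
```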

VPS Platform Services

| Service | Status | Notes |
|---|---|---|
| Caddy | Live | All vhosts deployed |
| WireGuard | Live | Dual-plane: wg0 infrastructure, wg1 operator/client |
| Forgejo | Live | git.yourdomain.com |
| Prometheus | Live | Scraping node + SNMP targets |
| Grafana | Live | monitoring.yourdomain.com |
| Alertmanager | Live | alerts.yourdomain.com — email pipeline confirmed (DKIM/SPF/DMARC pass) |
| snmp_exporter | Live | MikroTik hAP ax³ active target |
| node_exporter | Live | Native — VPS host metrics |
| Loki | Live | VPS01, 30-day retention |
| Grafana Alloy | Live | Shipping systemd journal + Docker logs to Loki |
| Restic | Partial | Backup targets exist — restore validation not yet complete |
| Business control plane | Direction reset | CRM, ticketing, PM selection reopened |
| FreeRADIUS captive portal | Not started | Phase 5 |
| FastAPI alert → ticket bridge | Not started | Depends on control-plane API design |

AI Stack

graph TD
    subgraph Users["Operator Access"]
        MAC_UI["Mac — browser tab\nor home screen web app"]
        PHONE["iPhone — WireGuard VPN\nbrowser shortcut"]
        TABLET["Android tablet\n(Phase Later)"]
    end

    subgraph OpenWebUI["Open WebUI (trillian / VLAN 60)"]
        direction TB
        OW["Open WebUI\nSelf-hosted Docker container\nCaddy reverse proxy"]
    end

    subgraph ModelBackends["Model Backends"]
        OLLAMA["Ollama\nRTX 3060 / 12 GB VRAM\nPrivate + enterprise lanes\n7B-13B class models"]
        CLAUDE_API["Claude API\nAnthropic Console account\nEnterprise + personal project lanes"]
        MAC_OLLAMA["Mac Ollama\nApple M5 / 24 GB unified\nPortable private inference"]
    end

    MAC_UI --> OW
    PHONE --> OW
    TABLET -.-> OW

    OW --> OLLAMA
    OW --> CLAUDE_API
    OW -.-> MAC_OLLAMA

    style OW fill:#1a3a2a,color:#86efac
    style OLLAMA fill:#1e3a5f,color:#93c5fd
    style CLAUDE_API fill:#3b1f6e,color:#c4b5fd
    style MAC_OLLAMA fill:#3f2a10,color:#fdba74

Model Routing Summary

| Use case | Model | Lane |
|---|---|---|
| Client data processing | Claude API | Enterprise |
| Business documentation | Claude API | Enterprise |
| Personal project work | Claude API or Gemini API | Personal projects |
| Private/personal queries | Local Ollama only | Private |
| Financial or health queries | Local Ollama only | Private — hard requirement |

AgentLab Orchestrator

The orchestrator is a separate coordination layer above Open WebUI. It routes work between Claude Code, Codex CLI, and Gemini CLI and manages the multi-agent session substrate.

Current status: Orchestrator Phase 1 is substantially complete. Full deployment is shelved until the Z640 rebuild is complete.

graph TD
    subgraph Orchestrator["AgentLab Orchestrator (Therapon)"]
        SUPER["Supervisor\nClaude — agent branch\norchestrator.py"]
        PLAN["Planner\nTask decomposition"]
        WORK_C["Worker — Claude Code"]
        WORK_X["Worker — Codex CLI\ncodex branch / worktree"]
        WORK_G["Worker — Gemini CLI\ngemini branch / worktree"]
        RESEARCH["Researcher\n(active)"]
        VERIFY["Verifier\n(stub — not wired)"]
        PRIV["Private-lane worker\n(stub — pending)"]
    end

    OP["Operator\niTerm2 + orchestrator profile"]

    OP -->|"Prompt via orchestrator profile"| SUPER
    SUPER --> PLAN
    PLAN --> WORK_C
    PLAN --> WORK_X
    PLAN --> WORK_G
    PLAN --> RESEARCH
    PLAN -.->|"pending"| VERIFY
    PLAN -.->|"pending"| PRIV

    style SUPER fill:#1a3a2a,color:#86efac
    style VERIFY fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
    style PRIV fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5

Multi-Agent Git Model

| Branch | Owner | Purpose |
|---|---|---|
| main | Human only | Production — never commit directly |
| agent | Claude (Supervisor) | Claude's working branch |
| codex | Codex CLI | Codex's working branch |
| gemini | Gemini CLI | Gemini's working branch |

A human promotes agent → main via ./tools/promote.sh <tag> after review.
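The ownership rules above reduce to a small lookup. The sketch below is illustrative only: the actor labels and the `may_commit` helper are hypothetical, and it does not describe what `promote.sh` does internally.

```python
# Branch ownership, transcribed from the branch table.
# Actor labels are hypothetical identifiers chosen for this sketch.
BRANCH_OWNERS = {
    "main": "human",
    "agent": "claude",
    "codex": "codex-cli",
    "gemini": "gemini-cli",
}


def may_commit(actor: str, branch: str) -> bool:
    """Return True only if the actor owns the branch and it is not main."""
    if branch == "main":
        # main is never committed to directly; it changes only via promotion.
        return False
    return BRANCH_OWNERS.get(branch) == actor


print(may_commit("claude", "agent"))     # True
print(may_commit("codex-cli", "agent"))  # False
print(may_commit("human", "main"))       # False
```

A check like this could sit in a commit hook or in the supervisor before it hands a worktree to a worker, so an agent cannot write to another agent's branch by accident.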


Chat and Voice Access

| Device | Interface | Notes |
|---|---|---|
| Mac | Open WebUI — browser tab or home screen web app | Primary |
| iPhone | Open WebUI — browser shortcut, WireGuard VPN on | VPN required |
| Android tablet | Open WebUI — Phase Later | Planned |

Voice Input

| Device | Tool | Status |
|---|---|---|
| Mac | SuperWhisper — local Parakeet model via WhisperKit, Apple Neural Engine | Active — all audio on-device |
| iPhone | Needs research — Apple Dictate is insufficient | Open |

Backup and Recovery

| Layer | Target | Status |
|---|---|---|
| Local Restic | VPS disk — fast-restore cache | Exists — restore validation not complete |
| Offsite | Cloudflare R2 Standard | Active — intended primary DR copy |
| Third target (3-2-1 completion) | Home lab (Z640/NUC) | Not active — blocked by Z640 rebuild |

True 3-2-1 is not yet complete. R2 is the current disaster-recovery copy. Restore path has not been validated end-to-end.
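The status above can be checked mechanically. The sketch below applies a deliberately strict reading of 3-2-1: at least three usable copies, on at least two kinds of storage, with at least one offsite, where "usable" means active and restore-validated. Two things are assumptions of this sketch, not statements from the source: counting only backup targets, and the `medium` labels. The `BackupCopy` and `three_two_one` names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str              # assumption: coarse label for the storage type
    offsite: bool
    active: bool
    restore_validated: bool


def three_two_one(copies: List[BackupCopy]) -> bool:
    """Strict 3-2-1: count only copies that are active and restore-validated."""
    usable = [c for c in copies if c.active and c.restore_validated]
    return (
        len(usable) >= 3
        and len({c.medium for c in usable}) >= 2
        and any(c.offsite for c in usable)
    )


# Flags follow the backup table; the medium labels are illustrative.
COPIES = [
    BackupCopy("local-restic", "vps-disk", offsite=False, active=True, restore_validated=False),
    BackupCopy("cloudflare-r2", "object-storage", offsite=True, active=True, restore_validated=False),
    BackupCopy("home-lab", "home-server-disk", offsite=True, active=False, restore_validated=False),
]

print(three_two_one(COPIES))  # False
```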


Open Architectural Decisions

| Topic | Summary | Status |
|---|---|---|
| WireGuard hub failover | CHR01 is a single hub. Three options (CHR02 same-provider, move to VPS01, split roles). No decision made. | Open |
| VPS02 warm standby | Manual failover only. Sync method undefined. | Open |
| Three-lane model enforcement | Policy concept — not technically enforced. Routing and retention boundaries undefined. | Open |
| Business control-plane architecture | CRM, ticketing, PM selection reopened. Odoo not a settled answer. GLPI in scope for ticketing. Target: dashboard over multiple best-fit tools, not a monolithic app. | Open |
| Cross-agent verification loop | One agent answers → second independently verifies → discrepancies surface before operator acts. | Not started |
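The cross-agent verification loop has not been started, so the following is only a sketch of the control flow it describes. It assumes each agent can be wrapped as a function from prompt to answer. The `Agent` type, `Verdict`, and `cross_verify` are hypothetical names, and the whitespace- and case-insensitive string match stands in for whatever comparison a real verifier would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: an agent takes a prompt and returns an answer.
Agent = Callable[[str], str]


@dataclass
class Verdict:
    answer: str
    check: str
    agrees: bool


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def cross_verify(prompt: str, primary: Agent, verifier: Agent) -> Verdict:
    answer = primary(prompt)
    # The verifier receives only the prompt, never the first answer,
    # so its result is independent.
    check = verifier(prompt)
    return Verdict(answer, check, normalise(answer) == normalise(check))


# Stub agents stand in for real CLI workers.
verdict = cross_verify("2 + 2?", lambda p: "4", lambda p: " 4 ")
print(verdict.agrees)  # True
```

When `agrees` is false, both answers are kept on the `Verdict` so the operator can see the discrepancy before acting.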

Security Boundaries

| Boundary | Mechanism |
|---|---|
| All external service access | WireGuard VPN required — no public-facing admin interfaces |
| Secrets management | Ansible vault — never in git |
| Agent execution boundary | No agent controls production execution directly — human Terminal gate for all Ansible runs |
| Private lane data | Never routed to frontier APIs — local Ollama only |
| SSH keys | Mac host only — not mounted in container |