# AgentLab System Architecture

> **Scope:** Full system map — hardware, networking, AI stack, three-lane operating model,
> and open architectural decisions.

---

## Contents

- [Hardware Inventory](#hardware-inventory)
- [Three-Lane Operating Model](#three-lane-operating-model)
- [Network Topology](#network-topology)
- [VLAN Scheme](#vlan-scheme)
- [VPS Platform Services](#vps-platform-services)
- [AI Stack](#ai-stack)
- [AgentLab Orchestrator](#agentlab-orchestrator)
- [Chat and Voice Access](#chat-and-voice-access)
- [Backup and Recovery](#backup-and-recovery)
- [Open Architectural Decisions](#open-architectural-decisions)
- [Security Boundaries](#security-boundaries)

---

## Hardware Inventory

| Node | Type | Status | Role |
|------|------|--------|------|
| MacBook Pro M5 24 GB | Operator workstation | Active | Primary operator machine, Claude Code devcontainer, local Ollama (privacy), portable to client sites |
| HP Z640 64 GB RAM | Home server | Needs rebuild | Proxmox VE host — agent stack, media server, persistent Ollama inference (RTX 3060) |
| Intel NUC (home office) | Small server | Active | Proxmox Backup Server, VPS recovery testing lab |
| CHR01 | MikroTik Cloud Hosted Router | Active | WireGuard hub (all home lab spokes), SSTP hub for client MikroTik routers |
| CHR02 | MikroTik Cloud Hosted Router | Planned | Failover hub — design undefined |
| VPS01 | VPS | Active | Forgejo, Caddy reverse proxy, monitoring stack, WireGuard spoke |
| VPS02 | VPS | Partial | Warm standby for VPS01 — manual failover only, sync method undefined |
| Client routers | MikroTik RouterOS v6 | Active | SSTP tunnels to CHR01 ([MSP Client A] — multiple sites) |
| Android tablet | Mobile | Phase Later | Dashboard, demo device |

### MacBook Pro M5 — Workstation Baseline

| Item | Value |
|------|-------|
| Chip | Apple M5 |
| CPU cores | 10 total — 4 performance, 6 efficiency |
| Unified memory | 24 GB |

### HP Z640 — GPU Inventory

| GPU | VRAM | Role |
|-----|------|------|
| Intel Arc A310 (Sparkle) | 4 GB | Jellyfin VA-API transcoding |
| EVGA GeForce RTX 3060 XC 12 GB GDDR6 | 12 GB | Ollama local LLM inference |

### Intel NUC — Home Office Node

| Item | Value |
|------|-------|
| CPU | Intel i7 10th gen (~10 cores) |
| RAM | 8 GB (upgradeable) |
| Storage | 500 GB SSD + 2× 2 TB SSDs |
| Role | Proxmox Backup Server + VPS recovery lab |

### Proxmox Node Naming (Hitchhiker's Guide Theme)

Existing nodes: `deep-thought`, `ford`, `marvin`, `zaphod`, `slartibartfast`, `hactar`, `magrathea`

Planned: `trillian` (VMID 112) — Open WebUI + Ollama LXC on VLAN 60

---

## Three-Lane Operating Model

All three lanes require access to a local model. Sensitive or private data must never leave the device.

| Lane | Purpose | Primary devices | Frontier API allowed | Local model requirement |
|------|---------|----------------|----------------------|------------------------|
| **Enterprise / MSP** | ScreamSaver IT + AI-native MSP brand. Client data cleanup, CSV processing, password review, business docs | Mac + Z640 | Claude (Anthropic API) | Required — sensitive client data |
| **Personal projects** | Creator work | Mac + Z640 | Claude + Gemini | Required — privacy |
| **Private / personal** | Private inquiries, financial, health | Mac + Z640 | None | Hard requirement — no frontier API |

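**Practical implementation:** Open WebUI (self-hosted) with per-conversation model switching.
Local Ollama is available in every lane and is the only backend for the private lane.
The Claude API serves the enterprise and personal-project lanes; the Gemini API serves the
personal-project lane only.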

```mermaid
graph TD
    subgraph Lanes["Three-Lane Operating Model"]
        direction TB
        L1["Enterprise / MSP Lane\nClient data · CSV · business docs"]
        L2["Personal Projects Lane\nCreator work · experiments"]
        L3["Private / Personal Lane\nFinancial · health · personal"]
    end

    subgraph Models["Model Routing"]
        FM1["Claude API\n(Anthropic Console)"]
        FM2["Gemini API"]
        LM["Local Ollama\nRTX 3060 / Mac Apple Silicon"]
    end

    L1 -->|"Allowed"| FM1
    L2 -->|"Allowed"| FM1
    L2 -->|"Allowed"| FM2
    L3 -->|"Local only, no frontier API"| LM
    L1 -->|"Required"| LM
    L2 -->|"Required"| LM

    style L3 fill:#7f1d1d,color:#fca5a5
    style LM fill:#1e3a5f,color:#93c5fd
```
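
The lane table and diagram above can be read as an allow-list of model backends per lane. The following is a minimal sketch of that reading, not the AgentLab implementation: the names `Lane`, `Backend`, `ALLOWED_BACKENDS`, `is_allowed`, and `require_allowed` are hypothetical and exist only for illustration. Raising an error on a blocked combination, rather than falling back to another backend, is one way to keep private-lane requests from reaching a frontier API by accident.

```python
from enum import Enum


class Lane(Enum):
    ENTERPRISE = "enterprise"
    PERSONAL_PROJECTS = "personal_projects"
    PRIVATE = "private"


class Backend(Enum):
    LOCAL_OLLAMA = "local_ollama"
    CLAUDE_API = "claude_api"
    GEMINI_API = "gemini_api"


# Allowed backends per lane, transcribed from the lane table.
ALLOWED_BACKENDS = {
    Lane.ENTERPRISE: {Backend.LOCAL_OLLAMA, Backend.CLAUDE_API},
    Lane.PERSONAL_PROJECTS: {
        Backend.LOCAL_OLLAMA,
        Backend.CLAUDE_API,
        Backend.GEMINI_API,
    },
    Lane.PRIVATE: {Backend.LOCAL_OLLAMA},  # hard requirement: no frontier API
}


def is_allowed(lane: Lane, backend: Backend) -> bool:
    """Return True if the lane policy permits this backend."""
    return backend in ALLOWED_BACKENDS[lane]


def require_allowed(lane: Lane, backend: Backend) -> None:
    """Raise PermissionError instead of silently routing to a blocked backend."""
    if not is_allowed(lane, backend):
        raise PermissionError(
            f"{backend.value} is not allowed in the {lane.value} lane"
        )
```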

> **Open decision — OL-THREE-LANE-FORMALISE:** The three-lane model is a policy
> concept. It is not yet technically enforced. Routing, retention boundaries, and migration
> to permanent hardware are still undefined.

---

## Network Topology

### WireGuard Hub-and-Spoke

**Hub:** CHR01 (MikroTik Cloud Hosted Router)

**Spokes:** VPS01, Mac workstation, Z640 (VLAN 60), NUC, client office connections

**SSTP tunnels:** Client MikroTik routers (RouterOS v6) connect via SSTP to CHR01.
These tunnels provide Winbox/SSH access to client routers and WireGuard-forwarded
connectivity for client site access (e.g. POS RDP paths).

```mermaid
graph TB
    subgraph Cloud["Cloud / VPS Layer"]
        CHR01["CHR01\nMikroTik Cloud Hosted Router\nWireGuard hub · SSTP hub"]
        VPS01["VPS01\nvps01.yourdomain.com\nForgejo · Caddy · Monitoring\nwg0: 10.0.12.20\nwg1: 10.33.33.1"]
        CHR02["CHR02\n(Planned failover hub)"]
    end

    subgraph HomeLab["Home Lab"]
        Z640["HP Z640\nProxmox VE\nOllama · RTX 3060\nVLAN 60 / 10.42.60.x"]
        NUC["Intel NUC\nProxmox Backup Server\nVPS recovery lab"]
    end

    subgraph Operator["Operator"]
        MAC["MacBook Pro M5\nClaude Code devcontainer\nLocal Ollama (privacy)"]
    end

    subgraph Clients["Client Sites (SSTP)"]
        CR1["[MSP Client A]\nmultiple sites\nMikroTik RouterOS v6"]
        CRN["Other client routers\n(MikroTik RouterOS v6)"]
    end

    CHR01 <-->|"WireGuard spoke"| VPS01
    CHR01 <-->|"WireGuard spoke"| Z640
    CHR01 <-->|"WireGuard spoke"| NUC
    CHR01 <-->|"WireGuard spoke"| MAC
    CHR01 <-->|"SSTP tunnel"| CR1
    CHR01 <-->|"SSTP tunnel"| CRN
    CHR01 -.->|"Planned failover"| CHR02

    style CHR01 fill:#1a3a2a,color:#86efac
    style CHR02 fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
    style VPS01 fill:#1e3a5f,color:#93c5fd
```

> **Open decision — OL-CHR-FAILOVER:** CHR01 is the single WireGuard and SSTP hub.
> Three failover options under consideration:
>
> 1. Keep CHR01 + add CHR02 failover (same cloud provider risk remains)
> 2. Move all tunnels to VPS01 (VPS01 becomes single point of failure for both services and tunnels)
> 3. **Split roles:** CHR01 handles client SSTP, VPS01 handles home lab WireGuard

### VPS Dual-Plane WireGuard Design

VPS01 runs two WireGuard interfaces:

| Interface | Subnet | Purpose |
|-----------|--------|---------|
| `wg0` | 10.0.12.x | Infrastructure lane — router reachability, SNMP polling, inter-site monitoring |
| `wg1` | 10.33.33.x | Operator/client access — private access to platform services, internal DNS |

Split DNS runs via Unbound on `wg1`: all `*.yourdomain.com` names resolve over WireGuard.
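
To illustrate the two-plane split, the sketch below maps an address to the WireGuard interface whose subnet contains it. It assumes both planes are `/24` networks, which the table only implies by writing `10.0.12.x` and `10.33.33.x`; the names `PLANES` and `plane_for` are hypothetical.

```python
import ipaddress
from typing import Optional

# Assumption: both planes are /24 networks. The table gives only
# "10.0.12.x" and "10.33.33.x", so the prefix length is a guess.
PLANES = {
    "wg0": ipaddress.ip_network("10.0.12.0/24"),   # infrastructure lane
    "wg1": ipaddress.ip_network("10.33.33.0/24"),  # operator/client access
}


def plane_for(address: str) -> Optional[str]:
    """Return the interface whose subnet contains the address, or None."""
    ip = ipaddress.ip_address(address)
    for interface, network in PLANES.items():
        if ip in network:
            return interface
    return None
```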

---

## VLAN Scheme

**Supernet:** `10.42.0.0/16`

**Convention:** VLAN ID = third octet. VLAN 60 = `10.42.60.0/24`.
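
The convention can be expressed as a small helper, sketched here under the assumption that only VLAN IDs 1 to 255 take part (larger IDs do not fit in one octet). The function names `vlan_subnet` and `vlan_for` are hypothetical.

```python
import ipaddress

SUPERNET = ipaddress.ip_network("10.42.0.0/16")


def vlan_subnet(vlan_id: int) -> ipaddress.IPv4Network:
    """Return the /24 for a VLAN ID, using the ID as the third octet."""
    # Assumption: only IDs that fit in a single octet follow the convention.
    if not 1 <= vlan_id <= 255:
        raise ValueError(f"VLAN ID {vlan_id} does not fit the third-octet convention")
    return ipaddress.ip_network(f"10.42.{vlan_id}.0/24")


def vlan_for(address: str) -> int:
    """Return the VLAN ID implied by an address inside the supernet."""
    ip = ipaddress.ip_address(address)
    if ip not in SUPERNET:
        raise ValueError(f"{address} is outside {SUPERNET}")
    return ip.packed[2]  # third octet
```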

| VLAN | Name | Subnet | Key hosts |
|------|------|--------|-----------|
| 60 | AI-Agents | 10.42.60.0/24 | `trillian` (VMID 112) — Open WebUI + Ollama LXC |

---

## VPS Platform Services

| Service | Status | Notes |
|---------|--------|-------|
| Caddy | Live | All vhosts deployed |
| WireGuard | Live | Dual-plane: wg0 infrastructure, wg1 operator/client |
| Forgejo | Live | `git.yourdomain.com` |
| Prometheus | Live | Scraping node + SNMP targets |
| Grafana | Live | `monitoring.yourdomain.com` |
| Alertmanager | Live | `alerts.yourdomain.com` — email pipeline confirmed (DKIM/SPF/DMARC pass) |
| snmp_exporter | Live | MikroTik hAP ax³ active target |
| node_exporter | Live | Native — VPS host metrics |
| Loki | Live | VPS01, 30-day retention |
| Grafana Alloy | Live | Shipping systemd journal + Docker logs to Loki |
| Restic | Partial | Backup targets exist — restore validation not yet complete |
| Business control plane | Direction reset | CRM, ticketing, PM selection reopened |
| FreeRADIUS captive portal | Not started | Phase 5 |
| FastAPI alert → ticket bridge | Not started | Depends on control-plane API design |

---

## AI Stack

```mermaid
graph TD
    subgraph Users["Operator Access"]
        MAC_UI["Mac — browser tab\nor home screen web app"]
        PHONE["iPhone — WireGuard VPN\nbrowser shortcut"]
        TABLET["Android tablet\n(Phase Later)"]
    end

    subgraph OpenWebUI["Open WebUI (trillian / VLAN 60)"]
        direction TB
        OW["Open WebUI\nSelf-hosted Docker container\nCaddy reverse proxy"]
    end

    subgraph ModelBackends["Model Backends"]
        OLLAMA["Ollama\nRTX 3060 / 12 GB VRAM\nPrivate + enterprise lanes\n7B–13B class models"]
        CLAUDE_API["Claude API\nAnthropic Console account\nEnterprise + personal project lanes"]
        MAC_OLLAMA["Mac Ollama\nApple M5 / 24 GB unified\nPortable private inference"]
    end

    MAC_UI --> OW
    PHONE --> OW
    TABLET -.-> OW

    OW --> OLLAMA
    OW --> CLAUDE_API
    OW -.-> MAC_OLLAMA

    style OW fill:#1a3a2a,color:#86efac
    style OLLAMA fill:#1e3a5f,color:#93c5fd
    style CLAUDE_API fill:#3b1f6e,color:#c4b5fd
    style MAC_OLLAMA fill:#3f2a10,color:#fdba74
```

### Model Routing Summary

| Use case | Model | Lane |
|----------|-------|------|
| Client data processing | Claude API | Enterprise |
| Business documentation | Claude API | Enterprise |
| Personal project work | Claude API or Gemini API | Personal projects |
| Private/personal queries | Local Ollama only | Private |
| Financial or health queries | Local Ollama only | Private — hard requirement |

---

## AgentLab Orchestrator

The orchestrator is a separate coordination layer above Open WebUI. It routes work between
Claude Code, Codex CLI, and Gemini CLI and manages the multi-agent session substrate.

> **Current status:** Orchestrator Phase 1 is substantially complete. Full deployment is shelved until the Z640 rebuild is complete.

```mermaid
graph TD
    subgraph Orchestrator["AgentLab Orchestrator (Therapon)"]
        SUPER["Supervisor\nClaude — agent branch\norchestrator.py"]
        PLAN["Planner\nTask decomposition"]
        WORK_C["Worker — Claude Code"]
        WORK_X["Worker — Codex CLI\ncodex branch / worktree"]
        WORK_G["Worker — Gemini CLI\ngemini branch / worktree"]
        RESEARCH["Researcher\n(active)"]
        VERIFY["Verifier\n(stub — not wired)"]
        PRIV["Private-lane worker\n(stub — pending)"]
    end

    OP["Operator\niTerm2 + orchestrator profile"]

    OP -->|"Prompt via orchestrator profile"| SUPER
    SUPER --> PLAN
    PLAN --> WORK_C
    PLAN --> WORK_X
    PLAN --> WORK_G
    PLAN --> RESEARCH
    PLAN -.->|"pending"| VERIFY
    PLAN -.->|"pending"| PRIV

    style SUPER fill:#1a3a2a,color:#86efac
    style VERIFY fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
    style PRIV fill:#3f2a10,color:#fdba74,stroke-dasharray: 5 5
```

### Multi-Agent Git Model

| Branch | Owner | Purpose |
|--------|-------|---------|
| `main` | Human only | Production — never commit directly |
| `agent` | Claude (Supervisor) | Claude's working branch |
| `codex` | Codex CLI | Codex's working branch |
| `gemini` | Gemini CLI | Gemini's working branch |

A human promotes `agent` → `main` via `./tools/promote.sh <tag>` after review.
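
The ownership rules in the branch table could be checked before a commit, for example from a Git hook. The sketch below is one possible reading, not the project's tooling: the actor identifiers (`"human"`, `"claude"`, `"codex"`, `"gemini"`) and the function `may_commit` are hypothetical. It treats `main` as closed to direct commits by anyone, on the assumption that changes reach it only through the promotion script.

```python
# Branch owners, transcribed from the table. Actor names are hypothetical.
BRANCH_OWNERS = {
    "main": "human",
    "agent": "claude",
    "codex": "codex",
    "gemini": "gemini",
}


def may_commit(actor: str, branch: str) -> bool:
    """Return True if the actor may commit directly to the branch."""
    # Assumption: nobody commits directly to main; it only receives
    # promoted changes after human review.
    if branch == "main":
        return False
    return BRANCH_OWNERS.get(branch) == actor
```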

---

## Chat and Voice Access

| Device | Interface | Notes |
|--------|-----------|-------|
| Mac | Open WebUI — browser tab or home screen web app | Primary |
| iPhone | Open WebUI — browser shortcut, WireGuard VPN on | VPN required |
| Android tablet | Open WebUI — Phase Later | Planned |

### Voice Input

| Device | Tool | Status |
|--------|------|--------|
| Mac | SuperWhisper — local Parakeet model via WhisperKit, Apple Neural Engine | Active — all audio on-device |
| iPhone | Needs research (Apple Dictation is insufficient) | Open |

---

## Backup and Recovery

| Layer | Target | Status |
|-------|--------|--------|
| Local Restic | VPS disk — fast-restore cache | Exists — restore validation not complete |
| Offsite | Cloudflare R2 Standard | Active — intended primary DR copy |
| Third target (3-2-1 completion) | Home lab (Z640/NUC) | Not active — blocked by Z640 rebuild |

> **True 3-2-1 is not yet complete.** R2 is the current disaster-recovery copy.
> Restore path has not been validated end-to-end.

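The status above can be tracked as data. The sketch below lists what is still missing for 3-2-1 given the three targets in the table. It is a simplification: it uses distinct locations as a stand-in for the "two media" criterion, assumes production data lives on VPS01, and marks both active targets as unvalidated because no restore path has been validated end-to-end. All names (`BackupTarget`, `three_two_one_gaps`, `CURRENT`, the target labels) are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupTarget:
    name: str
    location: str            # where the copy physically lives
    active: bool
    restore_validated: bool


PRIMARY_LOCATION = "vps01"   # assumption: production data lives on VPS01


def three_two_one_gaps(targets: List[BackupTarget]) -> List[str]:
    """List the reasons the active targets fall short of 3-2-1."""
    active = [t for t in targets if t.active]
    gaps = []
    if len(active) < 3:
        gaps.append(f"only {len(active)} of 3 targets active")
    # Simplification: distinct locations stand in for distinct media.
    if len({t.location for t in active}) < 2:
        gaps.append("fewer than 2 distinct locations")
    if not any(t.location != PRIMARY_LOCATION for t in active):
        gaps.append("no offsite copy")
    for t in active:
        if not t.restore_validated:
            gaps.append(f"restore not validated: {t.name}")
    return gaps


# Current state as described in the table.
CURRENT = [
    BackupTarget("local-restic", "vps01", active=True, restore_validated=False),
    BackupTarget("r2-offsite", "cloudflare-r2", active=True, restore_validated=False),
    BackupTarget("home-lab", "home-lab", active=False, restore_validated=False),
]
```
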
---

## Open Architectural Decisions

| Topic | Summary | Status |
|-------|---------|--------|
| WireGuard hub failover | CHR01 is a single hub. Three options (CHR02 same-provider, move to VPS01, split roles). No decision made. | Open |
| VPS02 warm standby | Manual failover only. Sync method undefined. | Open |
| Three-lane model enforcement | Policy concept — not technically enforced. Routing and retention boundaries undefined. | Open |
| Business control-plane architecture | CRM, ticketing, PM selection reopened. Odoo not a settled answer. GLPI in scope for ticketing. Target: dashboard over multiple best-fit tools, not a monolithic app. | Open |
| Cross-agent verification loop | One agent answers → second independently verifies → discrepancies surface before operator acts. | Not started |

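The cross-agent verification loop in the table has not been started, so the following is only a sketch of the idea. It assumes each agent can be wrapped as a callable from question to text, and it compares answers by normalised string equality, which is a crude placeholder for whatever comparison the real loop would need. The names `Agent`, `Verdict`, and `cross_verify` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical agent interface: takes a question, returns an answer as text.
Agent = Callable[[str], str]


@dataclass(frozen=True)
class Verdict:
    answer: str
    check: str
    agrees: bool


def _normalise(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def cross_verify(question: str, answerer: Agent, verifier: Agent) -> Verdict:
    """Ask both agents the same question and flag any disagreement."""
    answer = answerer(question)
    # The verifier receives only the question, never the first answer,
    # so its result is produced independently.
    check = verifier(question)
    return Verdict(answer, check, _normalise(answer) == _normalise(check))
```
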
---

## Security Boundaries

| Boundary | Mechanism |
|----------|-----------|
| All external service access | WireGuard VPN required — no public-facing admin interfaces |
| Secrets management | Ansible vault — never in git |
| Agent execution boundary | No agent controls production execution directly — human Terminal gate for all Ansible runs |
| Private lane data | Never routed to frontier APIs — local Ollama only |
| SSH keys | Mac host only — not mounted in container |