← All missions

Every team wants AI-assisted operations. Not every team can send their logs, metrics, and incident data to OpenAI. If you operate under GDPR constraints, data sovereignty requirements, or strict security policies, on-prem AI is not optional.

I built this exact setup: 6 GPUs across 2 servers in the datacenter, vLLM serving, a private RAG pipeline over 1000+ PDFs with PostgreSQL/pgvector, and Graphia, an RBAC-aware SRE agent that lets engineering teams query their Grafana infrastructure without needing to know LogQL or PromQL.

Your data stays on your hardware. Models run on your GPUs. Nothing leaks.

What's included

  • Model serving: vLLM or Ollama deployment on your GPU-equipped nodes, with OpenAI-compatible API
  • Hardware assessment: GPU sizing, memory requirements, inference latency targets, quantization strategy (GGUF/AWQ)
  • RAG pipeline: document ingestion, vector store setup (PostgreSQL/pgvector), embedding pipeline, retrieval-optimized prompting
  • SRE agent integration: Graphia-style agent that connects to your observability stack (Grafana queries, log search, incident context) with RBAC
  • Security hardening: access control, audit logging, prompt injection defenses, network isolation
  • Cost analysis: inference cost per query vs. cloud API pricing, break-even on GPU investment

Deliverables

  • Local LLM inference endpoint with OpenAI-compatible API, running on your infra
  • Private RAG pipeline indexed on your documentation and runbooks
  • Grafana/observability integration proof-of-concept with RBAC
  • Deployment manifests (Proxmox VM or Kubernetes) with GPU passthrough configured
  • Security and access-control documentation
  • Model evaluation report: quality, latency, and cost per query vs. cloud alternatives

Tech stack

vLLMOllamaRAGGPUProxmoxPythonMCP