On-Prem AI for Operations

Every team wants AI-assisted operations. Not every team can send their logs, metrics, and incident data to OpenAI. If you operate under GDPR constraints, data sovereignty requirements, or strict security policies, on-prem AI is not optional.

I built this exact setup: 6 GPUs across 2 servers in the datacenter, vLLM serving, a private RAG pipeline over 1000+ PDFs with PostgreSQL/pgvector, and Graphia, an RBAC-aware SRE agent that lets engineering teams query their Grafana infrastructure without needing to know LogQL or PromQL.

Your data stays on your hardware. Models run on your GPUs. Nothing leaks.

What's included

Model serving: vLLM or Ollama deployment on your GPU-equipped nodes, with OpenAI-compatible API
Hardware assessment: GPU sizing, memory requirements, inference latency targets, quantization strategy (GGUF/AWQ)
RAG pipeline: document ingestion, vector store setup (PostgreSQL/pgvector), embedding pipeline, retrieval-optimized prompting
SRE agent integration: Graphia-style agent that connects to your observability stack (Grafana queries, log search, incident context) with RBAC
Security hardening: access control, audit logging, prompt injection defenses, network isolation
Cost analysis: inference cost per query vs. cloud API pricing, break-even on GPU investment

Deliverables

Local LLM inference endpoint with OpenAI-compatible API, running on your infra
Private RAG pipeline indexed on your documentation and runbooks
Grafana/observability integration proof-of-concept with RBAC
Deployment manifests (Proxmox VM or Kubernetes) with GPU passthrough configured
Security and access-control documentation
Model evaluation report: quality, latency, and cost per query vs. cloud alternatives

Tech stack

vLLMOllamaRAGGPUProxmoxPythonMCP