Missions on RetakeData

Observability Stack Audit

You have dashboards nobody trusts, alerts that fire at 3 AM for no reason, and a Loki bill that keeps climbing. The stack works, technically. But it is not working for your team.

I operated an 11 TB/day observability stack across Loki (400 pods, 3 TB RAM), Elasticsearch (200 TB, 3 datacenters), and Thanos (50 TB, 200 nodes). My specific work: migrating from Fluentd to Vector for performance, tuning Loki with Memcached clusters, building recording rules that converted terabytes per day of load-balancer logs into queryable metrics, and maintaining 12-month Thanos storage for ML forecasting.

VM Migration Factory

Most teams operating hundreds of VMs do it manually: ticket, provision, configure, register, repeat. Every VM is slightly different. Every migration is a project. Every provider has its own quirks that someone has to remember.

This mission replaces that with a pipeline. I built this exact system to manage 3000+ VMs across 4 providers and 10 datacenters, without Kubernetes. You open a PR with a tfvars file, merge it, and the VM goes from nothing to configured, monitored, and receiving traffic.

Proxmox / Ceph HA Platform

VMware licensing changes pushed a lot of teams to look for alternatives. Proxmox is the answer, but a Proxmox cluster that survives real failure scenarios needs proper Ceph design, network segmentation, quorum tuning, and automated provisioning.

I deployed 100+ Proxmox nodes with PXE automation across every storage backend: ZFS, NFS, SAN, NVMe over Fabrics, and Ceph. I led vSphere-to-Proxmox migrations (Pure Storage SAN, NVMe-oF, multipathd), upgraded Proxmox from version 4 to 8 with near-zero downtime using an NFS buffer, and designed HA architectures with LACP/EVPN/VPC.

On-Prem AI for Operations

Every team wants AI-assisted operations. Not every team can send their logs, metrics, and incident data to OpenAI. If you operate under GDPR constraints, data sovereignty requirements, or strict security policies, on-prem AI is not optional.

I built this exact setup: 6 GPUs across 2 servers in the datacenter, vLLM serving, a private RAG pipeline over 1000+ PDFs with PostgreSQL/pgvector, and Graphia, an RBAC-aware SRE agent that lets engineering teams query their Grafana infrastructure without needing to know LogQL or PromQL.

Skills Context — Sabri’s Real Experience

Working doc for posts, missions, and narrative refinement. NOT part of the Hugo build. Reference-only.

1. Observability

Grafana

Scale: 2000 users across 10 instances
Deployment: Both on-prem (deb packages + databases) and fully containerized (deployed via Helm)
Datasources: Elasticsearch, VictoriaLogs, Splunk, Loki, Prometheus metrics
Dashboarding: Heavy transformation work across multiple datasources
Tooling built: “Grafana Housekeeping” — gathers all resources (users, dashboards, alerts, contact points, datasources), checks for stale/broken/unused, sends reporting to Jira for manager-driven cleanups
MCP: Used Grafana MCP
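
The stale/broken/unused check in "Grafana Housekeeping" can be pictured with a small sketch. This is not the tool's actual code: the `Dashboard` fields, the 90-day threshold, and the reason strings are all assumptions made for illustration, and how the inventory is collected from Grafana (or how findings reach Jira) is left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Dashboard:
    # Hypothetical inventory record; field names are illustrative only.
    uid: str
    title: str
    last_viewed: Optional[datetime]  # None means no recorded view
    datasource_uids: list[str]


def find_stale(dashboards, known_datasources, now, max_age=timedelta(days=90)):
    """Return (uid, reason) pairs for dashboards worth a manual review."""
    findings = []
    for d in dashboards:
        missing = [u for u in d.datasource_uids if u not in known_datasources]
        if missing:
            # A dashboard pointing at a deleted datasource is broken, whatever its age.
            findings.append((d.uid, "broken: unknown datasource " + ", ".join(missing)))
        elif d.last_viewed is None:
            findings.append((d.uid, "unused: never viewed"))
        elif now - d.last_viewed > max_age:
            days = (now - d.last_viewed).days
            findings.append((d.uid, f"stale: not viewed for {days} days"))
    return findings
```

Each finding carries a human-readable reason, so a report built from this list could plausibly be handed to a manager for cleanup decisions rather than acted on automatically.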

Prometheus / Thanos

Scale: Thanos cluster with a global querier, so Grafana queries one entry point instead of many scattered Thanos instances
Topology: Storage gateway cache + read gateway cache across 4 different clusters
Storage: ~50TB managed, long-term (12-month) bucket
Nodes: ~200 replica nodes (8GB RAM, 3 CPU each)
Project focus: Maintaining and adding long-term storage for ML toolbox forecasting

Loki

Scale: 4 clusters, ~400 pods total, ~3TB RAM + significant CPU
Cache: 2TB Memcached cluster for 24h hot storage
Work: Fine-tuning, installing cluster cache, experimenting with metric-splitting configurations
Recording rules: Converted TB/day load-balancer logs into metrics
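
To make the log-to-metric idea concrete, here is a minimal sketch of the reduction a recording rule performs: many raw log lines collapse into a handful of counters. In Loki this would be expressed as a LogQL rule evaluated by the ruler, not as Python. The log format used here (`<ISO timestamp> <backend> <status> <latency_ms>`) and the label choice are assumptions for illustration only.

```python
from collections import Counter


def logs_to_counters(lines):
    """Reduce raw load-balancer log lines to request counts
    keyed by (minute, backend, status class)."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 4 or not parts[2].isdigit():
            continue  # skip lines that do not match the assumed format
        timestamp, backend, status = parts[0], parts[1], parts[2]
        minute = timestamp[:16]          # "2024-06-01T12:03:44Z" -> "2024-06-01T12:03"
        status_class = status[0] + "xx"  # "502" -> "5xx"
        counts[(minute, backend, status_class)] += 1
    return counts
```

The point of the design is cardinality: a dashboard that reads a few counters per minute stays fast no matter how many log lines produced them, while the raw logs only need to be queried for drill-down.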

Vector

Migration: Migrated from Fluentd to Vector due to Fluentd performance issues
Scope: Full pipeline migration + log transformations

ELK

Full stack: Beats, Kafka, Logstash (heavy transformations, also shipped to Splunk HEC), Elasticsearch, APM + RUM
Scale: 200TB across 3 datacenters, high-availability setup
Users: Team of 20 developers
Compliance: Different retention policies for gambling regulator purposes

2. Kubernetes & Platform

Kubernetes

Level: Primarily user-level (not cluster administrator), operates apps via CI/CD
Workflow: Jenkins + Makefile → build → push to JFrog → merged and deployed via ArgoCD
Helm: Created Helm charts for own apps (Graphia, search query exporter, Grafana Housekeeping)
Apps deployed: Graphia (SRE agent), search query exporter, Grafana Housekeeping tool

Search Query Exporter (app built at FDJ)

Purpose: Measure search latency as users actually experience it, to feed SLOs/SLIs
How it works: Queries Splunk, Thanos, and Loki across multiple time ranges (10m, 1h, 6h, 24h, 7d, 30d)
Output: Checks how the platform responds over time, compares against SLO budgets, flags when too slow
Design: Easily configurable, multi-backend querying
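
The measurement loop described above can be sketched as follows. This is an illustration under stated assumptions, not the exporter's real code: each backend is represented by a hypothetical callable that runs one query for a given time range, the budgets are made-up numbers supplied by the caller, and exposing the results as metrics is omitted.

```python
import time

# Time ranges taken from the description above.
RANGES = ["10m", "1h", "6h", "24h", "7d", "30d"]


def measure(backends, budgets, clock=time.perf_counter):
    """Time one query per (backend, range) and flag budget overruns.

    backends: dict of name -> callable(range_str), a stand-in for a real
              Splunk, Thanos, or Loki client (hypothetical interface).
    budgets:  dict of range_str -> maximum acceptable seconds.
    Returns a list of (backend, range, elapsed_seconds, over_budget).
    """
    results = []
    for name, run_query in backends.items():
        for rng in RANGES:
            start = clock()
            run_query(rng)
            elapsed = clock() - start
            results.append((name, rng, elapsed, elapsed > budgets[rng]))
    return results
```

Passing the clock in as a parameter keeps the loop testable with a fake clock, and keeping backends as plain callables is one way to get the "easily configurable, multi-backend" property without tying the loop to any client library.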

3. CI/CD

TODO: Still need detail on Jenkins, GitLab CI, GitHub Actions specifically

4. Automation & IaC

PXE / Proxmox

Automated PXE installation for Proxmox nodes using Ansible

Full VM Delivery Pipeline (“Kubernetes without Kubernetes”)

The complete flow, end to end: