<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Missions on RetakeData</title><link>https://retakedata.com/missions/</link><description>Recent content in Missions on RetakeData</description><generator>Hugo</generator><language>en</language><copyright>&lt;a href="https://creativecommons.org/licenses/by-nc/4.0/" target="_blank" rel="noopener">CC BY-NC 4.0&lt;/a></copyright><atom:link href="https://retakedata.com/missions/index.xml" rel="self" type="application/rss+xml"/><item><title>Observability Stack Audit</title><link>https://retakedata.com/missions/observability-stack-audit/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://retakedata.com/missions/observability-stack-audit/</guid><description>&lt;p>You have dashboards nobody trusts, alerts that fire at 3 AM for no reason, and a Loki bill that keeps climbing. The stack works, technically. But it is not working for your team.&lt;/p>
&lt;p>I operated an 11 TB/day observability stack across Loki (400 pods, 3TB RAM), Elasticsearch (200TB, 3 DCs), and Thanos (50TB, 200 nodes). My specific work: migrating Fluentd to Vector for performance, fine-tuning Loki with memcache clusters, building recording rules that converted TB/day of load-balancer logs into queryable metrics, and maintaining 12-month Thanos storage for ML forecasting.&lt;/p></description></item><item><title>VM Migration Factory</title><link>https://retakedata.com/missions/vm-migration-factory/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://retakedata.com/missions/vm-migration-factory/</guid><description>&lt;p>Most teams operating hundreds of VMs do it manually: ticket, provision, configure, register, repeat. Every VM is slightly different. Every migration is a project. Every provider has its own quirks that someone has to remember.&lt;/p>
&lt;p>This mission replaces that with a pipeline. I built this exact system to manage 3000+ VMs across 4 providers and 10 datacenters, without Kubernetes. You open a PR with a tfvars file, merge it, and the VM goes from nothing to configured, monitored, and receiving traffic.&lt;/p></description></item><item><title>Proxmox / Ceph HA Platform</title><link>https://retakedata.com/missions/proxmox-ceph-ha-platform/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://retakedata.com/missions/proxmox-ceph-ha-platform/</guid><description>&lt;p>VMware licensing changes pushed a lot of teams to look for alternatives. Proxmox is the answer, but a Proxmox cluster that survives real failure scenarios needs proper Ceph design, network segmentation, quorum tuning, and automated provisioning.&lt;/p>
&lt;p>I deployed 100+ Proxmox nodes with PXE automation across every storage backend: ZFS, NFS, SAN, NVMe over Fabrics, and Ceph. I led vSphere-to-Proxmox migrations (Pure Storage SAN, NVMe-oF, multipathd), ran Proxmox 4-to-8 upgrades with near-zero downtime using an NFS buffer, and designed HA architectures with LACP/EVPN/VPC.&lt;/p></description></item><item><title>On-Prem AI for Operations</title><link>https://retakedata.com/missions/onprem-ai-operations/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://retakedata.com/missions/onprem-ai-operations/</guid><description>&lt;p>Every team wants AI-assisted operations. Not every team can send their logs, metrics, and incident data to OpenAI. If you operate under GDPR constraints, data sovereignty requirements, or strict security policies, on-prem AI is not optional.&lt;/p>
&lt;p>I built this exact setup: 6 GPUs across 2 servers in the datacenter, vLLM serving, a private RAG pipeline over 1000+ PDFs with PostgreSQL/pgvector, and Graphia, an RBAC-aware SRE agent that lets engineering teams query their Grafana infrastructure without needing to know LogQL or PromQL.&lt;/p></description></item><item><title/><link>https://retakedata.com/missions/skills-context/</link><pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate><guid>https://retakedata.com/missions/skills-context/</guid><description>&lt;h1 id="skills-context--sabris-real-experience">Skills Context — Sabri&amp;rsquo;s Real Experience&lt;/h1>
&lt;blockquote>
&lt;p>Working doc for posts, missions, and narrative refinement.
NOT part of the Hugo build. Reference-only.&lt;/p>
&lt;/blockquote>
&lt;hr>
&lt;h2 id="1-observability">1. Observability&lt;/h2>
&lt;h3 id="grafana">Grafana&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Scale&lt;/strong>: 2000 users across 10 instances&lt;/li>
&lt;li>&lt;strong>Deployment&lt;/strong>: Both on-prem (deb packages + DBs) and fully containerized via Helm&lt;/li>
&lt;li>&lt;strong>Datasources&lt;/strong>: Elasticsearch, VictoriaLogs, Splunk, Loki, Prometheus metrics&lt;/li>
&lt;li>&lt;strong>Dashboarding&lt;/strong>: Heavy transformation work across multiple datasources&lt;/li>
&lt;li>&lt;strong>Tooling built&lt;/strong>: &amp;ldquo;Grafana Housekeeping&amp;rdquo; — gathers all resources (users, dashboards, alerts, contact points, datasources), checks for stale/broken/unused, sends reporting to Jira for manager-driven cleanups&lt;/li>
&lt;li>&lt;strong>MCP&lt;/strong>: Used Grafana MCP&lt;/li>
&lt;/ul>
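&lt;p>The stale-resource check in a housekeeping tool of this kind can be sketched roughly as follows. This is a minimal illustration, not the actual tool: the &lt;code>title&lt;/code> and &lt;code>last_viewed&lt;/code> fields, the 90-day threshold, and the &lt;code>find_stale&lt;/code> helper are all assumptions made for the example, and the Grafana and Jira calls are left out.&lt;/p>

```python
from datetime import datetime, timedelta, timezone


def find_stale(dashboards, now, max_age_days=90):
    """Return titles of dashboards not viewed within max_age_days.

    `dashboards` is a list of dicts with a hypothetical shape:
    {"title": str, "last_viewed": datetime or missing}.
    """
    cutoff = timedelta(days=max_age_days)  # assumed threshold
    stale = []
    for dashboard in dashboards:
        last_viewed = dashboard.get("last_viewed")
        # A dashboard with no recorded view is treated as stale too.
        if last_viewed is None or now - last_viewed > cutoff:
            stale.append(dashboard["title"])
    return stale


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        {"title": "lb-overview", "last_viewed": now - timedelta(days=3)},
        {"title": "old-poc", "last_viewed": now - timedelta(days=400)},
        {"title": "never-opened"},
    ]
    for title in find_stale(sample, now):
        print(f"stale dashboard: {title}")
```

&lt;p>Passing &lt;code>now&lt;/code> in explicitly keeps the check deterministic, which makes it easy to run the same report against a saved inventory.&lt;/p>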
&lt;h3 id="prometheus--thanos">Prometheus / Thanos&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Scale&lt;/strong>: Thanos cluster with global querier (to avoid having scattered Thanos instances in Grafana)&lt;/li>
&lt;li>&lt;strong>Topology&lt;/strong>: Storage gateway cache + read gateway cache across 4 different clusters&lt;/li>
&lt;li>&lt;strong>Storage&lt;/strong>: ~50TB managed, long-term (12-month) bucket&lt;/li>
&lt;li>&lt;strong>Nodes&lt;/strong>: ~200 replica nodes (8GB RAM, 3 CPU each)&lt;/li>
&lt;li>&lt;strong>Project focus&lt;/strong>: Maintaining and adding long-term storage for ML toolbox forecasting&lt;/li>
&lt;/ul>
&lt;h3 id="loki">Loki&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Scale&lt;/strong>: 4 clusters, ~400 pods total, ~3TB RAM + significant CPU&lt;/li>
&lt;li>&lt;strong>Cache&lt;/strong>: 2 TB Memcached cluster for 24h hot storage&lt;/li>
&lt;li>&lt;strong>Work&lt;/strong>: Fine-tuning, installing cluster cache, experimenting with metric-splitting configurations&lt;/li>
&lt;li>&lt;strong>Recording rules&lt;/strong>: Converted TB/day load-balancer logs into metrics&lt;/li>
&lt;/ul>
&lt;h3 id="vector">Vector&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Migration&lt;/strong>: Migrated from Fluentd to Vector due to Fluentd performance issues&lt;/li>
&lt;li>&lt;strong>Scope&lt;/strong>: Full pipeline migration + log transformations&lt;/li>
&lt;/ul>
&lt;h3 id="elk">ELK&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Full stack&lt;/strong>: Beats, Kafka, Logstash (heavy transformations, also shipped to Splunk HEC), Elasticsearch, APM + RUM&lt;/li>
&lt;li>&lt;strong>Scale&lt;/strong>: 200TB across 3 datacenters, high-availability setup&lt;/li>
&lt;li>&lt;strong>Users&lt;/strong>: Team of 20 developers&lt;/li>
&lt;li>&lt;strong>Compliance&lt;/strong>: Separate retention policies to meet gambling-regulator requirements&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="2-kubernetes--platform">2. Kubernetes &amp;amp; Platform&lt;/h2>
&lt;h3 id="kubernetes">Kubernetes&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Level&lt;/strong>: Primarily user-level (not a cluster administrator), operating apps via CI/CD&lt;/li>
&lt;li>&lt;strong>Workflow&lt;/strong>: Jenkins + Makefile → build → push to JFrog → merged and deployed via ArgoCD&lt;/li>
&lt;li>&lt;strong>Helm&lt;/strong>: Created Helm charts for own apps (Graphia, search query exporter, Grafana Housekeeping)&lt;/li>
&lt;li>&lt;strong>Apps deployed&lt;/strong>: Graphia (SRE agent), search query exporter, Grafana Housekeeping tool&lt;/li>
&lt;/ul>
&lt;h3 id="search-query-exporter-app-built-at-fdj">Search Query Exporter (app built at FDJ)&lt;/h3>
&lt;ul>
&lt;li>&lt;strong>Purpose&lt;/strong>: Measure search latency as users actually experience it, to feed SLOs/SLIs&lt;/li>
&lt;li>&lt;strong>How it works&lt;/strong>: Queries Splunk, Thanos, and Loki across multiple time ranges (10m, 1h, 6h, 24h, 7d, 30d)&lt;/li>
&lt;li>&lt;strong>Output&lt;/strong>: Checks how the platform responds over time, compares against SLO budgets, flags when too slow&lt;/li>
&lt;li>&lt;strong>Design&lt;/strong>: Easily configurable, multi-backend querying&lt;/li>
&lt;/ul>
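&lt;p>The probing loop described above can be sketched as follows. This is a rough illustration under stated assumptions, not the exporter's real code: the backend callables stand in for the Splunk, Thanos, and Loki clients, and the per-range budgets, &lt;code>ProbeResult&lt;/code>, and &lt;code>run_probes&lt;/code> are hypothetical names chosen for the example.&lt;/p>

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List

# Time ranges taken from the notes above.
TIME_RANGES = ["10m", "1h", "6h", "24h", "7d", "30d"]


@dataclass
class ProbeResult:
    backend: str
    time_range: str
    latency_s: float
    budget_s: float

    @property
    def too_slow(self) -> bool:
        # Flag the probe when measured latency exceeds its SLO budget.
        return self.latency_s > self.budget_s


def run_probes(
    backends: Dict[str, Callable[[str], object]],
    budgets: Dict[str, float],
    clock: Callable[[], float] = time.perf_counter,
) -> List[ProbeResult]:
    """Time one query per backend and time range, then attach the budget."""
    results = []
    for name, query in backends.items():
        for time_range in TIME_RANGES:
            start = clock()
            query(time_range)  # hypothetical client call, result ignored
            elapsed = clock() - start
            results.append(
                ProbeResult(name, time_range, elapsed, budgets[time_range])
            )
    return results


if __name__ == "__main__":
    # Placeholder backend and budgets (in seconds), for illustration only.
    backends = {"loki": lambda time_range: time.sleep(0.01)}
    budgets = {time_range: 2.0 for time_range in TIME_RANGES}
    for result in run_probes(backends, budgets):
        status = "SLOW" if result.too_slow else "ok"
        print(result.backend, result.time_range, f"{result.latency_s:.3f}s", status)
```

&lt;p>Keeping each backend behind a plain callable is one way to get the multi-backend, easily configurable shape the notes mention: adding a backend means adding one entry to the mapping.&lt;/p>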
&lt;hr>
&lt;h2 id="3-cicd">3. CI/CD&lt;/h2>
&lt;ul>
&lt;li>TODO: Still need detail on Jenkins, GitLab CI, GitHub Actions specifically&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="4-automation--iac">4. Automation &amp;amp; IaC&lt;/h2>
&lt;h3 id="pxe--proxmox">PXE / Proxmox&lt;/h3>
&lt;ul>
&lt;li>Automated PXE installation for Proxmox nodes using Ansible&lt;/li>
&lt;/ul>
&lt;h3 id="full-vm-delivery-pipeline-kubernetes-without-kubernetes">Full VM Delivery Pipeline (&amp;ldquo;Kubernetes without Kubernetes&amp;rdquo;)&lt;/h3>
&lt;p>The complete flow, end to end:&lt;/p></description></item></channel></rss>