← All missions

Skills Context — Sabri’s Real Experience

Working doc for posts, missions, and narrative refinement. NOT part of the Hugo build. Reference-only.


1. Observability

Grafana

  • Scale: 2000 users across 10 instances
  • Deployment: Both on-prem (deb packages + DBs) and full Docker via Helm
  • Datasources: Elasticsearch, VictoriaLogs, Splunk, Loki, Prometheus metrics
  • Dashboarding: Heavy transformation work across multiple datasources
  • Tooling built: “Grafana Housekeeping” — gathers all resources (users, dashboards, alerts, contact points, datasources), checks for stale/broken/unused, sends reporting to Jira for manager-driven cleanups
  • MCP: Used Grafana MCP

Prometheus / Thanos

  • Scale: Thanos cluster with global querier (to avoid having scattered Thanos instances in Grafana)
  • Topology: Storage gateway cache + read gateway cache across 4 different clusters
  • Storage: ~50TB managed, long-term (12-month) bucket
  • Nodes: ~200 replica nodes (8GB RAM, 3 CPU each)
  • Project focus: Maintaining and adding long-term storage for ML toolbox forecasting

Loki

  • Scale: 4 clusters, ~400 pods total, ~3TB RAM + significant CPU
  • Cache: 2TB memcache cluster for 24h hot storage
  • Work: Fine-tuning, installing cluster cache, experimenting with metric-splitting configurations
  • Recording rules: Converted TB/day load-balancer logs into metrics

Vector

  • Migration: Migrated from Fluentd to Vector due to Fluentd performance issues
  • Scope: Full pipeline migration + log transformations

ELK

  • Full stack: Beats, Kafka, Logstash (heavy transformations, also shipped to Splunk HEC), Elasticsearch, APM + RUM
  • Scale: 200TB across 3 datacenters, high-availability setup
  • Users: Team of 20 developers
  • Compliance: Different retention policies for gambling regulator purposes

2. Kubernetes & Platform

Kubernetes

  • Level: Primarily user-level (not cluster administrator), operates apps via CI/CD
  • Workflow: Jenkins + Makefile → build → push to JFrog → merged and deployed via ArgoCD
  • Helm: Created Helm charts for own apps (Graphia, search query exporter, Grafana Housekeeping)
  • Apps deployed: Graphia (SRE agent), search query exporter, Grafana Housekeeping tool

Search Query Exporter (app built at FDJ)

  • Purpose: Measure real user-experience search performance latency for SLO/SLIs
  • How it works: Queries Splunk, Thanos, and Loki across multiple time ranges (10m, 1h, 6h, 24h, 7d, 30d)
  • Output: Checks how the platform responds over time, compares against SLO budgets, flags when too slow
  • Design: Easily configurable, multi-backend querying

3. CI/CD

  • TODO: Still need detail on Jenkins, GitLab CI, GitHub Actions specifically

4. Automation & IaC

PXE / Proxmox

  • Automated PXE installation for Proxmox nodes using Ansible

Full VM Delivery Pipeline (“Kubernetes without Kubernetes”)

The complete flow, end to end:

Phase 1 — Request (Git PR)

  • Create a tfvars with VM config. ~80% of variables are shared across providers, ~20% provider-specific (e.g. network on OpenStack, flavors)
  • Push to git branch, open PR

Phase 2 — Plan (Jenkins)

  • Jenkins detects the directory, picks the correct Terraform workspace (env-scoped to avoid massive plan/apply times)
  • Runs terraform plan on the PR for review

Phase 3 — Apply (on merge)

  • On merge, Jenkins runs terraform apply
  • State backend: Consul (for state locking)
  • VM is created → automatically registers in Consul and NetBox

Phase 4 — Configuration (Ansible auto-discovery)

  • The tfvars includes an ansible_playbook parameter (e.g. “elasticsearch”)
  • On VM creation, Jenkins sets a NetBox custom field: ansible_playbook: elasticsearch
  • Another custom field: configured: false (default)
  • Jenkins launches a meta-playbook that uses dynamic inventory to find all VMs where configured: false
  • Reads the ansible_playbook field, runs that specific playbook on the VM
  • On success, flips configured: true in NetBox

Phase 5 — Service Registration (Consul → HAProxy)

  • Consul service list feeds HAProxy for automatic backend host registration
  • New VM is live and receiving traffic with zero manual intervention

Phase 6 — Monitoring (Centreon auto-registration)

  • Built a custom Terraform provider for Centreon (terraform-provider-centreon, open-sourced on GitHub)
  • VMs are automatically registered in Centreon monitoring as part of the pipeline
  • Built because no Centreon Terraform provider existed at the time

Context: No Kubernetes available, so they built orchestration primitives from existing tooling to handle hundreds of VMs across datacenters during migrations.

Jenkins + Terraform (Atlantis-like)

  • Part of the pipeline above: Jenkins auto-runs terraform apply on merge with diff review

Terraform Multi-Provider Module System

  • Created a module template system using the same variable files across providers
  • Supported providers: vSphere, Proxmox, OpenStack, NetBox
  • ~80% shared variables, ~20% provider-specific

Cloud & Infrastructure

AWS

  • Usage: Deployed ECS EC2, CloudWatch
  • Cost optimization: Used CUR (Cost and Usage Report) to find and solve budget issues, helped teams build dashboards
  • Self-assessment: Not very strong, needs more depth. Being honest about this.

OpenStack

  • Admin: Deployed a small 8-node cluster with Ceph, then migrated to Proxmox for ease of use (“it was a bit of a machine”)
  • User: Heavy Terraform user on OpenStack — deployed across multiple DCs (volumes, images, etc.)
  • Positioning: Knows the platform well as a consumer, less as an admin

OVH Cloud

  • Usage: Used like Hetzner — VMs and dedicated servers for client deployments

Proxmox

  • Scale: Deployed 100+ Proxmox nodes
  • Automation: Automated installation via PXE + Ansible
  • Storage backends: NFS, ZFS, SAN, NVMe-oF (NVMe over Fabric), Ceph (managed by Proxmox)
  • Core infrastructure skill — deep operational experience

Ceph

  • Deployment: Big Ceph RBD and Ceph S3 for Loki/OpenStack, deployed in VMs with cephadm
  • Scope: Helped deploy and operate

Linux

  • Standard sysadmin-level, “nothing too fancy”

6. Data & Messaging

PostgreSQL

  • Clustering: Installed Patroni + ETCD for HA clustering
  • Backup: PGBackrest
  • Frontend: Dedicated HAProxy frontend for the DBs
  • Level: Solid operational experience

MySQL

  • Installed and managed several times, no clustering experience

SQL Server

  • Same level as MySQL — install and manage

Kafka

  • Scale: 5 Kafka nodes
  • Use case: Buffer for Logstash → Elasticsearch pipelines
  • Topics: Different topics per pipeline, fine-tuned for the workload
  • Coordination: Zookeeper (maintained it — “was a pain but it’s powerful/complicated”)

Redis

  • Tried OSS clustering, wasn’t good enough for the requirements
  • Switched to Couchbase (paid) instead

RabbitMQ

  • Messaging and message routing, standard cluster usage, nothing fancy

Couchbase

  • Impact: Big part of unibet.fr performance gains
  • Scale: 10 servers x 64GB RAM
  • Design: Multiple indexes, split and sharded
  • Role: Managed everything AND advised developers on how to use it for performance gains
  • Note: Replaced Redis because Redis OSS clustering wasn’t cutting it

7. Networking & Delivery

Load Balancers

  • Breadth: F5, F5 Cloud, Nginx, HAProxy (with BGP), Envoy, Traefik
  • DDoS mitigation project: Big project with F5 Cloud using rate limiting and various tools
  • HAProxy + BGP: Used for the Consul-fed auto-registration backend system (from VM pipeline)

DNS

  • Tools: BIND, PowerDNS, Cloudflare DNS
  • Notable: Multi-DC BIND public DNS with secured zone transfers (TSIG keys)
  • Scale: Nothing massive, but solid

VPN

  • Tech: WireGuard, IPsec, GRE, OpenVPN
  • Use cases: Router-to-router and client VPN
  • Level: Standard usage, no major project

Firewalls

  • Big project: Migrated from SPOF FortiGuard perimeter firewall to BGP routers, pushing iptables/nftables rules to 1000+ VMs via Ansible
  • Rationale: Moved firewalling to the VM level instead of the router (eliminated single point of failure)
  • Hardware experience: Cisco ASA, Firepower, FortiGuard (clustering, VDOM)

Networking (Hardware)

  • Switches/Routers managed:
    • Cisco Nexus 3000 (installation + management)
    • Juniper MX204 (BGP, limited experience)
    • Juniper EX series
  • NDR / Traffic Analysis: Installed ExtraHop NDR with Prismatic TAP (network traffic mirroring for analysis)
  • Level: Solid sysadmin-adjacent networking, not a dedicated network engineer

8. Programming & AI

Go

  • Limited experience, mainly the Centreon Terraform provider

Python

  • Built multiple apps (SSHPlex, Search Query Exporter, Grafana Housekeeping, Graphia)
  • Full CI/CD lifecycle: build, deploy, publish to PyPI
  • Strong operational Python, not a software engineer

Bash

  • Heavy scripting for system installs and automation glue

HCL

  • Extensive Terraform + Consul usage

YAML / JSON

  • Ansible, Python configs, standard tooling

AI / LLM Infrastructure

  • GPU deployment: Built and installed 6 GPUs in datacenter across 2x 2U servers (4-GPU capability each)
  • Serving: vLLM, Python client
  • RAG pipeline: Embedder model + 1000+ PDFs + PostgreSQL vector DB + GPT OSS 20B
  • Agent: Built a full-loop agent with safeguards and RBAC (team members get limited resource access)
  • Graphia: SRE agent for Grafana diagnosis — RBAC-aware, MCP-based. Arch diagram incoming from work laptop.

MCP (Model Context Protocol)

  • Installed many MCP servers
  • Built a Graphia MCP for devs/infra engineers (same RBAC, usage tracking)

9. Security

Unibet.fr Scope Responsibility

  • Vulnerability management: Followed up on Tenable/Nessus vulns and patches
  • Architecture review: Checked PoCs and provided security inputs for new implementations (BIND, Packer, etc.)
  • DDoS mitigation: Installed F5 Cloud as mitigation layer
  • Network monitoring: Installed ExtraHop + alerts (DB dump detection, network anomalies)

10. Physical / Datacenter

Rack Installation

  • Installed 4-5 full racks from scratch: network → server → SAN for virtualization
  • Knows cabling, on-site intervention

NetBox + HA Architecture Design

  • Managed NetBox as source of truth
  • Designed HA Proxmox architectures with LACP/EVPN/VPC

11. Learning Gaps (self-identified)

  • AWS / cloud depth
  • BGP routing
  • eBPF
  • Full rack design best practices (A to Z)
  • FinOps
  • Formal SRE incident management processes
  • Tracing (Grafana Alloy, OpenTelemetry) — did a little bit

12. Team & Leadership

Working Style

  • 80% solo: Most projects owned end-to-end
  • 20% team/project lead: For Proxmox and infrastructure projects
    • Splits tasks, does knowledge transfer
    • Uses RAPID framework for validation
    • Runs Spike sessions before starting to scope properly
    • Weekly meetings to check blockers and progress
    • Still does ~50% of the hands-on work even when leading

13. Scale Numbers (Consolidated)

VM Fleet

  • Peak: ~3000 VMs under management
  • Migrations: Recreated most of them for cross-DC migrations (vSphere → Proxmox, then some → OpenStack, then OpenStack DC → OpenStack DC)

Datacenters

  • Org scale: 30-40 DCs total, used 10 for replications (OpenStack, mostly usage not admin)
  • Directly managed: 2 DCs / 8 racks at one point
  • VINC era: 4 racks / 30 nodes

Ingestion (The Real Numbers)

  • Loki: 8 TB/day
  • Elasticsearch: 2 TB/day
  • Thanos: 1 TB/day
  • Combined: ~11 TB/day

14. War Stories

The Champions League Final Incident

  • Context: Gambling platform — when a match settles (especially Champions League final), 10,000+ users connect within the SAME MINUTE. Business-driven traffic spike, not a DDoS, but hits like one.
  • The change: Migrated firewall to nftables. Worked fine for 30 days.
  • The incident: During a Champions League final, massive user spike. nftables conntrack table filled up → some users couldn’t connect, others could. TCP window times were extremely long.
  • The fix: Killed all HAProxy and nftables, bumped nftables conntrack limits, set aggressive TCP windows and TTL values.
  • How they found it: Grafana dashboard caught the anomaly.
  • Lesson: Conntrack sizing matters for bursty traffic. Default values don’t survive gambling-scale spikes.

16. Migration Projects (Detail)

a. vSphere → Proxmox Migration

  • Driver: vSphere licenses costly + EoL
  • Approach: Installed Proxmox, converted/rebuilt VMs node by node from vSphere
  • Storage: Pure Storage SAN with NVMe-oF (MultipathD)
  • Backup: Found and integrated Proxmox Backup Server as part of the migration

b. Proxmox 4 → Proxmox 8 Upgrade

  • Old cluster: 6 nodes, Proxmox 4
  • New cluster: 4 nodes, Proxmox 8
  • Challenge: OpenVZ containers to convert with zero or low downtime
  • Solution: Used NFS as a buffer storage during the migration

c. OpenStack DC → OpenStack DC Migration

  • Driver: Source DC got decommissioned
  • Approach: Used Terraform to rebuild VMs, Ansible to install, then switched HAProxy backends to new DC bit by bit

d. Elasticsearch 200TB Cluster Migration

  • Scope: 200TB ES cluster, DC to DC
  • Approach: Rebuilt nodes 1 by 1 in the new DC, waited for shard rebalancing or forced shards to new nodes

17. Graphia — Pain Points Solved

The Problem

  • Grafana dashboards require deep expertise to build well (“need a PhD”)
  • Companies have 1000+ datasources — onboarding is slow
  • Query languages (LogQL, PromQL, Splunk SPL) are a barrier
  • Correlating metrics and logs across backends is manual and slow
  • Jumping on an alert incident means starting from scratch every time

What Graphia Does

  • Removes the need to be a dashboard expert
  • Abstracts away datasource complexity (fast onboarding)
  • Eliminates query language knowledge requirement
  • Correlates data across metrics and logs
  • Prepares a battle plan when jumping on an alert incident
  • RBAC-aware: team members get access only to their permitted resources
  • Usage tracking built in

18. Honesty Notes (from Sabri)

  • The 11TB/day observability stack was TEAM work, not solo. Don’t oversell as “I built this alone.”
  • Loki was mostly already there — his contribution was fine-tuning + Fluentd→Vector migration + recording rules
  • He didn’t deploy 100% of the 3000 VMs himself — he BUILT THE TOOLING that enabled it
  • Position as “operated and optimized” or “built the automation for” rather than “single-handedly ran”
  • He’d be nervous taking on a solo “build me a new rack that ingests 10TB/day” mission

15. The “Why Freelance” Story

Kindred Acquisition

  • Before Kindred acquired (specifically France): startup-like culture, 5 people in infra, 10 in dev. Things moved fast.
  • After acquisition: all process and drama. Speed died.
  • Took a voluntary departure plan (“Plan de départ volontaire”) — the financial runway made the exit possible.