Skills Context — Sabri’s Real Experience
Working doc for posts, missions, and narrative refinement. NOT part of the Hugo build. Reference-only.
1. Observability
Grafana
- Scale: 2000 users across 10 instances
- Deployment: Both on-prem (deb packages + DBs) and full Docker via Helm
- Datasources: Elasticsearch, VictoriaLogs, Splunk, Loki, Prometheus metrics
- Dashboarding: Heavy transformation work across multiple datasources
- Tooling built: “Grafana Housekeeping” — gathers all resources (users, dashboards, alerts, contact points, datasources), checks for stale/broken/unused, sends reporting to Jira for manager-driven cleanups
- MCP: Used Grafana MCP
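A minimal sketch of the kind of stale/broken check described for Grafana Housekeeping. The record shape (`uid`, `last_viewed`, `datasource_ok`), the 90-day cutoff, and the function name are hypothetical; the real tool's rules are not documented here.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: the real tool's cutoff is not stated in this doc.
STALE_AFTER = timedelta(days=90)


def classify_dashboards(dashboards, now):
    """Split dashboard records into 'stale', 'broken', and 'ok' buckets.

    Each record is an assumed dict with:
      uid           - dashboard identifier
      last_viewed   - timezone-aware datetime, or None if never viewed
      datasource_ok - False when the dashboard points at a missing datasource
    """
    report = {"stale": [], "broken": [], "ok": []}
    for dash in dashboards:
        if not dash["datasource_ok"]:
            report["broken"].append(dash["uid"])
        elif dash["last_viewed"] is None or now - dash["last_viewed"] > STALE_AFTER:
            report["stale"].append(dash["uid"])
        else:
            report["ok"].append(dash["uid"])
    return report


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
sample = [
    {"uid": "a", "last_viewed": now - timedelta(days=3), "datasource_ok": True},
    {"uid": "b", "last_viewed": now - timedelta(days=200), "datasource_ok": True},
    {"uid": "c", "last_viewed": None, "datasource_ok": True},
    {"uid": "d", "last_viewed": now - timedelta(days=1), "datasource_ok": False},
]
report = classify_dashboards(sample, now)
```

The resulting buckets are the sort of summary that could be attached to a Jira ticket for a manager-driven cleanup.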
Prometheus / Thanos
- Scale: Thanos cluster with global querier (to avoid having scattered Thanos instances in Grafana)
- Topology: Storage gateway cache + read gateway cache across 4 different clusters
- Storage: ~50TB managed, long-term (12-month) bucket
- Nodes: ~200 replica nodes (8GB RAM, 3 CPU each)
- Project focus: Maintaining and adding long-term storage for ML toolbox forecasting
Loki
- Scale: 4 clusters, ~400 pods total, ~3TB RAM + significant CPU
- Cache: 2TB Memcached cluster for 24h hot storage
- Work: Fine-tuning, installing cluster cache, experimenting with metric-splitting configurations
- Recording rules: Converted TB/day load-balancer logs into metrics
Vector
- Migration: Migrated from Fluentd to Vector due to Fluentd performance issues
- Scope: Full pipeline migration + log transformations
ELK
- Full stack: Beats, Kafka, Logstash (heavy transformations, also shipped to Splunk HEC), Elasticsearch, APM + RUM
- Scale: 200TB across 3 datacenters, high-availability setup
- Users: Team of 20 developers
- Compliance: Different retention policies for gambling regulator purposes
2. Kubernetes & Platform
Kubernetes
- Level: Primarily user-level (not cluster administrator), operates apps via CI/CD
- Workflow: Jenkins + Makefile → build → push to JFrog → merged and deployed via ArgoCD
- Helm: Created Helm charts for own apps (Graphia, search query exporter, Grafana Housekeeping)
- Apps deployed: Graphia (SRE agent), search query exporter, Grafana Housekeeping tool
Search Query Exporter (app built at FDJ)
- Purpose: Measure real user-experience search performance latency for SLO/SLIs
- How it works: Queries Splunk, Thanos, and Loki across multiple time ranges (10m, 1h, 6h, 24h, 7d, 30d)
- Output: Checks how the platform responds over time, compares against SLO budgets, flags when too slow
- Design: Easily configurable, multi-backend querying
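The measurement loop described above could look roughly like this. The time ranges come from the notes; the budget values, function names, and the `run_query` callable (a stand-in for a real Splunk, Thanos, or Loki client) are assumptions for illustration only.

```python
import time

# Time ranges come from the notes above; the budgets are made-up examples.
SLO_BUDGET_SECONDS = {"10m": 1.0, "1h": 2.0, "6h": 5.0, "24h": 10.0, "7d": 30.0, "30d": 60.0}


def measure(run_query, time_range, clock=time.perf_counter):
    """Return how long run_query(time_range) takes, in seconds."""
    start = clock()
    run_query(time_range)
    return clock() - start


def check_backend(run_query, budgets=SLO_BUDGET_SECONDS, clock=time.perf_counter):
    """Time one backend across every range and flag budget breaches."""
    results = {}
    for time_range, budget in budgets.items():
        latency = measure(run_query, time_range, clock)
        results[time_range] = {
            "latency": latency,
            "budget": budget,
            "too_slow": latency > budget,
        }
    return results
```

Passing the clock in as a parameter keeps the check testable without real sleeps; one `check_backend` call per backend gives the multi-backend view.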
3. CI/CD
- TODO: Still need detail on Jenkins, GitLab CI, GitHub Actions specifically
4. Automation & IaC
PXE / Proxmox
- Automated PXE installation for Proxmox nodes using Ansible
Full VM Delivery Pipeline (“Kubernetes without Kubernetes”)
The complete flow, end to end:
Phase 1 — Request (Git PR)
- Create a tfvars with VM config. ~80% of variables are shared across providers, ~20% provider-specific (e.g. network on OpenStack, flavors)
- Push to git branch, open PR
Phase 2 — Plan (Jenkins)
- Jenkins detects the directory, picks the correct Terraform workspace (env-scoped to avoid massive plan/apply times)
- Runs `terraform plan` on the PR for review
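As an illustration of the directory-to-workspace step, here is a sketch that maps changed files in a PR to the workspaces that need a plan. The `envs/<workspace>/<name>.tfvars` layout and every name in the sample are hypothetical; the actual repository structure is not described in these notes.

```python
from pathlib import PurePosixPath


def workspaces_for_changes(changed_files, root="envs"):
    """Map changed tfvars paths to the sorted list of workspaces to plan.

    Assumes a hypothetical layout of <root>/<workspace>/<anything>.tfvars.
    """
    workspaces = set()
    for name in changed_files:
        path = PurePosixPath(name)
        parts = path.parts
        if len(parts) >= 3 and parts[0] == root and path.suffix == ".tfvars":
            workspaces.add(parts[1])
    return sorted(workspaces)


changed = [
    "envs/prod-dc1/es-node-12.tfvars",
    "envs/staging/web-03.tfvars",
    "README.md",
    "envs/prod-dc1/es-node-13.tfvars",
]
to_plan = workspaces_for_changes(changed)
```

Each resulting name would then be selected with `terraform workspace select` before running the plan, which keeps every plan scoped to one environment.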
Phase 3 — Apply (on merge)
- On merge, Jenkins runs `terraform apply`
- State backend: Consul (for state locking)
- VM is created → automatically registers in Consul and NetBox
Phase 4 — Configuration (Ansible auto-discovery)
- The tfvars includes an `ansible_playbook` parameter (e.g. “elasticsearch”)
- On VM creation, Jenkins sets a NetBox custom field: `ansible_playbook: elasticsearch`
- Another custom field: `configured: false` (default)
- Jenkins launches a meta-playbook that uses dynamic inventory to find all VMs where `configured: false`
- Reads the `ansible_playbook` field, runs that specific playbook on the VM
- On success, flips `configured: true` in NetBox
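The auto-discovery loop in Phase 4 can be sketched as follows. The in-memory `inventory` list stands in for NetBox records and `run_playbook` for the Ansible invocation; both are assumptions, as are the VM and playbook names in the sample (apart from "elasticsearch", which the notes mention).

```python
def run_pending(inventory, run_playbook):
    """Configure every VM still marked configured=False.

    inventory    - list of dicts standing in for NetBox records (assumed shape)
    run_playbook - callable(playbook_name, vm_name) -> bool, True on success
    """
    configured_now = []
    for vm in inventory:
        if vm["configured"]:
            continue  # already handled by a previous run
        if run_playbook(vm["ansible_playbook"], vm["name"]):
            vm["configured"] = True  # flip the flag only on success
            configured_now.append(vm["name"])
    return configured_now


inventory = [
    {"name": "es-01", "ansible_playbook": "elasticsearch", "configured": False},
    {"name": "web-01", "ansible_playbook": "nginx", "configured": True},
    {"name": "es-02", "ansible_playbook": "elasticsearch", "configured": False},
]
calls = []


def fake_runner(playbook, vm_name):
    """Record the call and simulate a failure on es-02."""
    calls.append((playbook, vm_name))
    return vm_name != "es-02"


done = run_pending(inventory, fake_runner)
```

Because the flag only flips on success, a failed VM stays at `configured: false` and is picked up again on the next run.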
Phase 5 — Service Registration (Consul → HAProxy)
- Consul service list feeds HAProxy for automatic backend host registration
- New VM is live and receiving traffic with zero manual intervention
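A rough sketch of turning a service list into HAProxy backend entries. The notes do not say which mechanism fed HAProxy (tools such as consul-template exist for this), so the input shape and the function name here are assumptions; only the `backend` and `server ... check` lines follow standard HAProxy configuration syntax.

```python
def render_backend(name, services):
    """Render an HAProxy backend section from a list of service instances.

    services is an assumed list of dicts with 'node', 'address', and 'port'.
    """
    lines = [f"backend {name}"]
    # Sort by node name so the rendered config is stable between runs.
    for svc in sorted(services, key=lambda s: s["node"]):
        lines.append(f"    server {svc['node']} {svc['address']}:{svc['port']} check")
    return "\n".join(lines)


instances = [
    {"node": "web-02", "address": "10.0.0.12", "port": 8080},
    {"node": "web-01", "address": "10.0.0.11", "port": 8080},
]
config = render_backend("web", instances)
```

Stable ordering matters here: it avoids reloading HAProxy when the service list is unchanged but returned in a different order.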
Phase 6 — Monitoring (Centreon auto-registration)
- Built a custom Terraform provider for Centreon (`terraform-provider-centreon`, open-sourced on GitHub)
- VMs are automatically registered in Centreon monitoring as part of the pipeline
- Built because no Centreon Terraform provider existed at the time
Context: No Kubernetes available, so they built orchestration primitives from existing tooling to handle hundreds of VMs across datacenters during migrations.
Jenkins + Terraform (Atlantis-like)
- Part of the pipeline above: Jenkins auto-runs terraform apply on merge with diff review
Terraform Multi-Provider Module System
- Created a module template system using the same variable files across providers
- Supported providers: vSphere, Proxmox, OpenStack, NetBox
- ~80% shared variables, ~20% provider-specific
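The shared/provider-specific split can be pictured as a simple merge. This is only an illustration in Python of the idea; the real system uses Terraform variable files, and every variable name and value below is hypothetical.

```python
def build_vars(shared, provider_overrides, provider):
    """Merge shared VM variables with one provider's specific ones."""
    if provider not in provider_overrides:
        raise KeyError(f"no overrides defined for provider {provider!r}")
    merged = dict(shared)  # copy so the shared block is never mutated
    merged.update(provider_overrides[provider])
    return merged


# Hypothetical variables: the shared block is the larger part of the config.
shared = {"name": "es-01", "cpu": 4, "memory_gb": 16, "ansible_playbook": "elasticsearch"}
overrides = {
    "openstack": {"flavor": "m1.large", "network": "private-net"},
    "proxmox": {"target_node": "pve-01", "storage": "ceph-rbd"},
}
openstack_vars = build_vars(shared, overrides, "openstack")
```

Moving a VM to another provider then means swapping only the small provider-specific block while the shared part stays untouched.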
5. Cloud & Infrastructure
AWS
- Usage: Deployed ECS on EC2, used CloudWatch
- Cost optimization: Used CUR (Cost and Usage Report) to find and solve budget issues, helped teams build dashboards
- Self-assessment: Not very strong, needs more depth. Being honest about this.
OpenStack
- Admin: Deployed a small 8-node cluster with Ceph, then migrated to Proxmox for ease of use (“it was a bit of a machine”)
- User: Heavy Terraform user on OpenStack — deployed across multiple DCs (volumes, images, etc.)
- Positioning: Knows the platform well as a consumer, less as an admin
OVH Cloud
- Usage: Used like Hetzner — VMs and dedicated servers for client deployments
Proxmox
- Scale: Deployed 100+ Proxmox nodes
- Automation: Automated installation via PXE + Ansible
- Storage backends: NFS, ZFS, SAN, NVMe-oF (NVMe over Fabrics), Ceph (managed by Proxmox)
- Core infrastructure skill — deep operational experience
Ceph
- Deployment: Big Ceph RBD and Ceph S3 for Loki/OpenStack, deployed in VMs with cephadm
- Scope: Helped deploy and operate
Linux
- Standard sysadmin-level, “nothing too fancy”
6. Data & Messaging
PostgreSQL
- Clustering: Installed Patroni + etcd for HA clustering
- Backup: pgBackRest
- Frontend: Dedicated HAProxy frontend for the DBs
- Level: Solid operational experience
MySQL
- Installed and managed several times, no clustering experience
SQL Server
- Same level as MySQL — install and manage
Kafka
- Scale: 5 Kafka nodes
- Use case: Buffer for Logstash → Elasticsearch pipelines
- Topics: Different topics per pipeline, fine-tuned for the workload
- Coordination: Zookeeper (maintained it — “was a pain but it’s powerful/complicated”)
Redis
- Tried OSS clustering, wasn’t good enough for the requirements
- Switched to Couchbase (paid) instead
RabbitMQ
- Messaging and message routing, standard cluster usage, nothing fancy
Couchbase
- Impact: Big part of unibet.fr performance gains
- Scale: 10 servers x 64GB RAM
- Design: Multiple indexes, split and sharded
- Role: Managed everything AND advised developers on how to use it for performance gains
- Note: Replaced Redis because Redis OSS clustering wasn’t cutting it
7. Networking & Delivery
Load Balancers
- Breadth: F5, F5 Cloud, Nginx, HAProxy (with BGP), Envoy, Traefik
- DDoS mitigation project: Big project with F5 Cloud using rate limiting and various tools
- HAProxy + BGP: Used for the Consul-fed auto-registration backend system (from VM pipeline)
DNS
- Tools: BIND, PowerDNS, Cloudflare DNS
- Notable: Multi-DC BIND public DNS with secured zone transfers (TSIG keys)
- Scale: Nothing massive, but solid
VPN
- Tech: WireGuard, IPsec, GRE, OpenVPN
- Use cases: Router-to-router and client VPN
- Level: Standard usage, no major project
Firewalls
- Big project: Migrated from a SPOF FortiGate perimeter firewall to BGP routers, pushing iptables/nftables rules to 1000+ VMs via Ansible
- Rationale: Moved firewalling to the VM level instead of the router (eliminated single point of failure)
- Hardware experience: Cisco ASA, Firepower, FortiGate (clustering, VDOM)
Networking (Hardware)
- Switches/Routers managed:
- Cisco Nexus 3000 (installation + management)
- Juniper MX204 (BGP, limited experience)
- Juniper EX series
- NDR / Traffic Analysis: Installed ExtraHop NDR with Prismatic TAP (network traffic mirroring for analysis)
- Level: Solid sysadmin-adjacent networking, not a dedicated network engineer
8. Programming & AI
Go
- Limited experience, mainly the Centreon Terraform provider
Python
- Built multiple apps (SSHPlex, Search Query Exporter, Grafana Housekeeping, Graphia)
- Full CI/CD lifecycle: build, deploy, publish to PyPI
- Strong operational Python, not a software engineer
Bash
- Heavy scripting for system installs and automation glue
HCL
- Extensive Terraform + Consul usage
YAML / JSON
- Ansible, Python configs, standard tooling
AI / LLM Infrastructure
- GPU deployment: Built and installed 6 GPUs in datacenter across 2x 2U servers (4-GPU capability each)
- Serving: vLLM, Python client
- RAG pipeline: Embedder model + 1000+ PDFs + PostgreSQL vector DB + GPT OSS 20B
- Agent: Built a full-loop agent with safeguards and RBAC (team members get limited resource access)
- Graphia: SRE agent for Grafana diagnosis — RBAC-aware, MCP-based. Arch diagram incoming from work laptop.
MCP (Model Context Protocol)
- Installed many MCP servers
- Built a Graphia MCP for devs/infra engineers (same RBAC, usage tracking)
9. Security
Unibet.fr Scope Responsibility
- Vulnerability management: Followed up on Tenable/Nessus vulns and patches
- Architecture review: Checked PoCs and provided security inputs for new implementations (BIND, Packer, etc.)
- DDoS mitigation: Installed F5 Cloud as mitigation layer
- Network monitoring: Installed ExtraHop + alerts (DB dump detection, network anomalies)
10. Physical / Datacenter
Rack Installation
- Installed 4-5 full racks from scratch: network → server → SAN for virtualization
- Knows cabling, on-site intervention
NetBox + HA Architecture Design
- Managed NetBox as source of truth
- Designed HA Proxmox architectures with LACP/EVPN/vPC
11. Learning Gaps (self-identified)
- AWS / cloud depth
- BGP routing
- eBPF
- Full rack design best practices (A to Z)
- FinOps
- Formal SRE incident management processes
- Tracing (Grafana Alloy, OpenTelemetry) — did a little bit
12. Team & Leadership
Working Style
- 80% solo: Most projects owned end-to-end
- 20% team/project lead: For Proxmox and infrastructure projects
- Splits tasks, does knowledge transfer
- Uses RAPID framework for validation
- Runs Spike sessions before starting to scope properly
- Weekly meetings to check blockers and progress
- Still does ~50% of the hands-on work even when leading
13. Scale Numbers (Consolidated)
VM Fleet
- Peak: ~3000 VMs under management
- Migrations: Recreated most of them for cross-DC migrations (vSphere → Proxmox, then some → OpenStack, then OpenStack DC → OpenStack DC)
Datacenters
- Org scale: 30-40 DCs total, used 10 for replications (OpenStack, mostly usage not admin)
- Directly managed: 2 DCs / 8 racks at one point
- VINC era: 4 racks / 30 nodes
Ingestion (The Real Numbers)
- Loki: 8 TB/day
- Elasticsearch: 2 TB/day
- Thanos: 1 TB/day
- Combined: ~11 TB/day
14. War Stories
The Champions League Final Incident
- Context: Gambling platform — when a match settles (especially Champions League final), 10,000+ users connect within the SAME MINUTE. Business-driven traffic spike, not a DDoS, but hits like one.
- The change: Migrated firewall to nftables. Worked fine for 30 days.
- The incident: During a Champions League final, massive user spike. nftables conntrack table filled up → some users couldn’t connect, others could. TCP window times were extremely long.
- The fix: Killed all HAProxy and nftables, bumped nftables conntrack limits, set aggressive TCP windows and TTL values.
- How they found it: Grafana dashboard caught the anomaly.
- Lesson: Conntrack sizing matters for bursty traffic. Default values don’t survive gambling-scale spikes.
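A back-of-the-envelope way to size the conntrack table for a burst like this is Little's law: tracked entries are roughly the new-connection rate multiplied by how long each entry stays in the table. The per-user connection count, entry lifetime, and headroom factor below are illustrative assumptions, not figures from the incident.

```python
import math


def conntrack_entries_needed(new_conns_per_sec, entry_lifetime_sec, headroom=2.0):
    """Estimate conntrack table size via Little's law (rate x lifetime)."""
    return math.ceil(new_conns_per_sec * entry_lifetime_sec * headroom)


# Illustrative numbers only: 10,000 users arriving within one minute,
# each opening an assumed 6 connections, with entries lingering 120 s.
rate = 10_000 * 6 / 60  # new connections per second
needed = conntrack_entries_needed(rate, entry_lifetime_sec=120)
```

The estimate is what gets compared with the `net.netfilter.nf_conntrack_max` sysctl. Shortening conntrack timeouts reduces the lifetime term, which is why tightening them helps as much as raising the limit.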
16. Migration Projects (Detail)
a. vSphere → Proxmox Migration
- Driver: vSphere licenses costly + EoL
- Approach: Installed Proxmox, converted/rebuilt VMs node by node from vSphere
- Storage: Pure Storage SAN with NVMe-oF (multipathd)
- Backup: Found and integrated Proxmox Backup Server as part of the migration
b. Proxmox 4 → Proxmox 8 Upgrade
- Old cluster: 6 nodes, Proxmox 4
- New cluster: 4 nodes, Proxmox 8
- Challenge: OpenVZ containers to convert with zero or low downtime
- Solution: Used NFS as a buffer storage during the migration
c. OpenStack DC → OpenStack DC Migration
- Driver: Source DC got decommissioned
- Approach: Used Terraform to rebuild VMs, Ansible to install, then switched HAProxy backends to new DC bit by bit
d. Elasticsearch 200TB Cluster Migration
- Scope: 200TB ES cluster, DC to DC
- Approach: Rebuilt nodes 1 by 1 in the new DC, waited for shard rebalancing or forced shards to new nodes
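The notes do not say how shards were forced onto the new nodes. One standard way is Elasticsearch's cluster-level allocation filtering, which tells the cluster to move shards off named nodes. The sketch below only builds the request body for `PUT /_cluster/settings`; the node names are hypothetical.

```python
import json


def drain_nodes_body(node_names):
    """Build the body for PUT /_cluster/settings that drains the given nodes."""
    return {
        "persistent": {
            # Shards are relocated away from any node whose name is listed here.
            "cluster.routing.allocation.exclude._name": ",".join(node_names)
        }
    }


body = drain_nodes_body(["es-old-01", "es-old-02"])
payload = json.dumps(body)
```

Once the excluded nodes hold no shards they can be shut down, and the setting is cleared by sending `null` for the same key.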
17. Graphia — Pain Points Solved
The Problem
- Grafana dashboards require deep expertise to build well (“need a PhD”)
- Companies have 1000+ datasources — onboarding is slow
- Query languages (LogQL, PromQL, Splunk SPL) are a barrier
- Correlating metrics and logs across backends is manual and slow
- Jumping on an alert incident means starting from scratch every time
What Graphia Does
- Removes the need to be a dashboard expert
- Abstracts away datasource complexity (fast onboarding)
- Eliminates query language knowledge requirement
- Correlates data across metrics and logs
- Prepares a battle plan when jumping on an alert incident
- RBAC-aware: team members get access only to their permitted resources
- Usage tracking built in
18. Honesty Notes (from Sabri)
- The 11TB/day observability stack was TEAM work, not solo. Don’t oversell as “I built this alone.”
- Loki was mostly already there — his contribution was fine-tuning + Fluentd→Vector migration + recording rules
- He didn’t deploy 100% of the 3000 VMs himself — he BUILT THE TOOLING that enabled it
- Position as “operated and optimized” or “built the automation for” rather than “single-handedly ran”
- He’d be nervous taking on a solo “build me a new rack that ingests 10TB/day” mission
15. The “Why Freelance” Story
Kindred Acquisition
- Before the Kindred acquisition (France specifically): startup-like culture, 5 people in infra, 10 in dev. Things moved fast.
- After acquisition: all process and drama. Speed died.
- Took a voluntary departure plan (“Plan de départ volontaire”) — the financial runway made the exit possible.