RetakeData

Observability Stack Audit

You have dashboards nobody trusts, alerts that fire at 3 AM for no reason, and a Loki bill that keeps climbing. The stack works, technically. But it is not working for your team.

I operated an 11 TB/day observability stack across Loki (400 pods, 3 TB RAM), Elasticsearch (200 TB, 3 DCs), and Thanos (50 TB, 200 nodes). My specific work: migrating Fluentd to Vector for performance, fine-tuning Loki with Memcached clusters, building recording rules that converted TB/day of load-balancer logs into queryable metrics, and maintaining 12-month Thanos storage for ML forecasting.

VM Migration Factory

Most teams operating hundreds of VMs do it manually: ticket, provision, configure, register, repeat. Every VM is slightly different. Every migration is a project. Every provider has its own quirks that someone has to remember.

This mission replaces that with a pipeline. I built this exact system to manage 3000+ VMs across 4 providers and 10 datacenters, without Kubernetes. You open a PR with a tfvars file, merge it, and the VM goes from nothing to configured, monitored, and receiving traffic.

Proxmox / Ceph HA Platform

VMware licensing changes pushed a lot of teams to look for alternatives. Proxmox is the answer, but a Proxmox cluster that survives real failure scenarios needs proper Ceph design, network segmentation, quorum tuning, and automated provisioning.

I deployed 100+ Proxmox nodes with PXE automation across every storage backend: ZFS, NFS, SAN, NVMe over Fabric, and Ceph. I led vSphere-to-Proxmox migrations (Pure Storage SAN, NVMe-oF, MultipathD), Proxmox 4-to-8 upgrades with near-zero downtime using NFS buffer, and designed HA architectures with LACP/EVPN/VPC.

On-Prem AI for Operations

Every team wants AI-assisted operations. Not every team can send their logs, metrics, and incident data to OpenAI. If you operate under GDPR constraints, data sovereignty requirements, or strict security policies, on-prem AI is not optional.

I built this exact setup: 6 GPUs across 2 servers in the datacenter, vLLM serving, a private RAG pipeline over 1000+ PDFs with PostgreSQL/pgvector, and Graphia, an RBAC-aware SRE agent that lets engineering teams query their Grafana infrastructure without needing to know LogQL or PromQL.

Building SSHplex: A Modern TUI for SSH Connection Multiplexing

Mon, 09 Jun 2025 00:00:00 +0000

The Problem

At Kindred, we relied on Remote Desktop Manager (RDM) to manage connections to our Windows and Linux hosts for broadcasting commands and checking system states. However, licensing costs were high and every new host required manual database entry. After finding no suitable alternatives, I decided to build my own solution.

Solution Design

SSHplex needed three core capabilities: a modern terminal UI with host selection and bulk operations, flexible data source integration (NetBox and Ansible inventory), and terminal multiplexer support with session persistence for background tasks.
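As a rough illustration of the "flexible data source integration" idea, the sketch below merges host lists from several inventories into one deduplicated list. It is not SSHplex's actual code: the `merge_hosts` function, the `name` and `ip` fields, and the "first source wins" rule are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def merge_hosts(*sources: Iterable[dict]) -> List[dict]:
    """Merge host records from several sources, deduplicated by name.

    Hypothetical rule: when the same host name appears in more than one
    source, the record from the earliest source is kept.
    """
    merged: Dict[str, dict] = {}
    for source in sources:
        for host in source:
            # setdefault keeps the first record seen for a given name.
            merged.setdefault(host["name"], host)
    # Sort by name so the selection list is stable between runs.
    return sorted(merged.values(), key=lambda h: h["name"])
```

Passing a NetBox-derived list first and an Ansible-derived list second would make NetBox the preferred source of truth under this assumed rule.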

Building SSHplex: More details

Mon, 09 Jun 2025 00:00:00 +0000

The Problem

At Kindred, we relied on Remote Desktop Manager (RDM) to manage connections to our Windows and Linux hosts. I primarily used it to connect to multiple VMs simultaneously and broadcast commands, either to check system states or to run quick commands where Ansible ad-hoc was too slow or I needed immediate feedback.

However, we faced two major issues:

Licensing costs: The license was expiring and renewal was expensive
Maintenance overhead: Every new host had to be manually added to the RDM SQL Server database

After searching for alternatives, I found nothing that met our specific needs. So I decided to build my own solution.

AI Transformed My Journey as a System Engineer: Developing a Terraform Provider for Centreon

Tue, 25 Feb 2025 00:00:00 +0000

As a day-to-day Terraform user with a decent foundation in Python, I never imagined that developing a Terraform provider would significantly impact my system engineering skills. Yet, leveraging AI tools enabled me to build a provider for Centreon API V2 and step into the Go ecosystem—an essential leap for my work at Kindred.

Overview

For years, there was a significant gap in available tools: the only existing Centreon Terraform provider was built around the legacy CLAPI, which had not been updated in over five years. While there was also a V1 (distinct from CLAPI), it lacked the features needed for modern infrastructure management. My need for an up-to-date solution at Kindred pushed me to create a new provider based on the latest Centreon API V2, ensuring future-proof functionality and seamless integration with current workflows.

Hello World

Mon, 24 Feb 2025 00:00:00 +0000

Welcome to my blog! I’m a French systems engineer with a long-standing passion for systems, security, and networking that dates back to my younger years. What started as curiosity has evolved into a fulfilling career and continuous learning journey.

About Me

I’ve built my career around understanding and implementing robust system architectures, but I believe there’s always room to grow. Recently, I’ve been diving deeper into programming with a particular focus on Go and Python. Despite being what some might call a “late learner” in the programming world, I’m determined to master these skills to complement my systems expertise.

About

Wed, 01 Jan 2025 00:00:00 +0000

SMJED is the independent infrastructure engineering practice of Sabri MJAHED, SRE with 10+ years across sysadmin and platform engineering roles. We build on-prem, private-cloud, and hybrid platforms for teams that need full control over their infrastructure and their data — not a dependency on someone else’s cloud.

The approach

Generalists by design. Real infrastructure problems don’t stay inside one specialty. They cross observability, automation, storage, networking, and security in ways that require someone who can operate across all of them. Rack and cable on Monday, debug a Loki ingestion bottleneck on Tuesday, architect a Proxmox HA cluster on Wednesday.

Skills Context — Sabri’s Real Experience

Working doc for posts, missions, and narrative refinement. NOT part of the Hugo build. Reference-only.

1. Observability

Grafana

Scale: 2000 users across 10 instances
Deployment: Both on-prem (deb packages + DBs) and fully containerized via Helm
Datasources: Elasticsearch, VictoriaLogs, Splunk, Loki, Prometheus metrics
Dashboarding: Heavy transformation work across multiple datasources
Tooling built: “Grafana Housekeeping” — gathers all resources (users, dashboards, alerts, contact points, datasources), checks for stale/broken/unused, sends reporting to Jira for manager-driven cleanups
MCP: Used Grafana MCP
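The stale/broken/unused check described for "Grafana Housekeeping" could be sketched roughly as follows. This is a minimal illustration, not the tool's real logic: the `classify_dashboard` function, the `datasource` and `last_viewed` fields, and the definitions of each category are assumptions, and no Grafana API is called.

```python
from datetime import datetime, timedelta, timezone
from typing import Set


def classify_dashboard(
    dashboard: dict,
    known_datasources: Set[str],
    now: datetime,
    stale_after: timedelta,
) -> str:
    """Return 'broken', 'unused', 'stale', or 'ok' for one dashboard record.

    Assumed definitions (hypothetical):
      broken: references a datasource that no longer exists
      unused: has never been viewed
      stale:  last viewed longer ago than `stale_after`
    """
    if dashboard["datasource"] not in known_datasources:
        return "broken"
    last_viewed = dashboard.get("last_viewed")
    if last_viewed is None:
        return "unused"
    if now - last_viewed > stale_after:
        return "stale"
    return "ok"
```

A report for Jira would then be a matter of grouping dashboards by the returned label; that step is left out here.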

Prometheus / Thanos

Scale: Thanos cluster with global querier (to avoid having scattered Thanos instances in Grafana)
Topology: Storage gateway cache + read gateway cache across 4 different clusters
Storage: ~50TB managed, long-term (12-month) bucket
Nodes: ~200 replica nodes (8GB RAM, 3 CPU each)
Project focus: Maintaining and adding long-term storage for ML toolbox forecasting

Loki

Scale: 4 clusters, ~400 pods total, ~3TB RAM + significant CPU
Cache: 2 TB Memcached cluster for 24h hot storage
Work: Fine-tuning, installing cluster cache, experimenting with metric-splitting configurations
Recording rules: Converted TB/day load-balancer logs into metrics
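To show what "converting logs into metrics" means in principle, here is a small Python sketch that reduces raw load-balancer lines to per-minute request counts by status code. In Loki this aggregation is done by recording rules written in LogQL, not by Python; the log format (`<ISO timestamp> <status> ...`) and the `requests_per_minute` function are assumptions for the example.

```python
from collections import Counter
from typing import Iterable


def requests_per_minute(lines: Iterable[str]) -> Counter:
    """Count log lines per (minute, status) pair.

    Assumes each line starts with an ISO 8601 timestamp followed by an
    HTTP status code, e.g. "2025-01-01T12:00:05Z 200 GET /".
    """
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        timestamp, status = parts[0], parts[1]
        minute = timestamp[:16]  # truncate to "YYYY-MM-DDTHH:MM"
        counts[(minute, status)] += 1
    return counts
```

The point of the technique is the size reduction: many log lines collapse into one small counter per minute and status, which is far cheaper to store and query over long ranges.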

Vector

Migration: Migrated from Fluentd to Vector due to Fluentd performance issues
Scope: Full pipeline migration + log transformations

ELK

Full stack: Beats, Kafka, Logstash (heavy transformations, also shipped to Splunk HEC), Elasticsearch, APM + RUM
Scale: 200TB across 3 datacenters, high-availability setup
Users: Team of 20 developers
Compliance: Different retention policies for gambling regulator purposes

2. Kubernetes & Platform

Kubernetes

Level: Primarily user-level (not cluster administrator), operates apps via CI/CD
Workflow: Jenkins + Makefile → build → push to JFrog → merged and deployed via ArgoCD
Helm: Created Helm charts for own apps (Graphia, search query exporter, Grafana Housekeeping)
Apps deployed: Graphia (SRE agent), search query exporter, Grafana Housekeeping tool

Search Query Exporter (app built at FDJ)

Purpose: Measure search latency as users actually experience it, as input for SLOs/SLIs
How it works: Queries Splunk, Thanos, and Loki across multiple time ranges (10m, 1h, 6h, 24h, 7d, 30d)
Output: Checks how the platform responds over time, compares against SLO budgets, flags when too slow
Design: Easily configurable, multi-backend querying
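The behaviour described above (time the same query over several ranges, compare against a budget, flag slow results) could be sketched like this. It is a simplified illustration, not the exporter's code: `run_probes`, the `Probe` tuple, the per-range budgets, and the callable-per-backend interface are all hypothetical, and real backends would be queried over their HTTP APIs.

```python
import time
from typing import Callable, Dict, List, NamedTuple

# Time ranges taken from the notes above.
TIME_RANGES = ["10m", "1h", "6h", "24h", "7d", "30d"]


class Probe(NamedTuple):
    backend: str
    time_range: str
    seconds: float
    budget: float
    too_slow: bool


def run_probes(
    backends: Dict[str, Callable[[str], object]],
    budgets: Dict[str, float],
    clock: Callable[[], float] = time.perf_counter,
) -> List[Probe]:
    """Time one query per backend and time range, and flag budget overruns.

    `backends` maps a backend name to a function that runs a query for a
    given range; `budgets` maps each range to its allowed latency in seconds.
    """
    results: List[Probe] = []
    for name, query in backends.items():
        for time_range in TIME_RANGES:
            start = clock()
            query(time_range)
            elapsed = clock() - start
            budget = budgets[time_range]
            results.append(Probe(name, time_range, elapsed, budget, elapsed > budget))
    return results
```

Injecting the `clock` keeps the timing logic testable without real backends; an exporter would then expose `seconds` and `too_slow` as metrics.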

3. CI/CD

TODO: Still need detail on Jenkins, GitLab CI, GitHub Actions specifically

4. Automation & IaC

PXE / Proxmox

Automated PXE installation for Proxmox nodes using Ansible

Full VM Delivery Pipeline (“Kubernetes without Kubernetes”)

The complete flow, end to end:

My Resume

Summary

Senior freelance SRE with 10 years of experience operating infrastructure that can’t quietly fail. I work across high-volume observability, Kubernetes platform engineering, on-prem Proxmox/Ceph clusters, and AI-assisted SRE tooling.

Open Source Contributions

🌟 SSHplex

Built and maintained an open source terminal UI for SSH connection multiplexing, designed for infrastructure teams that need fast host discovery, bulk operations, and persistent sessions.

GitHub Repository: SSHPlex
Blog Post: Building SSHplex
Combines NetBox, Ansible, Consul, and static lists as sources of truth for hosts and devices
Supports three mux backends: tmux standalone, tmux + iTerm2, and native iTerm2 on macOS
Provides broadcast commands and persistent sessions to replace expensive legacy tooling

Experience

Kindred France | Site Reliability Engineer | 2021 - Present

Kubernetes, Grafana, Loki, Thanos, Vector, Jenkins, GitLab CI, Terraform

Progressed from System Engineer to Site Reliability Engineer, shifting focus from infrastructure automation toward platform reliability, observability, diagnostics and performance.
Operate observability workflows around Kubernetes with Thanos, Loki, Grafana, and Vector as core technologies.
Built Grafana Housekeeping, a tool to diagnose stale and broken Grafana resources, reducing dashboard/config drift and improving platform hygiene.
Built a Search Query Exporter to diagnose query slowness and establish SLOs across Thanos and Loki.
Designed an SLO Dashboard Framework to standardize service-level visibility and make reliability reporting easier to adopt across teams.
Building Graphia, a domain-specific SRE agent for Grafana diagnosis, with RBAC-aware behavior, MCP-based diagnosis flows, and safeguards for enterprise operations.
Daily hands-on work with Helm charts, Argo CD, container image lifecycle, Jenkins, GitLab, AWS CloudWatch, and CUR2 cost analysis.

Current stack and ownership

Area	Components/Tools
Observability	Grafana, Loki, Thanos, Vector, AWS CloudWatch
Platform Engineering	Kubernetes, Helm, Argo CD, Container Images
CI/CD & Automation	Jenkins, GitLab CI, Terraform, Ansible
Data & Storage	Kafka, Redis, PostgreSQL, Microsoft SQL, Couchbase
Programming & AI	Go, Python, Bash, AI, MCP

Previous impact within the same company

Led the automated deployment of VMs and applications through CI/CD, enabling multiple deployments per day.
Used Terraform to deploy across 10 datacenters and 4 providers (OpenStack, Proxmox, vSphere, NetBox) from shared templates.
Used Ansible for VM initialization and application deployment, with Consul feeding service pools for HAProxy and Prometheus.
Operated multi-cluster observability at multi-TB/day ingestion across logs, metrics, and traces, with Kafka pipelines feeding SIEM, logging, EDR, APM, and uptime monitoring.
Integrated a highly available Proxmox cluster across 4 racks and 2 datacenters with Ceph, including PXE-based automation and 25 Gb networking per host.
Accountable for the French security scope, driving remediation work for vulnerabilities and production hardening.

VINC | System engineer | 2019 - 2021

Proxmox, DNS, High Availability

Architected the new platform with new BGP routers and firewalls.
Managed Proxmox cluster across 2 datacenters.
Responsible for SLA and client communication during production incidents.
Implemented websites around client needs.
Implemented a new DNS stack with high availability in mind.

Multi-Visp / Azuria | System administrator | 2017 - 2019

Network Infrastructure, VPN, Datacenter Management, WiFi

Installed complete new racks in Telehouse2.
Cable management between two rooms.
Installed managed Wi-Fi equipment.
Implemented high-availability multi-datacenter VPN services.

Contact Information

Email: contact@smjed.net

Open Source

I treat open source as an extension of my SRE and platform engineering work: build tools around real operational pain points, keep them practical, and make them useful beyond my own environment.

Technical Expertise

As a senior freelance SRE, my work spans high-volume observability (11 TB/day across Loki, Elasticsearch, and Thanos), infrastructure automation (the VM delivery pipeline that managed 3000+ VMs), on-prem Proxmox/Ceph at scale (100+ nodes), and AI-assisted SRE tooling (Graphia, GPU serving, RAG pipelines). The stack below reflects what I run in production today.

Observability & Reliability: Grafana (2000 users, 10 instances), Loki, Thanos, Vector, Elasticsearch, CloudWatch. SLO-oriented diagnostics, recording rules, and platform hygiene at the scale of a regulated gambling platform.
Infrastructure Automation: Terraform multi-provider modules (vSphere, Proxmox, OpenStack), Ansible, Consul, NetBox. I build the tooling that lets small teams operate at fleet scale.
On-prem & Virtualization: Proxmox (100+ nodes, PXE automated), Ceph, ZFS, NFS, SAN, NVMe over Fabric. vSphere migrations, HA cluster design, multi-DC networking.
AI-Assisted Tooling: Graphia (SRE agent for Grafana), vLLM serving, RAG pipelines, MCP integration. RBAC-aware, built for real operations.

Below is a structured view of the technologies I use most: