Mission 01
Observability Stack Audit
Find the blind spots, the cost leaks, and the alert fatigue before they find you.
You have dashboards nobody trusts, alerts that fire at 3 AM for no reason, and a Loki bill that keeps climbing. The stack works, technically. But it is not working for your team.
I operated an 11 TB/day observability stack across Loki (400 pods, 3 TB RAM), Elasticsearch (200 TB, 3 DCs), and Thanos (50 TB, 200 nodes). My work there included migrating from Fluentd to Vector for performance, tuning Loki with Memcached clusters, building recording rules that converted TB/day of load-balancer logs into queryable metrics, and maintaining 12 months of Thanos storage for ML forecasting.
This audit brings that experience to your stack. I map what you have, find what is broken or wasteful, and hand you a prioritized fix list with impact estimates.
What's included
- Ingestion audit: pipeline mapping (Vector/Fluentd/Alloy), label cardinality analysis, identification of hot paths and silent drops
- Query performance: slow dashboard profiling, LogQL/PromQL optimization, indexing strategy review
- Cost analysis: retention vs. value matrix, storage tiering, detection of over-retained and idle data streams
- Alert quality: alert noise reduction, SLO-based alerting design, elimination of flapping rules
- Coverage gaps: critical services with missing or inadequate observability
- Recording rules: convert high-volume log streams into pre-computed metrics for faster, cheaper querying
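To make the label cardinality analysis above concrete, here is a minimal sketch of the idea, not the audit tooling itself. It assumes the label sets of your log streams have already been exported as a list of dictionaries; the function name `label_cardinality` and the sample data are hypothetical.

```python
from collections import defaultdict


def label_cardinality(streams):
    """Count the distinct values seen for each label across all streams."""
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    # Highest-cardinality labels first; ties are broken alphabetically.
    return sorted(
        ((name, len(vals)) for name, vals in values.items()),
        key=lambda item: (-item[1], item[0]),
    )


# Hypothetical sample: one label set per log stream.
sample_streams = [
    {"app": "api", "env": "prod", "request_id": "a1"},
    {"app": "api", "env": "prod", "request_id": "b2"},
    {"app": "web", "env": "prod", "request_id": "c3"},
]

for name, count in label_cardinality(sample_streams):
    print(f"{name}: {count}")
```

In this sample, `request_id` takes a new value on every stream, which is the kind of unbounded label that typically inflates stream counts and index size.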
Deliverables
- Audit report with prioritized findings, cost-impact estimates, and a 30/60/90-day remediation roadmap
- Optimized Loki/Thanos configuration files with before/after benchmarks
- SLO dashboard templates (error budget, burn rate, availability) ready to import
- Recording rules library for common high-volume patterns (load-balancer logs, access logs)
- 30-day handover period with Slack/Teams support for implementation questions
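For readers unfamiliar with the burn rate shown on the SLO dashboards listed above, the following is a minimal sketch of the usual definition: the observed error ratio divided by the error budget (1 minus the SLO target). The function name and the example numbers are hypothetical, and real dashboards would compute this from metrics over a time window rather than from raw counts.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means errors arrive exactly at the rate the SLO allows;
    values above 1.0 mean the budget will run out before the window ends.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total == 0:
        return 0.0  # No traffic, so no budget was consumed.
    error_budget = 1.0 - slo_target
    return (errors / total) / error_budget


# Hypothetical example: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
print(round(rate, 2))
```

With these numbers the error ratio is 0.5% against a 0.1% budget, so the budget is being spent about five times faster than the SLO allows.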