
You have dashboards nobody trusts, alerts that fire at 3 AM for no reason, and a Loki bill that keeps climbing. The stack works, technically. But it is not working for your team.

I operated an 11 TB/day observability stack across Loki (400 pods, 3 TB RAM), Elasticsearch (200 TB, 3 data centers), and Thanos (50 TB, 200 nodes). My specific work: migrating from Fluentd to Vector for performance, tuning Loki with memcached clusters, building recording rules that converted terabytes per day of load-balancer logs into queryable metrics, and maintaining 12 months of Thanos storage for ML forecasting.

This audit brings that experience to your stack. I map what you have, find what is broken or wasteful, and hand you a prioritized fix list with impact estimates.

What's included

  • Ingestion audit: pipeline mapping (Vector/Fluentd/Alloy), label cardinality analysis, identification of hot paths and silent drops
  • Query performance: slow dashboard profiling, LogQL/PromQL optimization, indexing strategy review
  • Cost analysis: retention vs. value matrix, storage tiering, identification of over-retention and idle data streams
  • Alert quality: alert noise reduction, SLO-based alerting design, elimination of flapping rules
  • Coverage gaps: identification of critical services that lack adequate observability coverage
  • Recording rules: conversion of high-volume log streams into pre-computed metrics for faster, cheaper querying
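
To make the label cardinality analysis from the ingestion audit concrete, here is a minimal Python sketch. It assumes the label sets of your log streams have already been exported as a list of dictionaries; the function names and the sample data are hypothetical and only illustrate the idea, not the tooling used in an actual audit.

```python
from collections import defaultdict


def label_cardinality(streams):
    """Return {label name: number of distinct values} across all streams."""
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


def top_offenders(streams, limit=3):
    """Return the labels with the most distinct values, highest first."""
    counts = label_cardinality(streams)
    # Sort by descending count, then by name so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


# Hypothetical sample: four streams where request_id is unbounded.
SAMPLE_STREAMS = [
    {"app": "api", "env": "prod", "request_id": "a1"},
    {"app": "api", "env": "prod", "request_id": "b2"},
    {"app": "web", "env": "prod", "request_id": "c3"},
    {"app": "web", "env": "staging", "request_id": "d4"},
]

if __name__ == "__main__":
    for name, count in top_offenders(SAMPLE_STREAMS):
        print(f"{name}: {count} distinct values")
```

A label whose distinct-value count grows with traffic (here the hypothetical `request_id`) is usually the first candidate to move out of the label set and into the log line itself.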

Deliverables

  • Audit report with prioritized findings, cost-impact estimates, and a 30/60/90-day remediation roadmap
  • Optimized Loki/Thanos configuration files with before/after benchmarks
  • SLO dashboard templates (error budget, burn rate, availability) ready to import
  • Recording rules library for common high-volume patterns (load-balancer logs, access logs)
  • 30-day handover period with Slack/Teams support for implementation questions
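
For readers unfamiliar with the burn rate shown on the SLO dashboards, the following Python sketch shows the arithmetic behind it, assuming the common definition: observed error ratio divided by the error budget (1 minus the SLO target). The function names are hypothetical, and the default paging threshold of 14.4 is an assumption borrowed from the widely used multi-window convention (2% of a 30-day budget consumed in one hour), not a value taken from this audit.

```python
def burn_rate(errors, total, slo_target):
    """Return how fast the error budget is being consumed.

    A value of 1.0 means the budget would be spent exactly at the end
    of the SLO window; higher values mean it runs out sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total == 0:
        return 0.0  # no traffic, so no budget is being consumed
    error_ratio = errors / total
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def should_page(errors, total, slo_target, threshold=14.4):
    """Return True when the burn rate exceeds the paging threshold.

    The default threshold is an assumed convention, not a recommendation
    for any specific service.
    """
    return burn_rate(errors, total, slo_target) > threshold


if __name__ == "__main__":
    # Hypothetical numbers: 50 failed requests out of 10,000 against a 99.9% SLO.
    rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
    print(f"burn rate: {rate:.2f}")
    print("page:", should_page(50, 10_000, 0.999))
```

Alerting on burn rate rather than on raw error counts ties each page to budget consumption, which is the basis for the SLO-based alerting design listed above.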

Tech stack

Grafana, Loki, Thanos, Vector, Elasticsearch, Prometheus