Observability Stack Audit

You have dashboards nobody trusts, alerts that fire at 3 AM for no reason, and a Loki bill that keeps climbing. The stack works, technically. But it is not working for your team.

I operated an 11 TB/day observability stack across Loki (400 pods, 3 TB RAM), Elasticsearch (200 TB, 3 DCs), and Thanos (50 TB, 200 nodes). My work there included migrating Fluentd to Vector for performance, tuning Loki with Memcached clusters, building recording rules that converted TB/day of load-balancer logs into queryable metrics, and maintaining 12-month Thanos storage for ML forecasting.

This audit brings that experience to your stack. I map what you have, find what is broken or wasteful, and hand you a prioritized fix list with impact estimates.

What's included

Ingestion audit: pipeline mapping (Vector/Fluentd/Alloy), label cardinality analysis, identification of hot paths and silent drops
Query performance: slow dashboard profiling, LogQL/PromQL optimization, indexing strategy review
Cost analysis: retention vs. value matrix, storage tiering, identification of over-retained and idle data streams
Alert quality: alert noise reduction, SLO-based alerting design, elimination of flapping rules
Coverage gaps: identification of critical services with missing or incomplete logs, metrics, or alerts
Recording rules: conversion of high-volume log streams into pre-computed metrics for faster, cheaper querying
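
To give a sense of what label cardinality analysis involves, here is a minimal Python sketch. It assumes you have already exported the label sets of your log streams as a list of dictionaries; the function name `label_cardinality` and the sample data are illustrative only and not part of any Loki or Prometheus API.

```python
from collections import defaultdict


def label_cardinality(streams):
    """Return (label, distinct value count) pairs, highest cardinality first.

    `streams` is a hypothetical export: one dict of label -> value per stream.
    """
    values = defaultdict(set)
    for labels in streams:
        for name, value in labels.items():
            values[name].add(value)
    # Sort by descending count, then by label name for a stable order.
    return sorted(
        ((name, len(vals)) for name, vals in values.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Made-up sample data: request_id takes a new value on every stream.
streams = [
    {"app": "api", "env": "prod", "request_id": "a1"},
    {"app": "api", "env": "prod", "request_id": "b2"},
    {"app": "web", "env": "prod", "request_id": "c3"},
]
report = label_cardinality(streams)
for name, count in report:
    print(f"{name}: {count} distinct values")
```

Labels whose distinct-value count grows with traffic, like `request_id` in this sample, are the usual candidates to move out of the label set and into the log line itself.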

Deliverables

Audit report with prioritized findings, cost-impact estimates, and a 30/60/90-day remediation roadmap
Optimized Loki/Thanos configuration files with before/after benchmarks
SLO dashboard templates (error budget, burn rate, availability) ready to import
Recording rules library for common high-volume patterns (load-balancer logs, access logs)
30-day handover period with Slack/Teams support for implementation questions
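
For readers unfamiliar with the burn-rate panels in the SLO dashboard templates, the underlying arithmetic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the template itself: the function name `burn_rate` and the request counts are hypothetical, and the definition used is the common one (observed error ratio divided by the error budget the SLO allows).

```python
def burn_rate(errors, total, slo_target):
    """Observed error ratio divided by the allowed error ratio (1 - SLO).

    A value of 1.0 means the error budget is being spent exactly as fast
    as the SLO allows; higher values mean it will run out early.
    """
    if total == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    error_ratio = errors / total
    budget = 1.0 - slo_target
    return error_ratio / budget


# Hypothetical window: 50 failed requests out of 10,000 against a 99.9% SLO.
rate = burn_rate(errors=50, total=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

In this made-up window the budget is being consumed at roughly five times the sustainable pace, which is the kind of signal SLO-based alerts are built on instead of raw error counts.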

Tech stack

Grafana, Loki, Thanos, Vector, Elasticsearch, Prometheus