
Metrics

Measuring system health through quantified signals that tell you what's happening right now

Frameworks: TOGAF ADM · NIST CSF · ISO 27001 · AWS Well-Architected · Google SRE · AI-Native
💡 In Plain English

Metrics are like the dashboard of a car — speed, fuel level, engine temperature. Without them you're driving blind. System metrics tell engineers how fast things are going, how much load the system is under, and whether something is about to break.

📈 Business Value

Google's SRE practice attributes their 99.99%+ uptime to metrics-driven operations. The USE Method (Utilization, Saturation, Errors) and RED Method (Rate, Errors, Duration) give engineers a structured way to detect problems in under 5 minutes. Without metrics, MTTR (Mean Time to Resolution) for incidents increases 3–5x.

📖 Detailed Explanation

Metrics are numerical measurements of system behavior sampled over time. Unlike logs (which record individual events) and traces (which track request journeys), metrics give you aggregate, real-time visibility into system health — the foundation of SRE practice at Google, Microsoft, and every high-availability operation.

The Four Golden Signals (Google SRE Book): Latency (how long requests take), Traffic (how many requests per second), Errors (rate of failed requests), and Saturation (how full the system is). These four signals, monitored together, surface virtually every category of production problem. If latency is rising, traffic is stable, errors are low, and CPU saturation is increasing — a CPU-bound performance regression has been deployed. If errors are rising and traffic is flat — a bug was deployed. If saturation is approaching 100% while traffic is rising — you need to scale.
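The three diagnostic combinations above can be sketched as a simple lookup. This is illustrative only — the trend flags and function name are hypothetical; real systems derive trends from rates and thresholds, not hand-labelled booleans:

```python
# Toy diagnosis over golden-signal trends, mirroring the three
# scenarios described in the text. Illustrative, not production logic.
def diagnose(latency_rising, traffic_rising, errors_rising, saturation_rising):
    if latency_rising and not traffic_rising and not errors_rising and saturation_rising:
        return "CPU-bound performance regression deployed"
    if errors_rising and not traffic_rising:
        return "bug deployed"
    if saturation_rising and traffic_rising:
        return "scale out: capacity limit approaching"
    return "no clear signature; inspect dashboards"

print(diagnose(True, False, False, True))
```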

The USE Method (Brendan Gregg) applies to resource-level metrics: Utilization (% of time the resource is busy), Saturation (queued work the resource cannot yet service), Errors (count of error events). Apply USE to CPU, memory, network, disk I/O, and message queue depth. A DBA notices that database CPU is at 95% utilization (USE: Utilization) and the query queue is growing (USE: Saturation) — a missing index is the likely cause.

The RED Method (Tom Wilkie) applies to service-level metrics: Rate (requests per second), Errors (failed request count or percentage), Duration (latency distribution). RED metrics answer "is this service healthy right now?" for every service in your mesh. The combination of USE (for infrastructure health) and RED (for service health) gives a complete operational picture.
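As a sketch, RED for a single service can be computed from a window of request samples; the sample data and function name here are hypothetical:

```python
# Compute RED (Rate, Errors, Duration) for one service from a window of
# request samples, each a (duration_seconds, http_status) pair.
def red_snapshot(samples, window_seconds):
    durations = sorted(d for d, _ in samples)
    errors = sum(1 for _, status in samples if status >= 500)
    p95 = durations[int(0.95 * (len(durations) - 1))]  # nearest-rank p95
    return {
        "rate_rps": len(samples) / window_seconds,   # Rate
        "error_pct": 100.0 * errors / len(samples),  # Errors
        "p95_seconds": p95,                          # Duration
    }

# Hypothetical 60-second window: mostly fast requests, a few slow failures.
samples = [(0.12, 200)] * 90 + [(0.80, 200)] * 8 + [(2.50, 500)] * 2
snap = red_snapshot(samples, window_seconds=60)
```

In production this aggregation happens in the metrics backend (e.g. PromQL over histograms), not in application code; the sketch only shows what the three numbers mean.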

Metric Types in Prometheus/OpenTelemetry: Counter (monotonically increasing: total requests served, total bytes written — never decreases, useful for rates), Gauge (current snapshot: memory usage, queue depth, active connections — can go up or down), Histogram (distribution of observations: request latency bucketed into ranges, enabling p50/p95/p99 calculation), Summary (pre-computed quantiles: p50, p95, p99 latency calculated client-side). For SLO-based alerting, histograms are preferred over summaries because they can be aggregated across multiple instances.
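A toy model of the metric-type contracts (deliberately not the real Prometheus or OpenTelemetry client APIs) might look like:

```python
import bisect

class ToyCounter:                      # monotonically increasing
    def __init__(self): self.value = 0
    def inc(self, amount=1):
        assert amount >= 0, "counters never decrease"
        self.value += amount

class ToyGauge:                        # current snapshot, can go up or down
    def __init__(self): self.value = 0
    def set(self, value): self.value = value

class ToyHistogram:                    # observations bucketed by upper bound
    def __init__(self, buckets):
        self.buckets = sorted(buckets)           # upper bounds, e.g. seconds
        self.counts = [0] * (len(buckets) + 1)   # extra slot = +Inf bucket
    def observe(self, value):
        # per-bucket counts here; Prometheus exposes them cumulatively
        # as `le` buckets, which is what enables p50/p95/p99 queries
        self.counts[bisect.bisect_left(self.buckets, value)] += 1

requests = ToyCounter(); requests.inc()
inflight = ToyGauge(); inflight.set(7)
latency = ToyHistogram([0.1, 0.5, 2.0])
for v in (0.05, 0.3, 1.7, 4.0):
    latency.observe(v)
```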

Cardinality is the most common cause of metrics-platform cost explosions. Cardinality = the number of unique label-value combinations. Consider a metric `http_requests_total` with labels `{service, endpoint, method, status}`: 50 services × 100 endpoints × 5 methods × 20 status codes = 500,000 unique time series. Prometheus struggles above 10 million active series. Solutions: limit high-cardinality labels (avoid user_id or request_id as metric labels), use logs for high-cardinality data, and use recording rules to pre-aggregate expensive queries.
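The arithmetic behind that explosion is worth making explicit:

```python
# Cardinality of http_requests_total with the label values from the text:
# every unique label combination is a separate active time series.
services, endpoints, methods, statuses = 50, 100, 5, 20
series = services * endpoints * methods * statuses
print(series)  # 500000 series from a single metric

# Adding a user_id label with even 1,000 distinct users multiplies this
# to 500 million series -- which is why per-request identifiers belong
# in logs, not in metric labels.
```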

SLI/SLO Alignment: Metrics must be designed in alignment with your Service Level Indicators. If your SLO is "99.9% of payment API requests complete in under 2 seconds," your metrics must give you the data to calculate: (requests completing under 2s) / (total requests) × 100%. This requires a latency histogram with buckets that include the 2-second boundary. Design metrics for SLOs, not for debugging convenience.
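Assuming hypothetical cumulative bucket counts, the SLI calculation looks like:

```python
# SLO: "99.9% of payment API requests complete in under 2 seconds."
# Cumulative histogram: count of requests with duration <= each bound.
# The 2.0s boundary MUST be a bucket bound, or the SLI can't be read exactly.
cumulative = {0.5: 96_100, 1.0: 99_200, 2.0: 99_940, float("inf"): 100_000}

good = cumulative[2.0]             # requests completing in <= 2s
total = cumulative[float("inf")]   # all requests
sli = 100.0 * good / total
print(f"SLI = {sli:.2f}%  (SLO target: 99.90%)")
```

If the nearest bucket bounds were 1.0s and 5.0s instead, the 2-second SLI could only be interpolated, not measured — which is what "design metrics for SLOs" means in practice.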

OpenTelemetry is the CNCF standard for metrics (and traces and logs) instrumentation. It provides vendor-neutral SDKs that emit signals to any compatible backend (Prometheus, Datadog, New Relic, Grafana Cloud). Adopting OpenTelemetry from the start eliminates vendor lock-in in observability tooling — you can switch from Datadog to Grafana without re-instrumenting your services.

📈 Architecture Diagram

```mermaid
graph TB
    subgraph SERVICES["Application Services"]
        S1["Payment Service<br/>RED Metrics"]
        S2["Auth Service<br/>RED Metrics"]
        S3["Notification Service<br/>RED Metrics"]
    end

    subgraph INFRA["Infrastructure"]
        DB[("Database<br/>USE Metrics")]
        MQ["Message Queue<br/>USE Metrics"]
        K8S["Kubernetes Nodes<br/>USE Metrics"]
    end

    subgraph COLLECTION["Metrics Collection"]
        OTEL["OpenTelemetry<br/>Collector"]
        PROM["Prometheus<br/>Scrape + Store"]
    end

    subgraph ALERTING["Alerting & Visualization"]
        AM["Alertmanager<br/>SLO Breach Alerts"]
        GRAF["Grafana<br/>Dashboards"]
        PD["PagerDuty<br/>On-Call Routing"]
    end

    S1 -->|OTLP| OTEL
    S2 -->|OTLP| OTEL
    S3 -->|OTLP| OTEL
    DB -->|Exporter| PROM
    MQ -->|Exporter| PROM
    K8S -->|node_exporter| PROM
    OTEL --> PROM
    PROM --> AM
    PROM --> GRAF
    AM -->|Alert| PD

    style SERVICES fill:#0f172a,color:#f8fafc
    style INFRA fill:#1e1b4b,color:#f8fafc
    style COLLECTION fill:#052e16,color:#f8fafc
    style ALERTING fill:#3b0764,color:#f8fafc
```

End-to-end metrics pipeline: services emit OpenTelemetry metrics, Prometheus scrapes and stores them, Alertmanager fires SLO breach alerts to PagerDuty, and Grafana provides dashboards for human visibility.

🌎 Real-World Examples

Google — Four Golden Signals Origin
Mountain View, USA · Cloud Platform · Trillion-scale metrics

Google's SRE Book defined the Four Golden Signals (Latency, Traffic, Errors, Saturation) as the minimal viable metric set for any production service. Their internal monitoring platform (Monarch) ingests trillions of data points per day. Error budgets derived from SLOs — derived from the Golden Signals — govern feature release velocity mathematically.

✓ Result: SLA violations reduced 78% after SLO-driven development; error budget framework adopted by 70% of CNCF projects

Netflix — Atlas at 10 Billion Metrics/Day
Los Gatos, USA · Video Streaming

Netflix's Atlas time-series database (open-sourced) handles 10 billion+ metrics/day. Key design decision: high-cardinality labels (userId, deviceId) filtered at ingestion — only aggregates stored. This prevents the 'cardinality explosion' that kills Prometheus at scale. Atlas influenced the OpenTelemetry metrics specification.

✓ Result: 10B+ metrics/day with < 1 second query latency; zero metrics system outages affecting incident response in 3 years

Uber — Jaeger Distributed Tracing
San Francisco, USA · Ride-hailing

Uber created Jaeger (CNCF graduated project) to trace requests across 4,000+ microservices. Adaptive sampling adjusts trace rate by traffic volume. A single trip generates traces spanning 20+ services. Uber's SRE team resolved a 6-hour production mystery in 4 minutes using trace comparison.

✓ Result: MTTR for distributed incidents: 45 minutes → 4 minutes; Jaeger now in 3,000+ organizations

Grab — Unified OpenTelemetry
Singapore · Super App · 35M MAU

Grab unified observability across ride-hailing, food, payments, and financial services using OpenTelemetry. Internal SRE portal shows real-time SLO burn rates for every team — enabling cross-team incident coordination. Grafana-as-a-Service provides standardized dashboards to 200+ teams.

✓ Result: Cross-team incident correlation: 25 minutes → 3 minutes; SLO visibility: 40% → 100% of production services

🌟 Core Principles

1. Four Golden Signals for Every Service

Every production service must emit Latency, Traffic, Errors, and Saturation metrics. These four signals, combined, surface virtually every class of production problem within minutes.

2. USE Method for Infrastructure

Every infrastructure component (CPU, memory, disk, network, queue) must expose Utilization, Saturation, and Errors metrics. USE metrics identify resource bottlenecks that golden signals alone may not pinpoint.

3. Design Metrics for SLO Measurement

Metrics are instrumentation for SLOs. Before writing metrics code, know your SLO. Design the metric labels, buckets, and cardinality to make the SLO calculation accurate and efficient.

4. Cardinality is a Cost

Every unique label combination creates a new time series. High-cardinality labels (user ID, request ID, IP address) are metrics system killers. Use labels for dimensions you will filter by in queries; use logs for high-cardinality per-request data.

⚙️ Implementation Steps

1. Define SLIs Before Instrumenting

Write down your SLO in measurable form: '99.9% of API requests complete in under 500ms.' Then design the metric that lets you calculate this: a request_duration_seconds histogram with a 0.5s bucket boundary.

2. Instrument with OpenTelemetry

Use the OpenTelemetry SDK for your language. Emit counters for request counts, histograms for latency distributions, and gauges for resource utilization. Route to a Prometheus-compatible backend via OTLP exporter.

3. Build RED Dashboards per Service

In Grafana: create one dashboard per service with panels for Request Rate (req/s), Error Rate (%), and Duration (p50/p95/p99). Make these dashboards the first thing on-call engineers open during an incident.
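As a sketch, the three RED panels might use PromQL like the following (the metric names `http_requests_total` and `request_duration_seconds` are assumptions carried over from earlier examples; adjust labels to your instrumentation):

```promql
# Rate: requests per second over a 5-minute window
sum by (service) (rate(http_requests_total[5m]))

# Errors: percentage of 5xx responses
100 * sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by (service) (rate(http_requests_total[5m]))

# Duration: p95 latency from histogram buckets
histogram_quantile(0.95,
  sum by (service, le) (rate(request_duration_seconds_bucket[5m])))
```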

4. Set SLO-Based Alerts

Configure alerts based on error budget burn rate (the approach recommended by Google SRE book), not raw thresholds. A burn rate alert fires when you're consuming your error budget too fast, giving advance warning before the SLO is actually breached.

5. Run Quarterly Metrics Hygiene

Audit your metrics cardinality quarterly. Remove unused metrics. Consolidate high-cardinality labels. Recording rules should pre-aggregate expensive queries. Keep Prometheus storage under 10M active series.

✅ Governance Checkpoints

| Checkpoint | Owner | Gate Criteria | Status |
|---|---|---|---|
| Four Golden Signals Instrumented | Platform Team | All production services emit L/T/E/S metrics via OpenTelemetry | Required |
| Grafana RED Dashboards Live | SRE / Platform | Per-service RED dashboards published and linked in runbooks | Required |
| SLO Burn Rate Alerts Configured | SRE | Error budget burn rate alerts firing correctly in staging | Required |
| Cardinality Audit Passed | Platform Team | No single metric exceeds 100k active time series | Quarterly |

◈ Recommended Patterns

✦ Error Budget Alerting

Alert when the error budget is being consumed at a rate that will exhaust it before the end of the SLO window. Two alert tiers: fast burn (consuming 5% error budget in 1 hour) and slow burn (consuming 10% error budget in 6 hours).
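Under the assumption of a 30-day SLO window (the window length is a choice, not given in the text), those two tiers translate into burn-rate thresholds as follows — a sketch of the math, not a complete alerting rule:

```python
# Burn rate 1.0 means consuming the error budget exactly over the
# full SLO window; a threshold is "what burn rate exhausts the given
# budget fraction in the given number of hours".
WINDOW_HOURS = 30 * 24  # assumed 30-day SLO window = 720h

def burn_rate_threshold(budget_fraction, hours):
    return budget_fraction * WINDOW_HOURS / hours

fast = burn_rate_threshold(0.05, 1)   # 5% of budget in 1 hour  -> 36.0
slow = burn_rate_threshold(0.10, 6)   # 10% of budget in 6 hours -> 12.0

# For a 99.9% SLO the error budget is 0.1%, so the fast-burn alert
# fires when the 1h error ratio exceeds 36 * 0.001 = 3.6%.
```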

✦ Recording Rules

Pre-compute expensive PromQL expressions as recording rules that Prometheus evaluates continuously on a fixed interval. This transforms expensive range queries (summing across hundreds of instances) into cheap lookups of pre-computed series.
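A minimal Prometheus recording-rule file might look like this (group, rule, and metric names are illustrative):

```yaml
# Sketch of a recording-rule group; Prometheus evaluates each rule on
# the group's interval and stores the result as a new time series that
# dashboards and alerts can query cheaply.
groups:
  - name: service_aggregations
    interval: 30s
    rules:
      - record: service:http_requests:rate5m
        expr: sum by (service) (rate(http_requests_total[5m]))
      - record: service:request_duration_seconds:p95
        expr: >
          histogram_quantile(0.95,
            sum by (service, le) (rate(request_duration_seconds_bucket[5m])))
```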

✦ Exemplar-Linked Metrics

Associate individual trace exemplars with histogram metrics. When a p99 latency spike fires an alert, engineers can jump from the alerting metric directly to a representative slow trace. Supported by OpenTelemetry + Tempo + Grafana.

⛔ Anti-Patterns to Avoid

⛔ Alert on Every Metric

Configuring alerts for every metric that exceeds a static threshold creates alert storms that train engineers to ignore pages. Alert on symptoms (SLO breach, error rate spike) not causes (CPU > 70%). Alert fatigue is an existential risk to on-call programs.

⛔ Metrics Without Dashboards

Collecting metrics that no one has built a dashboard for leaves them theoretical: no one knows how to read them or what threshold indicates a problem. Every metric used in alerting must have a corresponding dashboard panel.

🤖 AI Augmentation Extensions

🤖 AIOps Anomaly Detection

ML models trained on historical metric patterns detect anomalies in real time — catching subtle degradations (a 5% p99 latency increase that's still within thresholds but represents a genuine regression) that static threshold alerts miss.

⚡ AIOps anomaly detection generates false positives for the first 30 days while learning normal patterns. Don't route AIOps alerts to PagerDuty until the false positive rate is below 5%.
