
SLI/SLO/SLA

Defining SLIs, setting SLO targets, error budget policy, and SLA contractual alignment.

TOGAF ADM · NIST CSF · ISO 27001 · AWS Well-Architected · Google SRE · AI-Native
💡 In Plain English

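SLI/SLO/SLA is a core discipline within Observability. A service level indicator (SLI) is a measurement of how a service behaves, such as the fraction of requests that succeed. A service level objective (SLO) is the internal target set for that measurement, and the error budget is the amount of failure the target still allows. A service level agreement (SLA) is the contractual commitment made to customers, usually set looser than the SLO so that teams can react before the contract is breached.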
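The error budget arithmetic can be made concrete with a small sketch. It assumes a time-based availability SLO expressed as a fraction (0.999 for "99.9%") over a rolling window; the function names and the 30-day default are illustrative assumptions, not part of any standard.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of 'bad' time the SLO still allows inside the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - bad_minutes / budget
```

Under these assumptions a 99.9% target over 30 days allows roughly 43.2 minutes of unavailability, so 21.6 bad minutes leaves about half of the budget. An error budget policy then states what happens as that fraction approaches zero, for example pausing feature releases.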

📈 Business Value

Applying SLI/SLO/SLA standards helps reduce system failures, speeds up delivery, and provides the governance evidence required by enterprise clients, regulators such as the Bangko Sentral ng Pilipinas (BSP), and auditors certifying against ISO standards. Large technology companies such as Google, Microsoft, and Amazon treat these standards as competitive differentiators, not compliance overhead.

📖 Detailed Explanation

Observability is the ability to understand a system's internal state from its external outputs. Logs, metrics, and traces (the three pillars of observability), together with structured alerting, form the foundation of SRE practice. SLIs are computed from this telemetry, so an SLO is only as trustworthy as the signals behind it.
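As an illustration of how telemetry becomes an SLI, the sketch below computes two common indicators from a batch of request records. The `Request` type, the "non-5xx counts as good" rule, and the latency threshold are assumptions chosen for the example; real systems usually compute the same ratios from metrics counters instead of raw records.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request record derived from logs or traces."""
    status: int
    latency_ms: float


def availability_sli(requests: Sequence[Request]) -> Optional[float]:
    """Fraction of requests that did not fail with a server error (5xx)."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float) -> Optional[float]:
    """Fraction of requests served within the latency threshold."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)
```

Both indicators have the same shape, good events divided by valid events, which keeps SLO targets and error budgets comparable across services.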

Industry Context: Widely used observability platforms include Prometheus alongside Grafana's LGTM stack (Loki for logs, Grafana for dashboards, Tempo for traces, Mimir for metrics), and Datadog.

Relevance to Philippine Financial Services: Organizations operating under BSP supervision must demonstrate mature observability practices during technology examinations. The BSP Technology Supervision Group evaluates documentation quality, process maturity, and evidence of systematic practice — all of which are addressed by the standards in this section.

Alignment to Global Standards: The practices documented here are aligned to frameworks used by Google, Amazon, Microsoft, and the world's leading consulting firms (McKinsey Digital, Deloitte Technology, Accenture Technology). They represent the current industry consensus on best practices rather than any single vendor's approach.

Engineering Perspective: For engineers, SLI/SLO/SLA provides concrete patterns and anti-patterns that prevent common mistakes and accelerate development by providing proven solutions to recurring problems. Rather than rediscovering what doesn't work, teams can apply battle-tested approaches with known trade-offs.

Architecture Perspective: For architects, SLI/SLO/SLA provides the design vocabulary, decision frameworks, and governance artifacts needed to make and communicate complex technical decisions clearly and consistently.

Business Perspective: For business stakeholders, SLI/SLO/SLA provides assurance that technology investments are aligned to industry standards, reducing the risk of expensive rework, regulatory findings, and system failures that impact customers and revenue.

📈 Architecture Diagram

flowchart LR
    A["SLI/SLO/SLA Concept"] --> B["Principles & Standards"]
    B --> C["Design Decisions"]
    C --> D["Implementation Patterns"]
    D --> E["Governance Checkpoints"]
    E --> F["Validation & Evidence"]
    F -.->|"Feedback Loop"| A
    style A fill:#1e293b,color:#f8fafc
    style F fill:#052e16,color:#4ade80

Lifecycle of SLI/SLO/SLA: from concept through principles, design decisions, implementation patterns, governance checkpoints, and validation — with feedback loops for continuous improvement.

🌎 Real-World Examples

Datadog — Observability at Hyperscale
New York, USA · Observability Platform · 26,000+ customers

Datadog ingests 10+ trillion data points per day from 26,000+ customers. Their own internal observability uses the Three Pillars they sell: logs in Elasticsearch, metrics in their proprietary time-series database, and distributed traces in their APM platform. Every Datadog service emits OpenTelemetry-compatible telemetry — they practice what they sell, using their own platform to debug production issues in real time.

✓ Result: P99 ingestion latency < 2 seconds for all 10T+ daily data points; internal MTTR improved 65% after full OTel instrumentation of internal services

Netflix — Atlas Metrics Platform
Los Gatos, USA · Video Streaming · 260M subscribers

Netflix built Atlas (open-sourced) to handle 10 billion+ metrics data points per day. Key design: high-cardinality dimensions (userId, deviceId) are filtered at ingestion — only aggregate metrics stored. This solved their storage cost problem without losing visibility. Atlas influenced the OpenTelemetry metrics specification. Their 'Vizceral' tool visualizes real-time traffic flows between all 1,000+ microservices.

✓ Result: 10B+ metrics/day ingested at < 1 second query latency; zero metrics system outages affecting incident response in 3 years
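The ingestion-time filtering described in the Netflix example can be sketched as follows. This is not Atlas code: the tuple format, the `aggregate` function, and the exact set of dropped tag keys are assumptions made to show the idea of storing only aggregates.

```python
from collections import defaultdict

# Tag keys treated as high-cardinality; userId and deviceId come from the
# example above, and treating exactly these two as the set is an assumption.
HIGH_CARDINALITY_KEYS = frozenset({"userId", "deviceId"})


def aggregate(points):
    """Sum values per (metric name, remaining tags) after dropping noisy tags.

    `points` is an iterable of (name, tags_dict, value) tuples.
    """
    totals = defaultdict(float)
    for name, tags, value in points:
        kept = tuple(sorted(
            (k, v) for k, v in tags.items() if k not in HIGH_CARDINALITY_KEYS
        ))
        totals[(name, kept)] += value
    return dict(totals)
```

Two play events from different users in the same region collapse into a single series with a value of 2, so the number of stored series grows with the number of regions and not with the number of users.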

Uber — Jaeger Distributed Tracing
San Francisco, USA · Ride-hailing · 25M trips/day

Uber created Jaeger (now a CNCF graduated project) to trace requests across their 4,000+ microservices. A single trip generates traces spanning 20+ services. Jaeger's adaptive sampling algorithm dynamically adjusts trace sample rate based on traffic volume — high-volume paths sample 0.1%, critical error paths sample 100%. Uber's SRE team resolved a 6-hour production mystery in 4 minutes using Jaeger trace comparison.

✓ Result: Mean time to root cause for distributed incidents reduced from 45 minutes to 4 minutes; Jaeger now deployed in 3,000+ organizations worldwide
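The sampling behavior described in the Uber example can be approximated with a simple rate-based rule. This is a simplified illustration, not Jaeger's adaptive sampling implementation: the function names, the target of one sampled trace per second, and the 0.1% floor are assumptions.

```python
import random


def sample_probability(
    is_error: bool,
    requests_per_second: float,
    target_traces_per_second: float = 1.0,  # assumed sampling goal per path
    floor: float = 0.001,                   # 0.1% minimum, as in the example
) -> float:
    """Probability of keeping a trace for one request path."""
    if is_error:
        return 1.0  # always keep error traces
    if requests_per_second <= 0:
        return 1.0  # no known traffic: keep whatever arrives
    rate = target_traces_per_second / requests_per_second
    return max(floor, min(1.0, rate))


def should_sample(is_error: bool, requests_per_second: float, rng=random.random) -> bool:
    """Draw a sampling decision; `rng` is injectable so tests are deterministic."""
    return rng() < sample_probability(is_error, requests_per_second)
```

The probability falls as traffic rises, so busy paths contribute a bounded number of traces while quiet paths and failures remain fully visible.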

Grab — Unified Observability
Singapore · Super App · 35M monthly active users

Grab (Southeast Asia's leading super app) unified their observability across ride-hailing, food delivery, payments, and financial services using OpenTelemetry. Their 'Grafana as a Service' provides standardized dashboards to 200+ engineering teams. Service health is exposed via an internal SRE portal showing real-time SLO burn rates for every team — enabling cross-team incident coordination during regional disruptions.

✓ Result: Cross-team incident correlation time reduced from 25 minutes to 3 minutes; SLO compliance visibility improved from 40% to 100% of production services
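An SLO burn rate, as shown on a portal like the one in the Grab example, is the observed error rate divided by the error rate the SLO allows. The sketch below is a minimal version under that definition; the function names are assumptions, and the 14.4 threshold is the value commonly cited from the Google SRE Workbook for a one-hour window (spending 2% of a 30-day budget in one hour).

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the window, so nothing was burned
    error_rate = bad_events / total_events
    return error_rate / (1.0 - slo_target)


def should_page(short_window_rate: float, long_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if both windows exceed the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold
```

Requiring both a short and a long window to exceed the threshold avoids paging on a brief spike that has already ended, while still reacting quickly to sustained budget consumption.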

🌟 Core Principles

1. Intentional Design for SLI/SLO/SLA

Every SLI, SLO, and SLA must be deliberately designed, not discovered after deployment. Document design decisions as architecture decision records (ADRs) with explicit rationale.

2. Consistency Across the Portfolio

Apply SLI/SLO/SLA practices consistently across all systems. Inconsistent application creates governance blind spots and makes incident investigation unpredictable.

3. Alignment to Business Outcomes

SLI/SLO/SLA practices must demonstrably contribute to business outcomes: reduced downtime, faster delivery, lower operational cost, or improved compliance posture.

4. Evidence-Based Quality Assessment

The quality of an SLI/SLO/SLA implementation must be measurable. Define specific metrics and collect evidence continuously, not only at audit or review time.

5. Continuous Evolution

Standards for SLI/SLO/SLA evolve as technology and threat landscapes change. Schedule quarterly reviews of applicable standards and update practices accordingly.

⚙️ Implementation Steps

1. Current State Assessment

Document the current state of SLI/SLO/SLA practice: what is implemented, what is missing, and what is inconsistent across teams. Use the governance/scorecards section for a structured assessment framework.

2. Gap Analysis Against Standards

Compare the current state against the standards in this section and applicable references (OpenTelemetry, a CNCF project, and the Google SRE book). Prioritize gaps by business impact and remediation effort.

3. Design the Target State

Define the target SLI/SLO/SLA state: which patterns will be adopted, which anti-patterns eliminated, and which governance mechanisms introduced. Express it as a time-bound roadmap.

4. Incremental Implementation

Implement SLI/SLO/SLA improvements incrementally: pilot with one team or system, measure outcomes, refine the approach, then expand. Avoid big-bang transformations.

5. Validate and Iterate

Measure the impact of implemented changes against defined success criteria. Incorporate lessons learned into the practice standards. Contribute improvements back to this library.
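The prioritization in the gap analysis step (business impact against remediation effort) can be sketched as a simple ranking. The `Gap` type, the 1 to 5 scoring scales, and the impact-over-effort ratio are hypothetical choices for illustration; teams may prefer a weighted or qualitative scheme.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Gap:
    """Hypothetical gap record from the assessment."""
    name: str
    impact: int  # business impact, 1 (low) to 5 (high)
    effort: int  # remediation effort, 1 (low) to 5 (high)


def prioritize(gaps: Iterable[Gap]) -> List[Gap]:
    """Order gaps by impact per unit of effort, highest first; ties by name."""
    return sorted(gaps, key=lambda g: (-g.impact / g.effort, g.name))
```

Ranking by ratio surfaces high-impact, low-effort gaps first, which suits the incremental rollout described in step 4.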

✅ Governance Checkpoints

Checkpoint | Owner | Gate Criteria | Status
Current State Documented | Solution Architect | SLI/SLO/SLA current state assessment completed and reviewed | Required
Gap Analysis Reviewed | Architecture Review Board | Gap analysis reviewed and prioritization approved | Required
Implementation Plan Approved | Enterprise Architect | Target state and roadmap approved by ARB | Required
Quality Metrics Defined | Solution Architect | Measurable success criteria defined for SLI/SLO/SLA improvements | Required

◈ Recommended Patterns

✦ Reference Architecture Adoption

Start from an established reference architecture for SLI/SLO/SLA rather than designing from scratch. Adapt it to organizational context rather than rebuilding proven foundations.

✦ Pattern Library Contribution

When your team solves a recurring SLI/SLO/SLA problem with a novel approach, document it as a pattern for the library. This compounds organizational knowledge over time.

✦ Fitness Function Testing

Encode SLI/SLO/SLA standards as automated architectural fitness functions: tests that run in CI/CD and fail builds when standards are violated. This makes governance continuous rather than periodic.
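A fitness function of this kind can be as small as a test that checks every service declares a well-formed SLO. The sketch below assumes a hypothetical in-memory service catalog (in practice it might be loaded from YAML files in the repository); the field names `slo`, `target`, and `sli` are illustrative, not a standard schema.

```python
def find_slo_violations(catalog: dict) -> list:
    """Return human-readable violations; an empty list means the standard is met."""
    violations = []
    for service, spec in sorted(catalog.items()):
        slo = spec.get("slo")
        if slo is None:
            violations.append(f"{service}: no SLO declared")
            continue
        target = slo.get("target")
        if not isinstance(target, (int, float)) or not 0.0 < target < 1.0:
            violations.append(f"{service}: SLO target must be a fraction between 0 and 1")
        if not slo.get("sli"):
            violations.append(f"{service}: SLO does not name the SLI it is measured by")
    return violations


# Hypothetical catalog used by the test below.
EXAMPLE_CATALOG = {
    "payments": {"slo": {"sli": "availability", "target": 0.999}},
    "search": {"slo": {"sli": "latency_under_300ms", "target": 0.99}},
}


def test_every_service_declares_an_slo():
    """Runs under pytest in CI; a non-empty violation list fails the build."""
    assert find_slo_violations(EXAMPLE_CATALOG) == []
```

Because the check returns a list of messages instead of a bare boolean, a failing build tells the owning team exactly which service is out of compliance and why.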

⛔ Anti-Patterns to Avoid

⛔ Standards Theater

Documenting SLI/SLO/SLA standards in architecture policies that no one reads and no one enforces. Standards without automated validation or governance gates are not operational standards.

⛔ Copy-Paste Architecture

Adopting another organization's SLI/SLO/SLA patterns wholesale without adapting them to organizational context, team capability, or regulatory environment. Always adapt; never just copy.

🤖 AI Augmentation Extensions

🤖 AI-Assisted Standards Review

LLM agents analyze design documents against SLI/SLO/SLA standards, generating structured gap reports with cited evidence and suggested remediation approaches.

⚡ AI review accelerates governance but does not replace expert architectural judgment. Use as a first-pass filter before human review.

🤖 RAG Integration for SLI/SLO/SLA

This section is optimized for vector ingestion into an AI-powered architecture assistant. Semantic search enables architects to retrieve relevant SLI/SLO/SLA guidance through natural language queries.

⚡ Reindex the vector store whenever section content is updated to ensure retrieved guidance reflects current standards.

🔗 Related Sections

📚 References & Further Reading