Overview
Reliability for mobile is a shared responsibility between engineering (code quality, architectural resilience, release quality) and product (accepting that reliability investment competes with feature velocity and that the error budget policy governs the trade-off). The error budget framework makes this trade-off explicit and quantified.
Service Level Objectives for Mobile
Define SLOs across four dimensions:
Availability SLO: Crash-free users rate > 99.5% (28-day rolling window). Measured by Firebase Crashlytics. This means at most 0.5% of active users experience a crash per 28-day period. For a 100,000 active user base, this is 500 users experiencing a crash — a meaningful quality bar.
Performance SLO: P75 cold start time < 2.5 seconds; P75 screen load time < 1.5 seconds; P90 API response time < 1.0 second. Measured by Firebase Performance Monitoring.
ANR SLO (Android): ANR rate < 0.2% across the active install base. Measured by Google Play Android Vitals.
Release Quality SLO: Change Failure Rate < 5% (percentage of releases requiring a hotfix or rollback within 48 hours of release).
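The four dimensions above can be captured as data so that a dashboard or release gate checks them uniformly. The following is a minimal Python sketch, not a prescribed implementation: the `Slo` class, the metric keys such as `crash_free_users_pct`, and the `violated_slos` helper are hypothetical names, and the measurements are assumed to have been exported already from Crashlytics, Performance Monitoring, and Android Vitals.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Slo:
    name: str
    metric: str             # hypothetical key in the measurements dict
    threshold: float
    higher_is_better: bool  # True for crash-free rate, False for latencies and rates

    def is_met(self, value: float) -> bool:
        # Targets are strict inequalities, matching "> 99.5%" and "< 2.5 seconds".
        if self.higher_is_better:
            return value > self.threshold
        return value < self.threshold


# Thresholds copied from the SLO list above.
MOBILE_SLOS = [
    Slo("Availability", "crash_free_users_pct", 99.5, True),
    Slo("Cold start P75", "cold_start_p75_s", 2.5, False),
    Slo("Screen load P75", "screen_load_p75_s", 1.5, False),
    Slo("API response P90", "api_response_p90_s", 1.0, False),
    Slo("ANR rate (Android)", "anr_rate_pct", 0.2, False),
    Slo("Change failure rate", "change_failure_rate_pct", 5.0, False),
]


def violated_slos(measurements: Dict[str, float]) -> List[str]:
    """Return the names of SLOs whose measured value misses the target."""
    return [slo.name for slo in MOBILE_SLOS
            if not slo.is_met(measurements[slo.metric])]
```

Keeping thresholds in one list means a change to a target is a one-line edit that is easy to review alongside the error budget policy.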
Error Budget Policy
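The error budget is the allowed unreliability within the SLO period. With a crash-free SLO of 99.5%, the budget is 0.5 percentage points of crash-affected users. If actual performance is 99.8%, 0.2 points (40%) of the budget are consumed and 60% remains. If actual performance is 99.7%, 60% of the budget is consumed, which triggers a reliability focus sprint. If actual performance is 99.2%, the budget is overspent (160% consumed) and the SLO is violated.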
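The budget arithmetic is simple enough to check by hand, but a small helper avoids mistakes when the numbers change. This is an illustrative Python sketch; the function name `error_budget_consumed` is an assumption, not part of any monitoring SDK.

```python
def error_budget_consumed(slo_pct: float, actual_pct: float) -> float:
    """Fraction of the error budget consumed (1.0 means fully spent).

    slo_pct and actual_pct are percentages, e.g. 99.5 and 99.8.
    """
    budget = 100.0 - slo_pct      # allowed unreliability, e.g. 0.5 points
    if budget <= 0:
        raise ValueError("SLO target must be below 100%")
    spent = 100.0 - actual_pct    # observed unreliability
    # Rounding hides floating-point noise such as 0.4000000000000057.
    return round(spent / budget, 6)


print(f"{error_budget_consumed(99.5, 99.8):.0%}")  # 40% consumed, so 60% remains
print(f"{error_budget_consumed(99.5, 99.7):.0%}")  # 60% consumed
```

A result above 1.0 means the measured rate has fallen below the SLO target and the budget is overspent.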
Error budget policy: when error budget consumption exceeds 50% in a 28-day window, the next sprint allocates 50% engineering capacity to reliability work. When budget is exhausted (SLO violated), feature development pauses until the SLO is restored. This policy is agreed with the product organisation before the project launches — not negotiated during an incident.
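Because the policy is agreed in advance, it can be written down as a pure function of budget consumption, which leaves little room for negotiation mid-incident. The sketch below is one possible encoding in Python; `BudgetStatus` and `budget_status` are hypothetical names, and the input is assumed to be the consumed fraction of the budget for the current 28-day window (0.6 for 60%).

```python
from enum import Enum


class BudgetStatus(Enum):
    GREEN = "normal feature velocity"
    YELLOW = "next sprint allocates 50% of engineering capacity to reliability work"
    RED = "feature development pauses until the SLO is restored"


def budget_status(consumed: float) -> BudgetStatus:
    """Map error budget consumption (fraction, 1.0 = exhausted) to a policy state."""
    if consumed >= 1.0:   # budget exhausted, SLO violated
        return BudgetStatus.RED
    if consumed > 0.5:    # consumption exceeds 50%
        return BudgetStatus.YELLOW
    return BudgetStatus.GREEN


for consumed in (0.4, 0.6, 1.6):
    status = budget_status(consumed)
    print(f"{consumed:.0%} consumed: {status.name} ({status.value})")
```

The thresholds are hard-coded here to mirror the policy text; a team adopting different thresholds would change only the two comparisons.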
Incident Response
Mobile incidents differ from backend incidents: the application is already deployed to user devices and cannot be rolled back instantly. Response options in order of speed: feature flag kill switch (immediate, if the failing feature is behind a flag), Play Store staged rollout halt (minutes — stops rollout, does not remove installed version), expedited App Store review for emergency hotfix (4-8 hours), standard hotfix release (24-48 hours for App Store).
Runbook for mobile incidents: (1) Identify the affected population using Crashlytics filtering. (2) Determine if the failure is behind a feature flag — toggle off if so. (3) Halt Play Store staged rollout if rollout is in progress. (4) Request expedited App Store review if iOS users are affected. (5) Fix, test, and submit hotfix. (6) Post-mortem within 5 business days.
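The runbook's conditional steps can be expressed as a checklist generator, which is one way to keep an on-call tool or chat bot consistent with the written procedure. This is a hedged Python sketch: the `Incident` fields and the `runbook_steps` function are hypothetical, and the code only produces the ordered list of actions; it does not call any Crashlytics, Play Console, or App Store Connect API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    behind_feature_flag: bool        # failing feature can be switched off remotely
    play_rollout_in_progress: bool   # an Android staged rollout is still running
    ios_affected: bool               # iOS users see the failure


def runbook_steps(incident: Incident) -> List[str]:
    """Return the runbook actions that apply, in the order they should be taken."""
    steps = ["Identify the affected population using Crashlytics filtering"]
    if incident.behind_feature_flag:
        steps.append("Toggle the feature flag off")
    if incident.play_rollout_in_progress:
        steps.append("Halt the Play Store staged rollout")
    if incident.ios_affected:
        steps.append("Request expedited App Store review")
    steps.append("Fix, test, and submit the hotfix")
    steps.append("Hold the post-mortem within 5 business days")
    return steps


for number, step in enumerate(runbook_steps(Incident(True, False, True)), start=1):
    print(f"({number}) {step}")
```

Steps that do not apply are omitted rather than marked as skipped, so the output is the shortest checklist that still follows the runbook order.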
Anti-Patterns to Avoid
⚠ 1. No SLOs Defined
Releasing a mobile application without defined reliability targets. Reliability discussions happen reactively after incidents rather than proactively before them.
Correct Approach
SLOs defined and agreed with the product organisation during project inception. Monitored from the first production release. Error budget policy documented and enforced.
⚠ 2. Feature Velocity Always Wins
Engineering team always ships features regardless of error budget status. Reliability regressions accumulate silently until a critical incident.
Correct Approach
Error budget policy enforced. When the budget is exhausted, feature development pauses. This creates the incentive to maintain reliability proactively rather than reactively.
Flowchart
%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
subgraph SLOs["🎯 Service Level Objectives"]
A["Availability SLO
Crash-free users > 99.5%
28-day rolling"]
P["Performance SLO
P75 cold start < 2.5s
P90 API response < 1.0s"]
Q["Release Quality SLO
Change Failure Rate < 5%
Hotfix or rollback within 48hr"]
end
subgraph Budget["💰 Error Budget Policy"]
G["Green: > 50% remaining
Normal feature velocity"]
Y["Yellow: > 50% consumed
50% capacity to reliability work"]
R["Red: Budget exhausted
Feature dev pauses
Reliability sprint"]
end
subgraph IR["🚨 Incident Response"]
FF["Feature Flag Kill
Immediate"]
HR["Play Store Halt
Minutes"]
ER["Expedited Review
4-8 hours iOS"]
HF["Hotfix Release
24-48 hours"]
end
A & P & Q --> G
G --> Y --> R
R --> FF --> HR --> ER --> HF
style SLOs fill:#E3F2FD,stroke:#1565C0
style Budget fill:#FFF3E0,stroke:#E65100
style IR fill:#FFEBEE,stroke:#B71C1C
style R fill:#FFCDD2,stroke:#B71C1C
References
- Google — Site Reliability Engineering. sre.google/books
- Google — The SRE Workbook. sre.google/workbook/table-of-contents
- Beyer et al. — Site Reliability Engineering. O'Reilly, 2016.
- Google — Android Vitals. play.google.com/console/about/vitals