Mobile Reliability Engineering

Overview

Reliability for mobile is a shared responsibility. Engineering owns code quality, architectural resilience, and release quality. Product owns the understanding that reliability investment competes with feature velocity and that the error budget policy governs the trade-off. The error budget framework makes this trade-off explicit and quantified.

Service Level Objectives for Mobile

Define SLOs across four dimensions:

Availability SLO: Crash-free users rate > 99.5% (28-day rolling window). Measured by Firebase Crashlytics. This means at most 0.5% of active users experience a crash per 28-day period. For a 100,000 active user base, this is 500 users experiencing a crash — a meaningful quality bar.

Performance SLO: P75 cold start time < 2.5 seconds; P75 screen load time < 1.5 seconds; P90 API response time < 1.0 second. Measured by Firebase Performance Monitoring.

ANR SLO (Android): ANR rate < 0.2% across the active install base. Measured by Google Play Android Vitals.

Release Quality SLO: Change Failure Rate < 5% (percentage of releases requiring a hotfix or rollback within 48 hours of release).
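The four dimensions above can be captured as data and checked mechanically. The following is a minimal sketch, not a prescribed implementation: the names Slo, SLOS, percentile, and evaluate are hypothetical, and the measurements are assumed to have already been exported from Crashlytics, Performance Monitoring, and Android Vitals by some other means.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass(frozen=True)
class Slo:
    """One target. All names here are illustrative, not from any SDK."""
    name: str
    threshold: float
    higher_is_better: bool  # True: value must exceed threshold; False: stay below it

    def is_met(self, value: float) -> bool:
        if self.higher_is_better:
            return value > self.threshold
        return value < self.threshold


# Thresholds copied from the SLO list (rates in percent, times in seconds).
SLOS = {
    "crash_free_users_pct": Slo("Crash-free users (28-day)", 99.5, True),
    "p75_cold_start_s": Slo("P75 cold start", 2.5, False),
    "p75_screen_load_s": Slo("P75 screen load", 1.5, False),
    "p90_api_response_s": Slo("P90 API response", 1.0, False),
    "anr_rate_pct": Slo("ANR rate (Android)", 0.2, False),
    "change_failure_rate_pct": Slo("Change failure rate", 5.0, False),
}


def percentile(samples, pct):
    """Return the pct-th percentile (1..99) of samples, interpolating inclusively."""
    return quantiles(samples, n=100, method="inclusive")[pct - 1]


def evaluate(measurements):
    """Map each measured metric key to True (SLO met) or False (SLO missed)."""
    return {key: SLOS[key].is_met(value) for key, value in measurements.items()}


# Hypothetical sample data: 400 crashing users out of 100,000, five cold
# start timings, and 1 hotfixed release out of 25.
report = evaluate({
    "crash_free_users_pct": 100 * (100_000 - 400) / 100_000,
    "p75_cold_start_s": percentile([1.0, 2.0, 3.0, 4.0, 5.0], 75),
    "change_failure_rate_pct": 100 * 1 / 25,
})
print(report)
```

Keeping the thresholds in one table means the dashboard, the alerting rules, and the error budget calculation can all read the same numbers.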

Error Budget Policy

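The error budget is the allowed unreliability within the SLO period. With a crash-free SLO of 99.5%, the budget is 0.5 percentage points. If actual performance is 99.8%, 0.2 points are spent and 60% of the error budget remains. If actual performance is 99.7%, 60% of the error budget is consumed, which triggers a reliability focus sprint. At 99.2%, the budget is overspent (160% consumed) and the SLO is violated.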
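The budget arithmetic can be written out as a short worked calculation. This is a sketch under the assumption that both the SLO and the measured rate are expressed as percentages over the same 28-day window; the function names are hypothetical.

```python
def budget_consumed(slo_pct: float, actual_pct: float) -> float:
    """Fraction of the error budget used; 1.0 means the budget is fully spent."""
    budget = 100.0 - slo_pct      # allowed unreliability, e.g. 0.5 points
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    spent = 100.0 - actual_pct    # observed unreliability
    return spent / budget


def budget_remaining(slo_pct: float, actual_pct: float) -> float:
    """Fraction of the budget left, floored at zero once the SLO is breached."""
    return max(0.0, 1.0 - budget_consumed(slo_pct, actual_pct))


for actual in (99.8, 99.7, 99.4):
    used = budget_consumed(99.5, actual)
    left = budget_remaining(99.5, actual)
    print(f"{actual}% crash-free: {used:.0%} consumed, {left:.0%} remaining")
```

Against a 99.5% target, 99.8% consumes 40% of the budget (60% remaining), 99.7% consumes 60%, and 99.4% consumes 120%, meaning the budget is exhausted.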

Error budget policy: when error budget consumption exceeds 50% in a 28-day window, the next sprint allocates 50% of engineering capacity to reliability work. When the budget is exhausted (SLO violated), feature development pauses until the SLO is restored. This policy is agreed with the product organisation before the project launches, not negotiated during an incident.
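Because the policy is a pair of thresholds, it can be encoded as a small decision function that a dashboard or sprint-planning script could call. The sketch below assumes the consumed fraction is computed elsewhere; BudgetState and budget_state are hypothetical names, and treating exactly 100% consumption as exhausted is an assumption based on the SLO being a strict "greater than" target.

```python
from enum import Enum


class BudgetState(Enum):
    GREEN = "normal feature velocity"
    YELLOW = "next sprint allocates 50% of engineering capacity to reliability work"
    RED = "feature development pauses until the SLO is restored"


def budget_state(consumed: float) -> BudgetState:
    """Classify error budget consumption (0.6 means 60% consumed)."""
    if consumed >= 1.0:   # budget exhausted, SLO violated
        return BudgetState.RED
    if consumed > 0.5:    # the policy says "exceeds 50%", so 0.5 itself stays green
        return BudgetState.YELLOW
    return BudgetState.GREEN


for consumed in (0.4, 0.6, 1.2):
    state = budget_state(consumed)
    print(f"{consumed:.0%} consumed: {state.name} ({state.value})")
```

Writing the thresholds down as code before launch supports the point that the policy is agreed in advance rather than argued about during an incident.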

Incident Response

Mobile incidents differ from backend incidents: the application is already deployed to user devices and cannot be rolled back instantly. Response options in order of speed: feature flag kill switch (immediate, if the failing feature is behind a flag), Play Store staged rollout halt (minutes — stops rollout, does not remove installed version), expedited App Store review for emergency hotfix (4-8 hours), standard hotfix release (24-48 hours for App Store).

Runbook for mobile incidents: (1) Identify the affected population using Crashlytics filtering. (2) Determine if the failure is behind a feature flag — toggle off if so. (3) Halt Play Store staged rollout if rollout is in progress. (4) Request expedited App Store review if iOS users are affected. (5) Fix, test, and submit hotfix. (6) Post-mortem within 5 business days.
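The runbook's conditional steps can be sketched as a function that turns a few facts about the incident into an ordered checklist. This is illustrative only: Incident and response_plan are hypothetical names, and the steps are text for a human responder rather than calls to any Crashlytics, Play Console, or App Store Connect API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Facts gathered during triage (illustrative fields)."""
    behind_feature_flag: bool
    play_rollout_in_progress: bool
    ios_affected: bool


def response_plan(incident: Incident) -> list:
    """Return the runbook steps that apply, in runbook order."""
    steps = ["Identify the affected population using Crashlytics filtering"]
    if incident.behind_feature_flag:
        steps.append("Toggle the feature flag off")
    if incident.play_rollout_in_progress:
        steps.append("Halt the Play Store staged rollout")
    if incident.ios_affected:
        steps.append("Request expedited App Store review")
    steps.append("Fix, test, and submit hotfix")
    steps.append("Hold post-mortem within 5 business days")
    return steps


# Example: a flagged feature crashing during an Android staged rollout.
for number, step in enumerate(response_plan(Incident(True, True, False)), start=1):
    print(f"({number}) {step}")
```

The fastest mitigations come first in the list, matching the order of response options by speed.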

Anti-Patterns to Avoid

⚠ 1. No SLOs Defined

Releasing a mobile application without defined reliability targets. Reliability discussions happen reactively after incidents rather than proactively before them.

↺ Correct Approach

SLOs defined and agreed with the product organisation during project inception. Monitored from the first production release. Error budget policy documented and enforced.

⚠ 2. Feature Velocity Always Wins

Engineering team always ships features regardless of error budget status. Reliability regressions accumulate silently until a critical incident.

↺ Correct Approach

Error budget policy enforced. When the budget is exhausted, feature development pauses. This creates the incentive to maintain reliability proactively rather than reactively.

Flowchart

%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'Inter, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
    subgraph SLOs["🎯 Service Level Objectives"]
        A["Availability SLO Crash-free users > 99.5% 28-day rolling"]
        P["Performance SLO P75 cold start < 2.5s P90 API response < 1.0s"]
        Q["Release Quality SLO Change Failure Rate < 5% Hotfix within 48hr"]
    end
    subgraph Budget["💰 Error Budget Policy"]
        G["Green: > 50% remaining Normal feature velocity"]
        Y["Yellow: 50% consumed 50% capacity to reliability work"]
        R["Red: Budget exhausted Feature dev pauses Reliability sprint"]
    end
    subgraph IR["🚨 Incident Response"]
        FF["Feature Flag Kill Immediate"]
        HR["Play Store Halt Minutes"]
        ER["Expedited Review 4-8 hours iOS"]
        HF["Hotfix Release 24-48 hours"]
    end
    A & P & Q --> G
    G --> Y --> R
    R --> FF --> HR --> ER --> HF
    style SLOs fill:#E3F2FD,stroke:#1565C0
    style Budget fill:#FFF3E0,stroke:#E65100
    style IR fill:#FFEBEE,stroke:#B71C1C
    style R fill:#FFCDD2,stroke:#B71C1C
