Overview
On-device inference processes data locally without network round-trips. Cloud inference sends data to a server for processing. The choice is not binary — many applications use on-device for latency-sensitive features and cloud for complex models that cannot fit in device memory. The architectural principle: minimise the data that leaves the device for AI/ML processing, especially when that data contains PII, biometrics, or health information.
On-Device Inference
Apple Core ML
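Core ML runs trained models on the CPU, GPU, or Apple Neural Engine (ANE), selecting compute units per model. The ANE has shipped since the A11 Bionic; the 15-38 TOPS range applies to recent chips (roughly A15 through M4), while earlier generations are rated far lower (about 0.6 TOPS on the A11). Models are converted to Core ML format (.mlmodel, or .mlpackage for newer ML Program models) with the coremltools Python library from PyTorch or TensorFlow sources; direct ONNX conversion has been deprecated in recent coremltools releases, so ONNX models usually go through one of those frameworks first. Xcode compiles the model for the target hardware at build time. Privacy benefit: no data leaves the device. Latency: typically sub-100ms for small and mid-sized models on the ANE. Limitation: model size is constrained by device storage and memory.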
Use cases: document OCR (Vision framework), face detection and recognition (Vision + Core ML), language detection (Natural Language framework), on-device text generation (Apple Intelligence on A17 Pro and M-series), personalised recommendations (Create ML trained on-device with private user data).
TensorFlow Lite and Google ML Kit
TensorFlow Lite converts TensorFlow models to .tflite format, optimised for mobile inference through quantisation: reducing weight precision from float32 to int8 gives a 4× size reduction, usually with minimal accuracy loss. The GPU delegate provides hardware acceleration on Qualcomm Adreno and ARM Mali GPUs. The NNAPI delegate targets Qualcomm, MediaTek, and Samsung NPUs on Android; NNAPI is deprecated as of Android 15, so check current platform guidance before adopting it.
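The size arithmetic behind int8 quantisation can be sketched in plain Python. This is an illustrative affine (scale and zero-point) scheme, not TensorFlow Lite's actual converter; the function names `quantize` and `dequantize` and the sample weights are hypothetical.

```python
def quantize(weights):
    """Map float weights onto int8 values with an affine scale and zero point."""
    lo, hi = min(weights), max(weights)
    # Include 0.0 in the range so that zero stays representable.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Recover approximate float values from int8 values."""
    return [(v - zero_point) * scale for v in q]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]  # hypothetical float32 weights
q, scale, zero_point = quantize(weights)
restored = dequantize(q, scale, zero_point)

# float32 stores 4 bytes per weight, int8 stores 1 byte per weight.
float32_bytes = 4 * len(weights)
int8_bytes = 1 * len(weights)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

The rounding error per weight stays within one quantisation step (`scale`), which is why accuracy loss is often small in practice; real converters also calibrate activation ranges with representative data.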
Google ML Kit provides pre-built, optimised on-device models: text recognition (Latin, Chinese, and other scripts), face detection, barcode scanning, pose estimation, object detection, language identification, translation. Zero model management: ML Kit handles model downloads and updates. For custom models, ML Kit's custom model APIs load .tflite models that are either bundled with the app or hosted remotely (through Firebase ML) for OTA updates.
On-Device LLMs: Gemini Nano and Apple Intelligence
Google Gemini Nano runs locally on Pixel 8+ and select Android 14+ devices via Android AICore. It provides summarisation, smart reply, and proofreading capabilities without network access. Access goes through Google's AICore-backed SDKs, and the app must check capability at runtime because not all devices have Gemini Nano.
Apple Intelligence provides on-device foundation model inference on iPhone 15 Pro (A17 Pro), the iPhone 16 series, and iPads with M-series chips. It is privacy-preserving: user context is processed on device for most features, and complex requests are routed to Private Cloud Compute with cryptographic guarantees.
Cloud Inference
Cloud inference is appropriate for: models too large for device memory (>100MB for mobile), frequent model updates required without app releases, complex reasoning requiring large LLM context windows, and features where network latency is acceptable (non-real-time analysis).
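These criteria can be read as a simple routing rule. The sketch below is hypothetical: the `Feature` fields, the function name `choose_inference_target`, and the tie-breaking order are illustrative assumptions, with only the 100 MB threshold taken from the text.

```python
from dataclasses import dataclass

MAX_ON_DEVICE_MODEL_MB = 100  # mobile size threshold quoted in the text


@dataclass
class Feature:
    name: str
    model_size_mb: float
    needs_frequent_model_updates: bool
    needs_large_context: bool
    latency_sensitive: bool
    handles_sensitive_data: bool  # PII, biometrics, health data


def choose_inference_target(f: Feature) -> str:
    """Return 'on_device' or 'cloud' using the criteria from this section."""
    must_use_cloud = (
        f.model_size_mb > MAX_ON_DEVICE_MODEL_MB or f.needs_large_context
    )
    if must_use_cloud:
        return "cloud"
    # Sensitive or latency-critical work stays local whenever the model fits.
    if f.handles_sensitive_data or f.latency_sensitive:
        return "on_device"
    # Otherwise cloud is acceptable when models change often.
    return "cloud" if f.needs_frequent_model_updates else "on_device"


ocr = Feature("document_ocr", 12, False, False, True, True)
report = Feature("long_report_analysis", 4000, True, True, False, False)
```

Here `choose_inference_target(ocr)` keeps a small, latency-sensitive model on the device, while `choose_inference_target(report)` sends the large-context workload to the cloud.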
Architecture pattern: requests are processed at the Backend-for-Frontend (BFF) layer. The mobile app sends a structured request, the BFF calls the ML service, and the BFF returns a structured response. The mobile app never communicates directly with the ML inference endpoint. This keeps the BFF as the single integration boundary and allows model changes without mobile app releases.
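A minimal sketch of that boundary follows, assuming a hypothetical `summarise` feature. `MLServiceClient`, `handle_mobile_request`, and the request and response fields are invented for illustration; a real BFF would sit behind an HTTP framework and call the ML service over the network.

```python
from dataclasses import dataclass


@dataclass
class MobileRequest:
    feature: str
    text: str


@dataclass
class MobileResponse:
    feature: str
    result: str
    model_version: str


class MLServiceClient:
    """Stand-in for the real ML inference endpoint (hypothetical)."""

    model_version = "stub-1"

    def infer(self, task: str, payload: str) -> str:
        # A real client would make a network call here.
        return payload[:40]


def handle_mobile_request(req: MobileRequest, ml: MLServiceClient) -> MobileResponse:
    """BFF handler: the only component that talks to the ML service."""
    if req.feature != "summarise":
        raise ValueError(f"unsupported feature: {req.feature}")
    result = ml.infer(task=req.feature, payload=req.text)
    # The mobile app sees a stable response shape, whatever model is behind it.
    return MobileResponse(req.feature, result, ml.model_version)


response = handle_mobile_request(
    MobileRequest("summarise", "Quarterly results were broadly flat."),
    MLServiceClient(),
)
```

Swapping the model only changes `MLServiceClient`; the `MobileRequest` and `MobileResponse` shapes the app depends on stay the same.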
Privacy Considerations for Mobile AI
On-device AI is the privacy-preferred architecture: data used for inference is never transmitted, so it cannot be intercepted in transit or retained on a server. The data-minimisation principles in the Philippine Data Privacy Act, GDPR, and HIPAA are all easier to satisfy when sensitive data categories are processed on device.
When cloud inference is required for regulated data (PHI, financial records, biometrics), apply four controls: data minimisation (send only the minimum required for inference, not the full record), encryption in transit (TLS 1.3), transient processing (no persistent storage of inference inputs on cloud infrastructure), and explicit user consent.
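The data-minimisation and consent controls can be sketched as a filter applied before any request leaves the device. The field names, the `ALLOWED_FIELDS` allow-list, and `minimise_for_inference` are hypothetical; encryption in transit and transient processing are enforced elsewhere and are not shown.

```python
ALLOWED_FIELDS = {"age_band", "symptom_codes"}  # hypothetical inference inputs


def minimise_for_inference(record: dict, consent_given: bool) -> dict:
    """Return only the fields the model needs; refuse without consent."""
    if not consent_given:
        raise PermissionError("explicit user consent is required")
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}


# Hypothetical full record held on the device.
record = {
    "name": "Example Patient",
    "national_id": "000-00-0000",
    "age_band": "40-49",
    "symptom_codes": ["R05", "R50.9"],
}
payload = minimise_for_inference(record, consent_given=True)
```

An allow-list is used rather than a block-list so that fields added to the record later are excluded by default.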
Anti-Patterns to Avoid
⚠ 1. Sending Raw Biometrics to Cloud for Verification
Transmitting facial images or fingerprint data to a cloud API for identity verification. Creates a biometric data repository that is a high-value breach target and may violate biometric data regulations.
↺ Correct Approach
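On-device biometric verification using platform APIs (Face ID/Touch ID on iOS, BiometricPrompt on Android). Biometrics never leave the device. Matching happens in hardware-backed secure storage (the Secure Enclave on Apple devices, a trusted execution environment on Android), and the app receives only a success or failure result.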
⚠ 2. Blocking UI for Model Inference
Running a TensorFlow Lite model inference on the main thread, producing UI lag during document scanning or image classification.
↺ Correct Approach
Inference runs on a background thread (Dispatchers.Default on Android, background Task on iOS). Results published to the UI through StateFlow/ObservableObject when complete.
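The same shape can be sketched in Python, with a worker thread standing in for `Dispatchers.Default` or a background `Task`. `run_model`, `classify_async`, and `on_result` are hypothetical placeholders for the real interpreter call and the UI state update.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor

executor = ThreadPoolExecutor(max_workers=1)  # dedicated inference worker


def run_model(image_bytes: bytes) -> str:
    """Placeholder for a slow model invocation (hypothetical)."""
    return f"label-for-{len(image_bytes)}-bytes"


def classify_async(image_bytes: bytes, on_result) -> Future:
    """Submit inference to the worker and publish the result when done."""
    future = executor.submit(run_model, image_bytes)
    # The callback plays the role of a StateFlow/ObservableObject update.
    future.add_done_callback(lambda f: on_result(f.result()))
    return future


results = []
done = threading.Event()


def on_result(label: str) -> None:
    results.append(label)
    done.set()


future = classify_async(b"\x00" * 1024, on_result)
# The calling thread is free here; it waits only so the sketch can finish.
done.wait(timeout=5)
```

On a real platform the callback would hop back to the main thread before touching UI state; this sketch only shows that the submitting thread is never blocked by the model call itself.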
Flowchart
%%{init:{'theme':'base','themeVariables':{'fontSize':'14px','fontFamily':'IBM Plex Sans, system-ui, sans-serif','primaryColor':'#DBEAFE','primaryTextColor':'#1e3a5f','primaryBorderColor':'#2563EB','lineColor':'#374151','clusterBkg':'#F9FAFB','clusterBorder':'#D1D5DB','edgeLabelBackground':'#FFFFFF'},'flowchart':{'curve':'orthogonal','padding':30,'nodeSpacing':65,'rankSpacing':75,'useMaxWidth':true}}}%%
flowchart TD
subgraph OnDevice["📱 On-Device Inference — Privacy First"]
CML["Core ML (iOS)
Apple Neural Engine
ANE 15-38 TOPS
.mlmodel format"]
TFL["TensorFlow Lite (Android)
GPU delegate
NNAPI delegate
Int8 quantisation"]
MLKIT["Google ML Kit
Pre-built models
OCR · Face · Barcode
Zero model management"]
GN["On-Device LLMs
Gemini Nano · Pixel 8+
Apple Intelligence · iPhone 15 Pro+"]
end
subgraph Cloud["☁ Cloud Inference — Complex Models"]
BFF2["Mobile BFF
ML Request routing
Data minimisation
Transient processing"]
ML["ML Service
Large model inference
Returns structured result"]
end
subgraph Privacy["🔒 Privacy Principles"]
P1["Minimise data
leaving device"]
P2["On-device for
PII · Biometrics · PHI"]
P3["Cloud: explicit consent
Transient storage only"]
end
CML & TFL & MLKIT & GN --> Privacy
BFF2 --> ML
Privacy -.- BFF2
style OnDevice fill:#E3F2FD,stroke:#1565C0
style Cloud fill:#FFF3E0,stroke:#E65100
style Privacy fill:#FFEBEE,stroke:#B71C1C
References
- Apple — Core ML Documentation. developer.apple.com/documentation/coreml
- Google — ML Kit. developers.google.com/ml-kit
- TensorFlow — TensorFlow Lite. tensorflow.org/lite
- Google — Android AICore (Gemini Nano). developer.android.com/ml/gemini-nano
- NIST — AI Risk Management Framework. nist.gov/system/files/documents/2023/01/26/AI RMF 1.0.pdf