Running Llama 3 on Mobile: The Ultimate Guide to Local LLMs with Flutter
Run Llama 3 locally on Android and iOS with Flutter. A complete guide to using MLC LLM, optimizing quantization (q4f16), and building a private, offline chatbot.

For the last two years, "AI" has meant "API call": you send your data to OpenAI, they process it, and they send the response back.
- Pros: Easy to implement. Powerful models.
- Cons: Expensive. High latency. Zero privacy. No offline mode.
The game has changed. With the release of Llama 3, Phi-3, and Gemma, we now have "Small Language Models" (SLMs) that are smart enough to be useful and small enough to fit in RAM.
Today, we are going to do something that sounds impossible: We are going to run a Llama 3 8B model entirely on your phone, integrated into a Flutter app. No internet required. 0ms network latency. 100% privacy.
Part 1: The Tech Stack (MLC LLM)
You can't just run Python/PyTorch on a phone. It's too slow. We need Hardware Acceleration.
- iOS: Metal (GPU).
- Android: OpenCL / Vulkan (GPU).
MLC LLM (Machine Learning Compilation) is the magic tool. It takes a HuggingFace model, compiles it into a binary format optimized for specific GPUs (using TVM Unity), and exposes a C++ API.
We will use:
1. Llama-3-8B-Instruct.
2. MLC LLM for model compilation.
3. Flutter for the UI.
4. FFI (Foreign Function Interface) to bridge Dart and C++ (see the sketch just below).
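You will almost never call FFI by hand, because the mlc_llm wrapper handles the bridge for you. But as a rough sketch of what that bridge looks like under the hood, binding a single native symbol with dart:ffi goes something like this. The symbol name `mlc_create_chat_module` and its C signature are hypothetical placeholders for illustration, not MLC's real API:

```dart
// Conceptual sketch only: the mlc_llm Flutter package wraps this for you.
// The symbol name and signature below are hypothetical placeholders.
import 'dart:ffi';
import 'package:ffi/ffi.dart';

// Assumed C signature: void* mlc_create_chat_module(const char* model_lib);
typedef _CreateNative = Pointer<Void> Function(Pointer<Utf8> modelLib);
typedef _CreateDart = Pointer<Void> Function(Pointer<Utf8> modelLib);

void main() {
  // Load the compiled model library produced by `mlc_llm compile`.
  final dylib = DynamicLibrary.open('libmodel_lib.so');

  // Look up the native function and call it with a UTF-8 string.
  final create = dylib.lookupFunction<_CreateNative, _CreateDart>(
    'mlc_create_chat_module',
  );
  final name = 'llama_q4f16_1'.toNativeUtf8();
  final handle = create(name);
  malloc.free(name);
  print('Native chat module handle: $handle');
}
```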
Part 2: Preparing the Model (Quantization)
A standard 8B-parameter model at float16 precision weighs about 16GB (8 billion parameters × 2 bytes each). Most phones have 8GB or 12GB of RAM, so if you try to load 16GB, the OS kills your app instantly.
We must quantize. We reduce the weight precision from 16-bit to 4-bit (the "q4f16_1" scheme, which keeps activations in float16). This shrinks the model to ~3.5GB. The quality loss is negligible for chat tasks.
Compilation Command (using mlc_llm CLI):
```bash
mlc_llm compile ./Llama-3-8B-Instruct \
  --quantization q4f16_1 \
  --device android \
  --output ./dist/llama-3-8b-q4f16_1.tar
```
This generates:
- model_lib.so (The compiled computation graph).
- params (The binary weights).
Part 3: The Flutter Integration
MLC provides a wrapper for Flutter. But integrating it into a production app architecture (like the one I wrote about in my Clean Architecture blog) requires care.
1. Dependency
```yaml
dependencies:
  mlc_llm: ^0.1.0
  provider: ^6.0.0
```
2. The Engine Service
We don't want the UI to talk to the engine directly. We wrap it in a Service.
```dart
import 'package:mlc_llm/mlc_llm.dart';

class LLMEngine {
  late MLCEngine _engine;
  final String modelPath;
  bool _isLoaded = false;

  LLMEngine(this.modelPath);

  Future<void> init() async {
    _engine = MLCEngine();
    // This is the heavy part. Loads ~3.5GB of weights into RAM.
    await _engine.reload(modelPath, modelLib: 'llama_q4f16_1');
    _isLoaded = true;
  }

  Stream<String> generate(String prompt) {
    if (!_isLoaded) throw Exception("Model not loaded");
    // Streaming response, token by token.
    return _engine.chat.completions.createStream(
      messages: [
        ChatCompletionMessage(role: ChatRole.user, content: prompt),
      ],
      temperature: 0.7,
    ).map((chunk) => chunk.choices[0].delta.content ?? "");
  }
}
```
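A minimal sketch of how you might wire this service in at startup. The model path is a placeholder; in a real app it points at the directory where you cached the downloaded weights (see Part 7):

```dart
import 'dart:io';

// Uses the LLMEngine class defined above.
// The path below is a placeholder for the cached-weights directory.
final engine = LLMEngine('/data/data/com.example.app/files/llama-3-8b-q4f16_1');

Future<void> bootstrap() async {
  await engine.init(); // Heavy: show a "boot" screen while this runs.

  // Print tokens as they stream in.
  await for (final token in engine.generate('Explain quantization in one line.')) {
    stdout.write(token);
  }
}
```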
Part 4: Performance Benchmarks (The Truth)
I tested this on two devices:
1. iPhone 15 Pro (A17 Pro, 8GB RAM).
2. Pixel 7 (Tensor G2, 8GB RAM).
Metric 1: Load Time (Cold Start)
- iPhone: 4.2 seconds.
- Pixel 7: 6.5 seconds.
- Analysis: Acceptable for a "boot" screen, but you can't instant-launch the chat.
Metric 2: Speed (Tokens Per Second)
- iPhone: 22 tokens/sec. (This is faster than human reading speed!)
- Pixel 7: 12 tokens/sec. (Slightly sluggish, but usable).
Metric 3: Battery Drain
- Running the LLM fully engages the GPU/NPU.
- Drain: ~1% battery per minute of active generation.
- Warning: The phone gets HOT. You need to manage thermal throttling.
Part 5: Chat UI & State Management
Since the response streams in, we need a robust UI that doesn't flicker.
We use a StreamBuilder or Riverpod StreamProvider.
```dart
// UI snippet: rebuilds as each new chunk of the response streams in.
StreamBuilder<String>(
  stream: _llmService.currentResponseStream,
  builder: (context, snapshot) {
    final text = snapshot.data ?? "";
    return MarkdownBody(data: text);
  },
)
```
Memory Management:
The chat history grows with every turn, but the LLM has a fixed context window (usually 4096 or 8192 tokens). You must trim the history: once the running token count exceeds your budget (say ~4000 tokens), drop the oldest messages before sending the next prompt. A minimal sketch follows below.
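Here is a minimal sketch of that trimming policy, assuming a rough heuristic of ~4 characters per token (real tokenizers differ, so treat the budget as approximate):

```dart
// Rough trimming sketch. Token counts are estimated at ~4 characters per
// token, which is only a heuristic; a real tokenizer gives exact counts.
class ChatMessage {
  final String role; // 'user' or 'assistant'
  final String content;
  ChatMessage(this.role, this.content);
}

int estimateTokens(String text) => (text.length / 4).ceil();

List<ChatMessage> trimHistory(List<ChatMessage> history, {int budget = 4000}) {
  final trimmed = List<ChatMessage>.from(history);
  int total = trimmed.fold(0, (sum, m) => sum + estimateTokens(m.content));

  // Drop the oldest messages until the estimated context fits the budget.
  while (total > budget && trimmed.length > 1) {
    total -= estimateTokens(trimmed.removeAt(0).content);
  }
  return trimmed;
}
```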
Part 6: Function Calling (The Agentic Mobile App)
Here is where it gets crazy. We can define tools (Calendar, Contacts) and give them to Llama 3 running on-device.
1. User: "Schedule a meeting with Ajay at 5 PM."
2. Llama 3 (offline): determines the intent is `calendar_add`.
3. Llama 3 outputs JSON: `{ "action": "calendar_add", "time": "17:00", "person": "Ajay" }`.
4. Flutter app: parses the JSON and calls the Android Calendar API (see the dispatch sketch after this list).
You now have Siri, but private, smarter, and fully under your control.
Part 7: Distribution Challenges
You have a 3.5GB model. You cannot bundle it in the APK/IPA (the App Store's hard limit is 4GB, but the realistic ceiling for an install users will tolerate is well under 100MB).
Solution: "DLC" Pattern.
1. Publish a lightweight Flutter chat app (~40MB).
2. On first launch, show a "Downloading AI Model..." screen.
3. Download the 3.5GB quantized model from your CDN (R2/S3).
4. Cache it in the app's documents directory (see the download sketch after this list).
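A rough sketch of that download-and-cache step using the `path_provider` and `dio` packages. The package choice and the CDN URL are assumptions; any HTTP client works, and a production version should add resume support, checksum verification, and a Wi-Fi-only option:

```dart
import 'dart:io';

import 'package:dio/dio.dart';
import 'package:path_provider/path_provider.dart';

// Sketch only: the CDN URL is a placeholder.
Future<String> ensureModelDownloaded({
  String url = 'https://cdn.example.com/llama-3-8b-q4f16_1.tar',
}) async {
  final docs = await getApplicationDocumentsDirectory();
  final target = File('${docs.path}/llama-3-8b-q4f16_1.tar');

  // Already cached from a previous launch? Skip the 3.5GB download.
  if (await target.exists()) return target.path;

  await Dio().download(
    url,
    target.path,
    onReceiveProgress: (received, total) {
      if (total > 0) {
        print('Model download: ${(received / total * 100).toStringAsFixed(1)}%');
      }
    },
  );
  return target.path;
}
```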
Cost Warning: If 1,000 users download 3.5GB, that is 3.5TB of bandwidth. Use Cloudflare R2 (zero egress fees) or risk bankruptcy.
Conclusion: The "Private AI" Era
We are entering a bifurcated world.
- Cloud AI (GPT-5): For massive reasoning, coding, and creative writing.
- Edge AI (Llama 3 Mobile): For personal assistance, privacy-sensitive data (Health, Finance), and zero-latency UI helpers.
As a Flutter developer, mastering Local LLMs puts you ahead of 99% of the market. You are no longer just a UI builder. You are an AI Engineer optimizing neural weights for silicon.
The hardware is ready. The software is ready. Go build something offline.
About the Author: Sachin Sharma is a Mobile Architect obsessed with Privacy and Performance. He has shipped on-device ML apps that serve millions of users without a single API call.
