Running Llama 3 on Mobile: The Ultimate Guide to Local LLMs with Flutter
Run Llama 3 locally on Android and iOS with Flutter. A complete guide to using MLC LLM, optimizing quantization (q4f16), and building a private, offline chatbot.

For the last two years, "AI" has meant "API call": you send your data to OpenAI, they process it, and they send the response back.
- Pros: Easy to implement. Powerful models.
- Cons: Expensive. High latency. Zero privacy. No offline mode.
The game has changed. With the release of Llama 3, Phi-3, and Gemma, we now have "Small Language Models" (SLMs) that are smart enough to be useful and small enough to fit in RAM.
Today, we are going to do something that sounds impossible: We are going to run a Llama 3 8B model entirely on your phone, integrated into a Flutter app. No internet required. 0ms network latency. 100% privacy.
Part 1: The Tech Stack (MLC LLM)
You can't just run Python/PyTorch on a phone. It's too slow. We need Hardware Acceleration.
- iOS: Metal (GPU).
- Android: OpenCL / Vulkan (GPU).
MLC LLM (Machine Learning Compilation) is the magic tool. It takes a HuggingFace model, compiles it into a binary format optimized for specific GPUs (using TVM Unity), and exposes a C++ API.
We will use:
1. Llama-3-8B-Instruct.
2. MLC LLM for model compilation.
3. Flutter for the UI.
4. FFI (Foreign Function Interface) to bridge Dart and C++ (see the sketch just below).
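You will almost never call FFI by hand, because the mlc_llm wrapper handles the bridge for you. But as a rough sketch of what that bridge looks like under the hood, binding a single native symbol with dart:ffi goes something like this. The symbol name `mlc_create_chat_module` and its C signature are hypothetical placeholders for illustration, not MLC's real API:

```dart
// Conceptual sketch only: the mlc_llm Flutter package wraps this for you.
// The symbol name and signature below are hypothetical placeholders.
import 'dart:ffi';
import 'package:ffi/ffi.dart';

// Assumed C signature: void* mlc_create_chat_module(const char* model_lib);
typedef _CreateNative = Pointer<Void> Function(Pointer<Utf8> modelLib);
typedef _CreateDart = Pointer<Void> Function(Pointer<Utf8> modelLib);

void main() {
  // Load the compiled model library produced by `mlc_llm compile`.
  final dylib = DynamicLibrary.open('libmodel_lib.so');

  // Look up the native function and call it with a UTF-8 string.
  final create = dylib.lookupFunction<_CreateNative, _CreateDart>(
    'mlc_create_chat_module',
  );
  final name = 'llama_q4f16_1'.toNativeUtf8();
  final handle = create(name);
  malloc.free(name);
  print('Native chat module handle: $handle');
}
```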
Part 2: Preparing the Model (Quantization)
A standard 8B-parameter model at float16 precision weighs about 16GB (8 billion parameters × 2 bytes each). Most phones have 8GB or 12GB of RAM, so if you try to load 16GB, the OS kills your app instantly.
We must quantize. We reduce the weight precision from 16-bit to 4-bit (the "q4f16_1" scheme, which keeps activations in float16). This shrinks the model to ~3.5GB. The quality loss is negligible for chat tasks.
Compilation Command (using mlc_llm CLI):
```bash
mlc_llm compile ./Llama-3-8B-Instruct \
  --quantization q4f16_1 \
  --device android \
  --output ./dist/llama-3-8b-q4f16_1.tar
```
This generates:
- model_lib.so (The compiled computation graph).
- params (The binary weights).
Part 3: The Flutter Integration
MLC provides a wrapper for Flutter. But integrating it into a production app architecture (like the one I wrote about in my Clean Architecture blog) requires care.
1. Dependency
```yaml
dependencies:
  mlc_llm: ^0.1.0
  provider: ^6.0.0
```
2. The Engine Service
We don't want the UI to talk to the engine directly. We wrap it in a Service.
```dart
import 'package:mlc_llm/mlc_llm.dart';

class LLMEngine {
  late MLCEngine _engine;
  final String modelPath;
  bool _isLoaded = false;

  LLMEngine(this.modelPath);

  Future<void> init() async {
    _engine = MLCEngine();
    // This is the heavy part. Loads ~3.5GB of weights into RAM.
    await _engine.reload(modelPath, modelLib: 'llama_q4f16_1');
    _isLoaded = true;
  }

  Stream<String> generate(String prompt) {
    if (!_isLoaded) throw Exception("Model not loaded");
    // Streaming response, token by token.
    return _engine.chat.completions.createStream(
      messages: [
        ChatCompletionMessage(role: ChatRole.user, content: prompt),
      ],
      temperature: 0.7,
    ).map((chunk) => chunk.choices[0].delta.content ?? "");
  }
}
```
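A minimal sketch of how you might wire this service in at startup. The model path is a placeholder; in a real app it points at the directory where you cached the downloaded weights (see Part 7):

```dart
import 'dart:io';

// Uses the LLMEngine class defined above.
// The path below is a placeholder for the cached-weights directory.
final engine = LLMEngine('/data/data/com.example.app/files/llama-3-8b-q4f16_1');

Future<void> bootstrap() async {
  await engine.init(); // Heavy: show a "boot" screen while this runs.

  // Print tokens as they stream in.
  await for (final token in engine.generate('Explain quantization in one line.')) {
    stdout.write(token);
  }
}
```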
Part 4: Performance Benchmarks (The Truth)
I tested this on two devices:
1. iPhone 15 Pro (A17 Pro, 8GB RAM).
2. Pixel 7 (Tensor G2, 8GB RAM).
Metric 1: Load Time (Cold Start)
- iPhone: 4.2 seconds.
- Pixel 7: 6.5 seconds.
- Analysis: Acceptable for a "boot" screen, but you can't instant-launch the chat.
Metric 2: Speed (Tokens Per Second)
- iPhone: 22 tokens/sec. (This is faster than human reading speed!)
- Pixel 7: 12 tokens/sec. (Slightly sluggish, but usable).
Metric 3: Battery Drain
- Running the LLM fully engages the GPU/NPU.
- Drain: ~1% battery per minute of active generation.
- Warning: The phone gets HOT. You need to manage thermal throttling.
Part 5: Chat UI & State Management
Since the response streams in, we need a robust UI that doesn't flicker.
We use a StreamBuilder or Riverpod StreamProvider.
```dart
// UI snippet: rebuilds as each new chunk of the response streams in.
StreamBuilder<String>(
  stream: _llmService.currentResponseStream,
  builder: (context, snapshot) {
    final text = snapshot.data ?? "";
    return MarkdownBody(data: text);
  },
)
```
Memory Management:
The chat history grows with every turn, but the LLM has a fixed context window (usually 4096 or 8192 tokens). You must trim the history: once the running token count exceeds your budget (say ~4000 tokens), drop the oldest messages before sending the next prompt. A minimal sketch follows below.
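Here is a minimal sketch of that trimming policy, assuming a rough heuristic of ~4 characters per token (real tokenizers differ, so treat the budget as approximate):

```dart
// Rough trimming sketch. Token counts are estimated at ~4 characters per
// token, which is only a heuristic; a real tokenizer gives exact counts.
class ChatMessage {
  final String role; // 'user' or 'assistant'
  final String content;
  ChatMessage(this.role, this.content);
}

int estimateTokens(String text) => (text.length / 4).ceil();

List<ChatMessage> trimHistory(List<ChatMessage> history, {int budget = 4000}) {
  final trimmed = List<ChatMessage>.from(history);
  int total = trimmed.fold(0, (sum, m) => sum + estimateTokens(m.content));

  // Drop the oldest messages until the estimated context fits the budget.
  while (total > budget && trimmed.length > 1) {
    total -= estimateTokens(trimmed.removeAt(0).content);
  }
  return trimmed;
}
```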
Part 6: Function Calling (The Agentic Mobile App)
Here is where it gets crazy. We can define tools (Calendar, Contacts) and give them to Llama 3 running on-device.
1. User: "Schedule a meeting with Ajay at 5 PM."
2. Llama 3 (offline): determines the intent is `calendar_add`.
3. Llama 3 outputs JSON: `{ "action": "calendar_add", "time": "17:00", "person": "Ajay" }`.
4. Flutter app: parses the JSON and calls the Android Calendar API (see the dispatch sketch after this list).
You now have Siri, but private, smarter, and fully under your control.
Part 7: Distribution Challenges
You have a 3.5GB model. You cannot bundle it in the APK/IPA (the App Store's hard limit is 4GB, but the realistic ceiling for an install users will tolerate is well under 100MB).
Solution: "DLC" Pattern.
1. Publish a lightweight Flutter chat app (~40MB).
2. On first launch, show a "Downloading AI Model..." screen.
3. Download the 3.5GB quantized model from your CDN (R2/S3).
4. Cache it in the app's documents directory (see the download sketch after this list).
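A rough sketch of that download-and-cache step using the `path_provider` and `dio` packages. The package choice and the CDN URL are assumptions; any HTTP client works, and a production version should add resume support, checksum verification, and a Wi-Fi-only option:

```dart
import 'dart:io';

import 'package:dio/dio.dart';
import 'package:path_provider/path_provider.dart';

// Sketch only: the CDN URL is a placeholder.
Future<String> ensureModelDownloaded({
  String url = 'https://cdn.example.com/llama-3-8b-q4f16_1.tar',
}) async {
  final docs = await getApplicationDocumentsDirectory();
  final target = File('${docs.path}/llama-3-8b-q4f16_1.tar');

  // Already cached from a previous launch? Skip the 3.5GB download.
  if (await target.exists()) return target.path;

  await Dio().download(
    url,
    target.path,
    onReceiveProgress: (received, total) {
      if (total > 0) {
        print('Model download: ${(received / total * 100).toStringAsFixed(1)}%');
      }
    },
  );
  return target.path;
}
```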
Cost Warning: If 1,000 users download 3.5GB, that is 3.5TB of bandwidth. Use Cloudflare R2 (zero egress fees) or risk bankruptcy.
Conclusion: The "Private AI" Era
We are entering a bifurcated world.
- Cloud AI (GPT-5): For massive reasoning, coding, and creative writing.
- Edge AI (Llama 3 Mobile): For personal assistance, privacy-sensitive data (Health, Finance), and zero-latency UI helpers.
As a Flutter developer, mastering Local LLMs puts you ahead of 99% of the market. You are no longer just a UI builder. You are an AI Engineer optimizing neural weights for silicon.
The hardware is ready. The software is ready. Go build something offline.
About the Author: Sachin Sharma is a Mobile Architect obsessed with Privacy and Performance. He has shipped on-device ML apps that serve millions of users without a single API call.
