By Malik Chohra

On-Device LLM in React Native with llama.rn: A Practical 2026 Guide

Run a large language model entirely on the user's device, no API key, no network call, no token cost. Here's how llama.rn (the React Native binding for llama.cpp) works, when it makes sense, and which models to pick.

On-device LLM inference means your React Native app runs a language model fully on the user's phone, no API call, no network round-trip, no per-token cost. In 2026, this is finally practical for a real subset of use cases. llama.rn, the React Native binding to llama.cpp, is the cleanest way to do it. This post covers what works, what doesn't, and which models to actually ship.

💡 Pre-integrated

The AI Pro tier of AI Mobile Launcher ships llama.rn pre-wired: a model download UI with progress, an inference service, and a user-facing toggle to switch between cloud (Gemini / OpenAI) and on-device. Skip the integration headache.

“On-device”, “local”, and “offline” LLM: the same thing?

Mostly, yes, and the words matter for what you are actually building. A local LLM in React Native is a model whose weights live on the phone and whose inference runs on the phone's own CPU/GPU. That is the same thing people mean by an on-device LLM. “Offline” is the consequence: because nothing leaves the device, the feature keeps working with the network off. So if you searched for “react native local llm” or “react native offline AI”, this guide is the implementation you want, and llama.rn is the binding that gets you there.

Why on-device, when cloud LLMs are cheaper than ever?

Three reasons that actually justify the engineering effort:

  • Privacy. User data never leaves the device. For health, journaling, finance, or any app handling personally sensitive content, this isn't a nice-to-have; in some markets it's a regulatory requirement.
  • Offline. The app works on a plane, in the metro, in rural areas. For a fitness tracker or daily-journal app, a 30-second “you're offline” gap is product-killing.
  • Per-user cost ceiling. Power users can hit your cloud quota hard. On-device inference puts a flat ceiling on inference cost: the user downloads the model once, and you pay nothing per token.

Bad reasons to use on-device LLMs: latency (cloud is usually faster on modern phones), quality (cloud is still better at the top end), and “it sounds cool.” Pick on-device when one of the three reasons above is load-bearing for your product.

What is llama.rn?

llama.rn is a React Native module that wraps llama.cpp, the C++ inference engine that runs LLaMA, Mistral, Phi, Qwen, and dozens of other open models. It exposes a JavaScript API for loading models, running completions, and streaming tokens.

It uses GGUF format, the standard file format for quantized open-source LLMs. Not ONNX (different format, different toolchain), not Core ML (Apple-only), not TFLite. GGUF is the format you'll find on Hugging Face for almost every model you'd want to run on a phone.

On iOS, llama.rn uses Metal acceleration. On Android, it uses OpenCL or CPU depending on device. In our benchmarks on a 2024 iPhone, a 3B-parameter quantized model produces 25–40 tokens/sec, fast enough for snappy chat UX.

Which models actually fit on a phone?

Hard constraints: phones have 4–8GB of RAM, and your model has to share that with the OS and your app. Practically, that means the model file should be under ~2GB and active memory under ~1.5GB.

What that translates to in 2026:

  • Phi-3.5 mini (3.8B) at Q4_K_M quantization, ~2.3GB on disk (slightly above the ~2GB guideline, so treat it as the upper limit), ~1.3GB active memory. Strong general-purpose model, surprisingly capable for its size.
  • Llama 3.2 3B Instruct at Q4_K_M, ~1.9GB on disk. The standard for chat-style on-device.
  • Qwen 2.5 1.5B at Q4_K_M, ~1.0GB on disk. Best for older devices or when you need multiple models loaded.
  • Gemma 2 2B at Q4_K_M, ~1.5GB on disk. Strong on creative writing, weaker on code.

Models below 1.5B parameters get dramatically worse: they hallucinate more, follow instructions less reliably, and can't hold a conversation past a few turns. 3B is the sweet spot for “feels like a real assistant” on a phone in 2026.
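To make the sizing rule concrete, here is a minimal sketch that picks the largest model from the list above that fits a disk budget. The sizes are the approximate figures quoted in this post. The `diskBudgetGB` rule (a quarter of total RAM, capped at ~2GB) and the helper names are assumptions for illustration, not something llama.rn provides; total RAM would have to come from a device-info library of your choice.

```typescript
// Approximate Q4_K_M file sizes, copied from the list above.
interface ModelOption {
  name: string;
  paramsBillions: number;
  diskGB: number;
}

const CATALOG: ModelOption[] = [
  { name: 'Phi-3.5 mini', paramsBillions: 3.8, diskGB: 2.3 },
  { name: 'Llama 3.2 3B Instruct', paramsBillions: 3, diskGB: 1.9 },
  { name: 'Gemma 2 2B', paramsBillions: 2, diskGB: 1.5 },
  { name: 'Qwen 2.5 1.5B', paramsBillions: 1.5, diskGB: 1.0 },
];

// ASSUMPTION: give the model file about a quarter of total RAM, capped at the
// ~2GB guideline. This is a starting point to tune on real devices.
function diskBudgetGB(totalRamGB: number): number {
  return Math.min(2, totalRamGB * 0.25);
}

// Hypothetical helper: the largest model (by parameter count) whose file fits.
function pickModel(
  budgetGB: number,
  catalog: ModelOption[] = CATALOG,
): ModelOption | undefined {
  const fitting = catalog.filter((m) => m.diskGB <= budgetGB);
  fitting.sort((a, b) => b.paramsBillions - a.paramsBillions);
  return fitting[0];
}
```

Under these assumptions an 8GB phone gets Llama 3.2 3B, a 6GB phone gets Gemma 2 2B, and a 4GB phone gets Qwen 2.5 1.5B. Phi-3.5 mini is never selected because its file exceeds the 2GB cap; raise the cap if you want to offer it on high-end devices.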

Model download UX is the real engineering challenge

The model is 1–2GB. Users will not download that on cellular without a warning, and they will absolutely uninstall an app that surprises them with a 2GB download on first launch.

What works in production:

  • Make on-device opt-in, not default. The user picks “I want my AI to run on-device” in Settings.
  • Detect connection type (@react-native-community/netinfo) and warn before downloading on cellular.
  • Show progress with byte counts, ETA, and a pause/resume button. Users tolerate long downloads when they trust they can stop.
  • Use expo-file-system's background download API so the download survives the app being backgrounded.
  • Verify the download with a checksum before activating it: corrupt model files crash inference in confusing ways.
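The rules above can be sketched as three small pure functions, kept free of any React Native imports so they are easy to unit test. All names here are hypothetical. The connection type strings follow the values NetInfo reports in its `type` field (`'wifi'`, `'cellular'`, `'none'`, and so on), and the hash itself is assumed to be computed elsewhere by whatever crypto library you use.

```typescript
type DownloadDecision =
  | 'start'
  | 'confirm-cellular'
  | 'blocked-offline'
  | 'not-opted-in';

// Decide what the UI should do before any bytes are fetched.
function decideDownload(optedIn: boolean, connectionType: string): DownloadDecision {
  if (!optedIn) return 'not-opted-in';
  if (connectionType === 'none') return 'blocked-offline';
  if (connectionType === 'cellular') return 'confirm-cellular';
  return 'start';
}

interface ProgressView {
  percent: number;
  label: string; // byte counts, e.g. "500 / 2000 MB"
  etaSeconds: number | null; // null until there is enough data to estimate
}

// Turn raw byte counts into what the progress bar shows.
function describeProgress(
  writtenBytes: number,
  totalBytes: number,
  elapsedMs: number,
): ProgressView {
  const MB = 1024 * 1024;
  const percent = totalBytes > 0 ? Math.floor((writtenBytes / totalBytes) * 100) : 0;
  const remaining = Math.max(0, totalBytes - writtenBytes);
  // ETA assumes the average speed so far continues.
  const etaSeconds =
    writtenBytes > 0 && elapsedMs > 0
      ? Math.round((remaining * elapsedMs) / writtenBytes / 1000)
      : null;
  const label = `${(writtenBytes / MB).toFixed(0)} / ${(totalBytes / MB).toFixed(0)} MB`;
  return { percent, label, etaSeconds };
}

// Compare a computed hash of the file against the published one.
function checksumMatches(actualHex: string, expectedHex: string): boolean {
  return actualHex.trim().toLowerCase() === expectedHex.trim().toLowerCase();
}
```

In an app, `decideDownload` would run on the NetInfo state before the download starts, `describeProgress` would be fed from the download's progress callback, and the model would only be marked active once `checksumMatches` returns true.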

The minimal integration pattern

Three steps:

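import { initLlama } from 'llama.rn';

// Your own UI hook (placeholder): append each streamed token to the visible reply
const onTokenReceived = (text: string) => { /* e.g. update chat state */ };

// 1. Load model (do this once, persist the context)
const ctx = await initLlama({
  model: '/path/to/model.gguf',
  n_ctx: 2048,
  n_gpu_layers: 99, // Metal on iOS, OpenCL on Android
});

// 2. Run completion with streaming
const result = await ctx.completion(
  {
    prompt: 'You are a helpful assistant.\nUser: How do I sleep better?\nAssistant:',
    n_predict: 256,
    stop: ['User:', '\n\n'], // note: '\n\n' ends the reply at the first blank line
  },
  (data) => {
    // Stream tokens back to your UI as they arrive
    onTokenReceived(data.token);
  }
);
// result.text holds the full response once generation finishes

// 3. Release when done (free RAM for the rest of the app)
await ctx.release();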

That's the entire surface area you need to ship a chat feature. Streaming is essential: without it, users see a 5–10 second blank screen waiting for the full response. With it, the first token arrives in 200–400ms.
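To check the first-token and tokens-per-second numbers on your own devices, a small accumulator can sit inside the streaming callback. This is a sketch with a hypothetical `StreamMeter` class; the clock is injected so it can be tested without a real model.

```typescript
interface StreamStats {
  timeToFirstTokenMs: number | null;
  tokensPerSecond: number | null;
}

// Hypothetical helper: accumulates streamed tokens and records timing.
class StreamMeter {
  private readonly now: () => number;
  private readonly startedAt: number;
  private firstTokenAt: number | null = null;
  private lastTokenAt: number | null = null;
  private tokenCount = 0;
  private text = '';

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
    this.startedAt = now();
  }

  // Call once per streamed token; returns the full text so far for the UI.
  push(token: string): string {
    const t = this.now();
    if (this.firstTokenAt === null) this.firstTokenAt = t;
    this.lastTokenAt = t;
    this.tokenCount += 1;
    this.text += token;
    return this.text;
  }

  stats(): StreamStats {
    if (this.firstTokenAt === null || this.lastTokenAt === null) {
      return { timeToFirstTokenMs: null, tokensPerSecond: null };
    }
    const elapsedMs = this.lastTokenAt - this.startedAt;
    return {
      timeToFirstTokenMs: this.firstTokenAt - this.startedAt,
      // Throughput measured from request start, so it includes prompt processing.
      tokensPerSecond: elapsedMs > 0 ? (this.tokenCount * 1000) / elapsedMs : null,
    };
  }
}
```

Create the meter just before calling `ctx.completion`, call `meter.push(...)` with each streamed token in the callback, and log `meter.stats()` when the completion resolves.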

Battery and thermal considerations

On-device inference is the most CPU/GPU-intensive thing your app can do. Two minutes of continuous LLM inference can drain ~3–5% of battery and warm the device noticeably.

Practical rules:

  • Cap continuous inference at ~30 seconds per turn. Short-circuit anything longer.
  • Free the model context when the user navigates away from the AI screen; don't keep it loaded forever.
  • Detect low battery (expo-battery) and offer to switch to cloud below ~20%.
  • Don't run inference in the background. iOS will kill your app and Android users will notice the drain.
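These rules can be collapsed into one policy function that the AI screen consults while a turn is running. This is a minimal sketch with hypothetical names. It assumes the battery level is a 0–1 fraction with -1 meaning "unknown" (the convention expo-battery follows, as far as I know) and that `appState` carries React Native's `AppState` values such as `'active'` and `'background'`.

```typescript
interface InferenceConditions {
  batteryLevel: number; // 0..1, or -1 when the device does not report it
  appState: string; // 'active' | 'background' | 'inactive'
  elapsedMs: number; // time spent generating the current turn
}

type InferenceAction = 'continue' | 'stop' | 'offer-cloud';

const MAX_TURN_MS = 30_000; // ~30 seconds per turn
const LOW_BATTERY = 0.2; // offer cloud below ~20%

function nextAction(c: InferenceConditions): InferenceAction {
  // Never keep generating once the app leaves the foreground.
  if (c.appState !== 'active') return 'stop';
  // Short-circuit turns that run past the cap.
  if (c.elapsedMs >= MAX_TURN_MS) return 'stop';
  // Only act on battery when the level is actually known.
  if (c.batteryLevel >= 0 && c.batteryLevel < LOW_BATTERY) return 'offer-cloud';
  return 'continue';
}
```

On `'stop'`, the caller would cancel the in-flight completion (llama.rn contexts are assumed here to expose a stop call for this) and, if the user has left the AI screen, release the context as in step 3 of the integration snippet.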

When NOT to use on-device LLM

  • Anything multimodal. On-device vision/audio models exist but quality is far below cloud. Use Gemini.
  • Long-context tasks. 2k–4k context windows are typical on-device. Cloud models hit 1M+. Don't try to summarize a 100-page PDF on-device.
  • Tool use / agentic flows. Function calling reliability on small on-device models is poor. Use OpenAI's strict JSON mode in the cloud.
  • Apps where output quality is the differentiator. A 3B on-device model is not GPT-4. If your product hinges on getting the best possible output, ship cloud.

The pattern AI Mobile Launcher uses

The AI Pro tier ships a Redux slice, llm-preferences-slice.ts, that holds the user's provider choice (cloud Gemini, cloud OpenAI, or on-device llama.rn). A settings screen, LlmChoiceScreen, lets them flip it. Every AI feature reads the preference and routes accordingly.

Result: privacy-conscious users get on-device, cost-sensitive workloads get Gemini, latency-critical chat gets OpenAI, and you write each AI feature once.
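The routing idea can be sketched independently of Redux. The code below is not the AI Mobile Launcher source, which this post does not show. The `ProviderId` values, the `LlmProvider` interface, and `selectProvider` are all hypothetical names chosen to illustrate "read the preference, route accordingly".

```typescript
type ProviderId = 'cloud-gemini' | 'cloud-openai' | 'on-device';

// Every backend (cloud or llama.rn) is wrapped behind the same interface,
// so each AI feature is written once.
interface LlmProvider {
  id: ProviderId;
  complete(prompt: string, onToken: (text: string) => void): Promise<string>;
}

type ProviderRegistry = Partial<Record<ProviderId, LlmProvider>>;

// Resolve the user's stored preference to a concrete provider.
function selectProvider(preference: ProviderId, registry: ProviderRegistry): LlmProvider {
  const provider = registry[preference];
  if (!provider) {
    // Deliberately no silent fallback: a user who chose on-device for privacy
    // should not be routed to the cloud without being asked.
    throw new Error(`Provider "${preference}" is not available`);
  }
  return provider;
}
```

A feature would call `selectProvider(preference, registry).complete(prompt, onToken)`, where `preference` comes from the store and the on-device entry is only registered once the model has been downloaded and verified.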
