By Malik Chohra

Building Real-Time Voice AI in React Native with Whisper

A practical guide to implementing real-time audio streaming, Voice Activity Detection (VAD), and OpenAI's Whisper API on iOS and Android.

Push-to-talk interfaces are dead. In 2026, users expect AI assistants to behave like human conversational partners, listening continuously, interrupting gracefully, and responding instantly. Building fluid, real-time voice architectures in React Native requires moving past simple `.wav` file uploads and diving into binary WebSockets, Voice Activity Detection (VAD), and streaming Whisper APIs. Here is exactly how we build it.

The Push-to-Talk Fallacy

If you build a Voice AI app where the user presses a button, speaks, presses stop, and then waits 3 seconds for Whisper to transcribe the file, your app will fail.

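A conversational AI must achieve a "Voice-to-Voice" latency of under 800 milliseconds to feel natural. Standard HTTP `multipart/form-data` uploads cannot hit this target: the upload can only start once the recording is finished, so upload, transcription, and response generation all run back to back after the user stops talking. The only way to achieve conversational latency is to stream audio chunks continuously while the user is still speaking.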

Step 1: Capturing Raw Audio Buffers

Instead of saving a file to the device's disk, we need to intercept the microphone's raw PCM data buffer directly in memory. We use the New Architecture (Expo SDK 53 TurboModules) to grab 100 ms chunks of `Float32` audio data as soon as the microphone produces them. At 16 kHz mono, each 100 ms chunk holds 1,600 samples.

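// Intercepting raw audio buffers in React Native
import { AudioModule } from 'expo-audio';

// `startRecordingStream` is assumed here to deliver raw chunks through a callback.
// Verify that your expo-audio version (or your own native module) exposes an equivalent.

// 16 kHz mono is a common input format for speech models;
// confirm the sample rate and encoding your target API expects.
AudioModule.startRecordingStream(
  { sampleRate: 16000, channels: 1 },
  (audioBuffer: Float32Array) => {
    // audioBuffer is a raw Float32Array holding one chunk of samples
    const socket = webSocketRef.current;
    // Send immediately, but only once the socket is open:
    // send() throws while the connection is still being established
    if (socket?.readyState === WebSocket.OPEN) {
      socket.send(audioBuffer);
    }
  }
);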
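
Many speech APIs expect 16-bit PCM rather than `Float32` samples, so the chunks usually need converting before they leave the device. The following is a minimal sketch of that conversion, assuming samples are normalized to the range -1 to 1. The helper names `samplesPerChunk` and `floatTo16BitPCM` are hypothetical and not part of any Expo API.

```typescript
// Number of samples in one chunk, e.g. 16 kHz for 100 ms gives 1600 samples.
function samplesPerChunk(sampleRate: number, chunkMs: number): number {
  return Math.round((sampleRate * chunkMs) / 1000);
}

// Convert Float32 samples in [-1, 1] to 16-bit signed little-endian PCM.
function floatTo16BitPCM(input: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(input.length * 2); // 2 bytes per sample
  const view = new DataView(buffer);
  for (let i = 0; i < input.length; i++) {
    // Clamp first so out-of-range samples saturate instead of wrapping around.
    const s = Math.max(-1, Math.min(1, input[i]));
    // The negative range is one step wider than the positive range.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}
```

Converting also halves the payload: a 1,600-sample chunk shrinks from 6,400 bytes as `Float32` to 3,200 bytes as 16-bit PCM.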

Step 2: Voice Activity Detection (VAD)

If you stream 100% of the microphone data to your backend, you will run up server costs processing background noise (air conditioners, traffic). You must implement a VAD filter locally on the device.

We deploy a tiny, quantized neural VAD model (specifically `silero-vad` converted for mobile) that runs natively through an ONNX runtime and is called from JavaScript over JSI. It analyzes each raw buffer and returns a probability (e.g., `0.98`) that human speech is present.

The VAD Architecture:

  • Buffer silence: Drop the packets. Do not send to network.
  • User starts speaking: The VAD triggers "Active", we open the WebSocket and begin sending chunks.
  • User stops speaking (1 second of silence): The VAD triggers "Endpointing". We signal the LLM that the user is done, triggering the AI response instantly.
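
The three states above can be sketched as a small state machine that consumes one VAD probability per frame. This is a minimal illustration, not the production implementation: the `Endpointer` class, its option names, and the 0.5 threshold in the usage note are all assumptions.

```typescript
type VadEvent = "speech_start" | "speech_end" | null;

interface EndpointerOptions {
  threshold: number;    // probability at or above which a frame counts as speech
  frameMs: number;      // duration of each analysed frame
  endSilenceMs: number; // continuous silence required before endpointing
}

class Endpointer {
  private opts: EndpointerOptions;
  private active = false;
  private silenceMs = 0;

  constructor(opts: EndpointerOptions) {
    this.opts = opts;
  }

  // True while the user is considered to be speaking (frames should be sent).
  get isActive(): boolean {
    return this.active;
  }

  // Feed one VAD probability per frame; returns an event on state changes only.
  push(speechProbability: number): VadEvent {
    const isSpeech = speechProbability >= this.opts.threshold;
    if (!this.active) {
      if (!isSpeech) return null; // idle and silent: drop the frame
      this.active = true;
      this.silenceMs = 0;
      return "speech_start";
    }
    if (isSpeech) {
      this.silenceMs = 0; // any speech resets the silence timer
      return null;
    }
    this.silenceMs += this.opts.frameMs;
    if (this.silenceMs >= this.opts.endSilenceMs) {
      this.active = false;
      this.silenceMs = 0;
      return "speech_end"; // endpoint reached: signal the backend
    }
    return null;
  }
}
```

With `new Endpointer({ threshold: 0.5, frameMs: 100, endSilenceMs: 1000 })`, ten consecutive silent frames after speech produce `"speech_end"`. The silence window adds directly to perceived response latency, so it is worth treating `endSilenceMs` as a tunable value.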

Step 3: Streaming to OpenAI's Realtime API

Instead of routing through normal text-based endpoints, we connect the React Native WebSocket directly to OpenAI's Realtime WebSocket (or to a proxy on an edge platform such as Cloudflare that handles authentication, so the API key never ships inside the app).

As the user talks, OpenAI's Whisper model transcribes the incoming audio chunks in parallel. By the time the VAD signals the "Endpoint", Whisper has already processed roughly 95% of the sentence. The LLM generates the text response in ~200ms, which immediately triggers TTS (Text-to-Speech) generation, and the resulting audio streams back down to the mobile device.
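
To our knowledge, the Realtime API's WebSocket interface takes audio as base64 text inside JSON events rather than as raw binary frames, so either the app or the proxy has to wrap each chunk. The sketch below builds those messages using the `input_audio_buffer.append`, `input_audio_buffer.commit`, and `response.create` event types. The helper names are hypothetical, and the event shapes should be checked against the current API reference before use.

```typescript
const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Dependency-free base64 encoder, so the sketch does not rely on Buffer or btoa.
function toBase64(bytes: Uint8Array): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64[b0 >> 2];
    out += B64[((b0 & 0x03) << 4) | (b1 >> 4)];
    out += has1 ? B64[((b1 & 0x0f) << 2) | (b2 >> 6)] : "="; // pad short groups
    out += has2 ? B64[b2 & 0x3f] : "=";
  }
  return out;
}

// Wrap one chunk of 16-bit PCM bytes in an append event.
function appendAudioEvent(pcm16Bytes: Uint8Array): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: toBase64(pcm16Bytes),
  });
}

// Messages to send when the on-device VAD reports the end of the user's turn.
function endOfTurnEvents(): string[] {
  return [
    JSON.stringify({ type: "input_audio_buffer.commit" }),
    JSON.stringify({ type: "response.create" }),
  ];
}
```

Each string returned here would be passed to `webSocketRef.current.send(...)`. The explicit commit and response events assume that endpointing is driven by the on-device VAD rather than by server-side turn detection.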

Handling "Barge-in" (Interruptions)

The hardest part of voice AI is not speaking; it's shutting up. If the AI is giving a 30-second speech and the user says "Stop, actually...", the AI must cut itself off immediately.

In React Native, this is solved with full-duplex WebSockets. While the mobile speaker is playing the AI's audio chunks, the microphone keeps feeding the VAD. Echo Cancellation removes the AI's own playback from the microphone signal, so any speech the VAD still detects must come from the user. When that happens, the app fires an `'interrupt'` JSON event over the WebSocket. The backend stops sending TTS chunks instantly, and the React Native audio player flushes its playback buffer.
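
A minimal sketch of the client side of this flow follows. It assumes a playback queue of TTS chunks and a socket with a `send` method. The `BargeInController` class and `OutboundSocket` interface are hypothetical names, and the `{ type: "interrupt" }` payload simply mirrors the event described above.

```typescript
interface OutboundSocket {
  send(data: string): void;
}

class BargeInController {
  private socket: OutboundSocket;
  private queue: ArrayBuffer[] = [];
  private speaking = false;

  constructor(socket: OutboundSocket) {
    this.socket = socket;
  }

  get pendingChunks(): number {
    return this.queue.length;
  }

  // Called for each TTS audio chunk received from the backend.
  enqueue(chunk: ArrayBuffer): void {
    this.queue.push(chunk);
    this.speaking = true;
  }

  // Called by the audio player whenever it is ready for the next chunk.
  next(): ArrayBuffer | undefined {
    const chunk = this.queue.shift();
    if (chunk === undefined) this.speaking = false; // playback has drained
    return chunk;
  }

  // Called when the VAD reports user speech. Returns true if playback was cut.
  onUserSpeech(): boolean {
    if (!this.speaking) return false; // nothing to interrupt
    this.queue = [];                  // flush the local playback buffer
    this.speaking = false;
    this.socket.send(JSON.stringify({ type: "interrupt" })); // stop backend TTS
    return true;
  }
}
```

The real audio player also has to stop the chunk that is currently playing, which depends on the playback library in use. The queue flush only covers audio that has not yet been handed to the player.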

Summary

  • Abandon `multipart/form-data` uploads. True voice agents require binary WebSockets.
  • Extract raw `Float32` audio buffers directly from the microphone using Expo native audio modules.
  • Run VAD (Voice Activity Detection) on-device to save bandwidth and trigger instant AI responses.
  • Use full-duplex WebSockets, together with Echo Cancellation, to handle human interruptions ("Barge-in") cleanly.
