December 19, 2024 | Tutorials | AI

By Malik Chohra

Voice AI Integration in React Native: Complete Implementation Guide 2025

Build voice-powered AI features in React Native apps. Complete guide covering speech recognition, text-to-speech, and voice AI processing with practical examples.

How do you integrate Voice AI in React Native?

Integrate Voice AI in React Native using speech-to-text APIs (Whisper, Apple Speech, Android SpeechRecognizer), text-to-speech (ElevenLabs or native platform TTS), and an AI backend (OpenAI, Claude) to process the transcription. The core loop is always the same: record audio, transcribe it, run it through your LLM, speak the response back. The differences that actually matter are where that transcription happens and how you hide the latency from your user.

Voice input is one of those features that sounds simple until you build it. Recording a clip and posting it to Whisper is fifteen lines of code. Making it feel fast, reliable, and trustworthy in a real app is a different problem. This guide covers the three approaches you can take, where each one breaks down, and the UX decisions that separate apps users love from apps users abandon after two tries.

The three approaches to speech-to-text in React Native

There is no universally correct answer here. The right choice depends on your latency budget, your privacy requirements, and how much infrastructure you want to own. Here is how each approach actually behaves in production.

Approach 1: Whisper API (OpenAI cloud)

The Whisper API is the fastest path to production-quality transcription. You record audio with expo-av, encode it as an m4a or wav file, POST it to https://api.openai.com/v1/audio/transcriptions, and get back a transcript. Accuracy across accents and background noise is genuinely impressive. The model handles technical jargon, mixed languages in the same sentence, and low-quality mobile microphone recordings better than the platform APIs do.

The catch is round-trip latency. You are sending a potentially large audio file to a remote server and waiting for a response. For a typical ten-second voice note, that round trip adds 800ms to 2s depending on network conditions. For a health check-in flow where the user taps a button, speaks for thirty seconds, then waits for a summary, that is fine. For a conversational back-and-forth where the user expects a reply in under a second, it is not.

The recording setup with expo-av looks like this:

import { Audio } from 'expo-av';

async function startRecording() {
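  // Bail out early if the user denies microphone access.
  const { granted } = await Audio.requestPermissionsAsync();
  if (!granted) throw new Error('Microphone permission not granted');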
  await Audio.setAudioModeAsync({
    allowsRecordingIOS: true,
    playsInSilentModeIOS: true,
  });

  const { recording } = await Audio.Recording.createAsync(
    Audio.RecordingOptionsPresets.HIGH_QUALITY
  );

  return recording;
}

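// Assumes the recording has already been stopped with
// recording.stopAndUnloadAsync(), so the file at the URI is complete.
async function transcribeWithWhisper(recording: Audio.Recording) {
  const uri = recording.getURI();
  if (!uri) throw new Error('No recording URI');

  // React Native's FormData accepts a { uri, type, name } object as a file.
  const formData = new FormData();
  formData.append('file', {
    uri,
    type: 'audio/m4a',
    name: 'recording.m4a',
  } as any);
  formData.append('model', 'whisper-1');

  const response = await fetch(
    'https://api.openai.com/v1/audio/transcriptions',
    {
      method: 'POST',
      headers: {
        // For local testing only. In production, send the audio to your
        // own backend and keep the key there.
        Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      },
      body: formData,
    }
  );

  // Surface HTTP failures instead of returning an undefined transcript.
  if (!response.ok) {
    throw new Error(`Transcription failed with status ${response.status}`);
  }

  const data = await response.json();
  return data.text as string;
}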

One practical detail: do not call the Whisper API directly from the client with your OpenAI key embedded in the app. Route it through your own backend endpoint. The key will end up in your binary otherwise, and that is a bad day waiting to happen.
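A minimal sketch of the client side of that proxy, assuming a hypothetical backend route (`https://api.example.com/voice/transcribe`) that authenticates the user with a session token and attaches the OpenAI key on the server. The URL, the `sessionToken` parameter, and the function name are illustrative and not part of any library:

```typescript
// Shape React Native's FormData accepts for a file part.
interface FilePart {
  uri: string;
  type: string;
  name: string;
}

interface BackendUpload {
  url: string;
  headers: Record<string, string>;
  file: FilePart;
}

// Hypothetical endpoint on your own server. It forwards the audio to
// OpenAI and adds the OpenAI key there, so the key never ships in the app.
const TRANSCRIBE_URL = 'https://api.example.com/voice/transcribe';

function buildBackendUpload(
  recordingUri: string,
  sessionToken: string
): BackendUpload {
  if (!recordingUri) throw new Error('No recording URI');
  return {
    url: TRANSCRIBE_URL,
    // The user's own session token, not a provider secret.
    headers: { Authorization: 'Bearer ' + sessionToken },
    file: { uri: recordingUri, type: 'audio/m4a', name: 'recording.m4a' },
  };
}
```

On the client you would append `upload.file` to a `FormData` and `fetch(upload.url, ...)` the same way `transcribeWithWhisper` does, just without any OpenAI credential in the request.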

Approach 2: On-device (Apple Speech + Android SpeechRecognizer)

Both iOS and Android ship with built-in speech recognition that runs partially or fully on-device. The @react-native-voice/voice library wraps both platform APIs behind a consistent JavaScript interface. The main advantage is latency: because the model runs locally (or uses a very low-latency platform endpoint Apple and Google maintain), you get partial results while the user is still speaking. You can update a text field in real time as words come in, which makes the experience feel immediate.

The tradeoff is accuracy and consistency. Apple's on-device model is solid for standard English but struggles with domain-specific vocabulary, heavy accents, and anything that is not a major language. Android's SpeechRecognizer is inconsistent across manufacturers and Android versions. You will encounter devices where it silently fails with no useful error. You need defensive code around every call.

The library setup is straightforward, but silence detection is where most implementations fall apart. By default, the platform APIs stop listening after a short pause. That is fine for short commands but breaks for anything conversational. You need to handle the onSpeechEnd event and restart listening if you want continuous capture, which introduces its own complexity around deduplication and final result detection.

import Voice, {
  SpeechResultsEvent,
  SpeechErrorEvent,
} from '@react-native-voice/voice';
import { useEffect, useState } from 'react';

export function useVoiceRecognition() {
  const [transcript, setTranscript] = useState('');
  const [isListening, setIsListening] = useState(false);

  useEffect(() => {
    // Partial results arrive while the user is still speaking.
    Voice.onSpeechPartialResults = (e: SpeechResultsEvent) => {
      setTranscript(e.value?.[0] ?? '');
    };

    // The final result replaces whatever partial text was showing.
    Voice.onSpeechResults = (e: SpeechResultsEvent) => {
      setTranscript(e.value?.[0] ?? '');
    };

    Voice.onSpeechEnd = () => {
      setIsListening(false);
    };

    Voice.onSpeechError = (e: SpeechErrorEvent) => {
      console.warn('Speech error:', e.error);
      setIsListening(false);
    };

    return () => {
      // Release the native recognizer and detach handlers on unmount.
      Voice.destroy().then(Voice.removeAllListeners);
    };
  }, []);

  const start = async () => {
    setTranscript('');
    setIsListening(true);
    try {
      await Voice.start('en-US');
    } catch (err) {
      // Some devices fail here without a useful error; reset the UI state.
      console.warn('Could not start speech recognition:', err);
      setIsListening(false);
    }
  };

  const stop = async () => {
    await Voice.stop();
    setIsListening(false);
  };

  return { transcript, isListening, start, stop };
}
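The hook above ends the session at the first `onSpeechEnd`. For the continuous capture described earlier, you would restart listening from that handler and stitch the segments together. Below is a rough sketch of the stitching part only, as a plain helper with no library calls. The class and method names are hypothetical, and skipping exact repeats is a deliberately simple deduplication rule: it will also drop a phrase the user really did say twice in a row.

```typescript
// Accumulates finalized speech segments across listen/restart cycles.
class TranscriptAccumulator {
  private segments: string[] = [];
  private live = '';

  // Call from the partial-results handler: replaces the in-progress text.
  setPartial(text: string): void {
    this.live = text.trim();
  }

  // Call when a segment is final (for example just before restarting).
  // Skips empty segments and exact repeats of the previous segment.
  commit(text: string): void {
    const clean = text.trim();
    const last = this.segments[this.segments.length - 1];
    if (clean && clean !== last) this.segments.push(clean);
    this.live = '';
  }

  // Full text to display: committed segments plus the live partial.
  getText(): string {
    return this.segments.concat(this.live).filter(Boolean).join(' ');
  }
}
```

In the hook, `onSpeechPartialResults` would call `setPartial`, `onSpeechResults` would call `commit`, and the component would render `getText()`.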

Approach 3: Hybrid (on-device for speed, Whisper for accuracy)

The hybrid approach uses the platform APIs for the live transcription display, giving the user immediate feedback as they speak, and then sends the final audio to Whisper for a high-accuracy pass before submitting to your LLM. The user sees their words appearing in real time, which feels responsive, and your backend processes the higher-quality Whisper transcription.

This is the right choice for health apps, mental wellness check-ins, or anything where the transcription accuracy directly affects the quality of the AI response. A single misheard word in a mood check-in can produce a completely wrong AI reply. The extra round trip Whisper adds is largely hidden from the user because they are already reading the live transcript.
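One way to wire the two passes together is to treat the Whisper result as preferred but optional, falling back to the on-device text if the cloud call fails or comes back empty. The sketch below is an assumption about how you might structure that decision; the type and function names are hypothetical.

```typescript
// Result of the Whisper pass, as your own code would report it.
type WhisperOutcome =
  | { ok: true; text: string }
  | { ok: false; reason: string };

interface FinalTranscript {
  text: string;
  source: 'whisper' | 'on-device';
}

// Prefer the high-accuracy transcript; fall back to the live one.
function chooseFinalTranscript(
  liveTranscript: string,
  whisper: WhisperOutcome
): FinalTranscript {
  if (whisper.ok && whisper.text.trim().length > 0) {
    return { text: whisper.text.trim(), source: 'whisper' };
  }
  return { text: liveTranscript.trim(), source: 'on-device' };
}
```

Keeping the `source` on the result lets you log how often the fallback is used, which is a useful signal for whether the cloud pass is earning its cost.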

For privacy-sensitive health use cases specifically, you should also consider whether you want audio leaving the device at all. Apple's speech recognition can be forced to run fully locally on iOS 13+ by setting requiresOnDeviceRecognition to true on the recognition request, on devices and locales that support on-device recognition. The accuracy drop is real but acceptable for many use cases, and the privacy story is much cleaner: you can honestly tell users their audio never leaves their phone.

The latency problem and how to handle it in your UI

Latency is the thing that kills voice features in otherwise good apps. The failure mode is always the same: user finishes speaking, nothing happens for two seconds, user taps the button again thinking it did not register, now you have a double submission. Or the user just gives up and types.

The fix is not making your API calls faster. It is giving the user something to look at during the wait. A static spinner is the worst option because it gives no signal about progress. A waveform animation that reacts to the recording state feels active. A live transcript updating character by character (even if it is the lower-accuracy platform output) makes the wait feel shorter because something is happening.

The practical pattern that works well: show a pulsing waveform animation during recording, switch to a processing state with a progress indicator immediately when the user stops speaking, then transition to the response. The waveform animation signals that the app is listening. The transition away from it signals that the app heard something and is working on it. Users will wait several seconds for a response if they believe the system is processing. They will not wait two seconds if they are not sure the app registered their input.
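That pattern maps naturally onto a small state machine. The following is a sketch under the assumption that your component drives its visuals from a single state value; the state and event names are made up for illustration. Ignoring unexpected events is also what prevents the double submission described above: a second mic press during processing simply does nothing.

```typescript
type VoiceUiState = 'idle' | 'recording' | 'processing' | 'responding';

type VoiceUiEvent =
  | 'PRESS_MIC'
  | 'STOP_SPEAKING'
  | 'FIRST_TOKEN'
  | 'RESPONSE_DONE'
  | 'ERROR';

// Pure transition function: given the current state and an event,
// return the next state. Events that do not apply leave the state as-is.
function nextVoiceUiState(
  state: VoiceUiState,
  event: VoiceUiEvent
): VoiceUiState {
  if (event === 'ERROR') return 'idle';
  switch (state) {
    case 'idle':
      return event === 'PRESS_MIC' ? 'recording' : state;
    case 'recording':
      // Leaving 'recording' swaps the waveform for the progress indicator.
      return event === 'STOP_SPEAKING' ? 'processing' : state;
    case 'processing':
      // A repeated PRESS_MIC is ignored here, so nothing is submitted twice.
      return event === 'FIRST_TOKEN' ? 'responding' : state;
    case 'responding':
      return event === 'RESPONSE_DONE' ? 'idle' : state;
    default:
      return state;
  }
}
```

The component would render the pulsing waveform for `recording`, the progress indicator for `processing`, and the streamed text for `responding`.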

For the AI response side, streaming matters. If your LLM call returns a full response before displaying anything, you are adding artificial latency. Stream the text tokens as they arrive and render them incrementally. The first word appearing 400ms after the transcription finishes feels fast even if the full response takes three seconds.

Text-to-speech: ElevenLabs vs. native platform TTS

Native TTS is free, available offline, requires zero integration work with expo-speech, and sounds robotic. ElevenLabs produces voices that pass for human in blind listening tests, costs money per character, and adds a network round-trip. The question is what your app is actually doing with the spoken output.

For utility features, status readbacks, and anything where the content matters more than the delivery, use native TTS. It is fast, it works offline, and users are not going to complain that their calendar reminder sounds synthetic. expo-speech wraps the platform APIs cleanly:

import * as Speech from 'expo-speech';

function speak(text: string, language = 'en-US') {
  Speech.speak(text, {
    language,
    rate: 0.9,
    onDone: () => console.log('Done speaking'),
    onError: (error) => console.error('TTS error:', error),
  });
}

function stopSpeaking() {
  Speech.stop();
}

ElevenLabs is worth the cost in two scenarios. The first is when voice is a core part of your product experience rather than a utility feature: an AI companion app, a language learning app where the pronunciation model matters, or a wellness app where a warm voice is part of the therapeutic design. The second is when you have a branded voice that is part of your product identity and you want it to sound consistent across every interaction.

The ElevenLabs integration is a straightforward REST call. You POST text to their API, get back an audio buffer, and play it with expo-av. The latency for generating a typical response (two to four sentences) is around 500ms to 1s. You can reduce perceived latency by streaming their audio response and starting playback before the full generation is complete, which their API supports.
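As a rough sketch, the request could be assembled like this. The endpoint path and the `xi-api-key` header follow the ElevenLabs text-to-speech API as I understand it, and `eleven_multilingual_v2` is used as an example model id; check the current API reference before relying on either. The function name is hypothetical, and as with the OpenAI key, the call belongs on your backend rather than in the app.

```typescript
interface ElevenLabsRequest {
  url: string;
  method: 'POST';
  headers: Record<string, string>;
  body: string;
}

// Builds the HTTP request for a text-to-speech call.
// Pass stream = true to target the streaming variant of the endpoint.
function buildElevenLabsRequest(
  text: string,
  voiceId: string,
  apiKey: string,
  stream = false
): ElevenLabsRequest {
  const base =
    'https://api.elevenlabs.io/v1/text-to-speech/' +
    encodeURIComponent(voiceId);
  return {
    url: stream ? base + '/stream' : base,
    method: 'POST',
    headers: {
      'xi-api-key': apiKey,
      'Content-Type': 'application/json',
      Accept: 'audio/mpeg',
    },
    // model_id is an example value; pick the model that fits your use case.
    body: JSON.stringify({ text, model_id: 'eleven_multilingual_v2' }),
  };
}
```

You would pass the result to `fetch(request.url, request)` on the server, then return or stream the audio bytes to the app for playback with expo-av.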

One practical note: cache TTS output for repeated phrases. If your app has a greeting it speaks on every session open, generate that audio once, store it locally, and play the file instead of making an API call. This eliminates latency and API costs for your highest-frequency utterances.
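A simple version of that cache needs only a stable key per voice and phrase. The sketch below keeps the index in memory and leaves the actual file writing to you (for example with expo-file-system); the class, the function names, and the `.mp3` naming scheme are assumptions made for illustration.

```typescript
// djb2-style string hash rendered as 8 hex digits.
// Small and dependency-free; fine for cache keys, not for security.
function hashText(input: string): string {
  let h = 5381;
  for (let i = 0; i < input.length; i++) {
    h = ((h * 33) ^ input.charCodeAt(i)) >>> 0;
  }
  return ('0000000' + h.toString(16)).slice(-8);
}

// Same voice + same phrase (ignoring extra whitespace) => same key.
function ttsCacheKey(voiceId: string, text: string): string {
  const normalized = text.trim().replace(/\s+/g, ' ');
  return voiceId + '-' + hashText(normalized) + '.mp3';
}

// Maps cache keys to local file URIs of previously generated audio.
class TtsFileCache {
  private files: Record<string, string> = {};

  lookup(voiceId: string, text: string): string | undefined {
    return this.files[ttsCacheKey(voiceId, text)];
  }

  store(voiceId: string, text: string, fileUri: string): void {
    this.files[ttsCacheKey(voiceId, text)] = fileUri;
  }
}
```

Before calling the TTS API, check `lookup`; on a miss, generate the audio, save it to disk, and `store` the resulting URI. To survive app restarts you would also persist the index, or simply test whether the keyed file exists.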

Real apps that get voice right

Looking at apps that have shipped voice well is more useful than abstract advice. Three examples worth studying:

Otter.ai's mobile recorder is the best example of progressive disclosure in a voice interface. It shows a live waveform, updates the transcript in real time as you speak, and highlights the current word. The accuracy is not perfect during recording, but it does not matter because users understand they are watching a live draft. The final cleaned transcript arrives after the session. Users have correct expectations because the UI communicates the two-phase model.

Superhuman's voice reply feature takes the opposite approach: it waits for you to finish, shows a brief processing state, then drops in a polished draft. No live transcript, no partial results. This works because the use case is composing email replies, not real-time conversation. Users are comfortable waiting three seconds for a good draft. The UX matches the use case.

Health check-in flows in apps like Woebot and similar mental health tools use voice carefully. They do not try to be fast. They use voice to reduce friction for users who are in an emotional state and find typing difficult. The design prioritizes accuracy and sensitivity over speed. They always show the transcription before submitting it, giving users a chance to correct it. This is the right call for any voice input that goes into a health record or affects a clinical recommendation.

Connecting voice input to your AI backend with streaming responses

The transcription is only half the pipeline. Once you have text from the user's voice, you need to route it through your LLM and get a response back efficiently. This is where streaming matters most, because the gap between transcription completing and the first word of the response appearing is the moment users decide whether the feature feels fast or slow.

AI Mobile Launcher ships with a Claude API integration that handles streaming out of the box. The pattern is to pipe the transcribed text directly into the same message handler you use for text input, accumulate streaming tokens into local state, and render them as they arrive. The voice-specific additions are: marking the message as voice-originated (useful for analytics and potentially for adjusting response length), and optionally piping the response text to your TTS module as sentence-sized chunks arrive.

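// Simplified streaming voice response handler.
// Assumes `anthropic` (an Anthropic SDK client), `conversationHistory`,
// `setCurrentResponse`, and `speakChunk` are defined elsewhere in your app.
async function handleVoiceInput(transcript: string) {
  // Track the voice origin in your own state or analytics. The Messages
  // API expects only `role` and `content` on each message, so extra
  // fields such as `source` should not be sent.
  const messages = [
    ...conversationHistory,
    { role: 'user' as const, content: transcript },
  ];

  let fullText = ''; // everything received so far, for display
  let sentence = ''; // text not yet handed to TTS

  const stream = anthropic.messages.stream({
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 512,
    messages,
  });

  for await (const chunk of stream) {
    if (
      chunk.type === 'content_block_delta' &&
      chunk.delta.type === 'text_delta'
    ) {
      fullText += chunk.delta.text;
      sentence += chunk.delta.text;
      // Render the whole response so far, not just the current sentence.
      setCurrentResponse(fullText);

      // Speak sentence-by-sentence as they complete
      if (
        sentence.endsWith('.') ||
        sentence.endsWith('?') ||
        sentence.endsWith('!')
      ) {
        speakChunk(sentence);
        sentence = '';
      }
    }
  }

  // Speak whatever is left if the response did not end on punctuation.
  if (sentence.trim()) speakChunk(sentence);
}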
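Checking only the end of the accumulated text misses sentence boundaries that arrive in the middle of a chunk (for example a delta like `"fine. How"`). A slightly more robust option is a small splitter that you feed each delta and that hands back any sentences it completed. This is a sketch with a hypothetical class name; it treats punctuation followed by whitespace as a boundary, so "3.5" stays intact but abbreviations like "Dr. Smith" will still be split.

```typescript
// Splits a stream of text deltas into complete sentences for TTS.
class SentenceChunker {
  private pending = '';

  // Feed one streamed text delta; returns any sentences it completed.
  push(delta: string): string[] {
    this.pending += delta;
    const sentences: string[] = [];
    // Shortest prefix ending in . ! or ? that is followed by whitespace.
    const boundary = /^([\s\S]*?[.!?]+)\s+/;
    let match = boundary.exec(this.pending);
    while (match) {
      sentences.push(match[1].trim());
      this.pending = this.pending.slice(match[0].length);
      match = boundary.exec(this.pending);
    }
    return sentences;
  }

  // Call once the stream ends to get whatever text is left.
  flush(): string {
    const rest = this.pending.trim();
    this.pending = '';
    return rest;
  }
}
```

Inside the streaming loop you would call `push(chunk.delta.text)` and pass each returned sentence to your TTS function, then speak the result of `flush()` after the loop if it is non-empty.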

One nuance worth calling out: voice responses should be shorter than text responses. A three-paragraph reply that works fine as text is exhausting to listen to. If your prompt does not instruct the model to keep voice responses concise, it will not. Add an explicit instruction to your system prompt when you know the input came from voice: something like "the user sent this via voice. Keep your response under 100 words and conversational in tone." This makes the spoken output feel like a dialogue rather than a lecture.
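A small helper can apply that instruction only when the input came from voice. This is a sketch; the function and type names are hypothetical, and the instruction text is the one suggested above.

```typescript
type InputSource = 'voice' | 'text';

const VOICE_INSTRUCTION =
  'The user sent this via voice. Keep your response under 100 words ' +
  'and conversational in tone.';

// Appends the voice instruction to the base system prompt when needed.
function buildSystemPrompt(basePrompt: string, source: InputSource): string {
  const base = basePrompt.trim();
  if (source !== 'voice') return base;
  return base ? base + '\n\n' + VOICE_INSTRUCTION : VOICE_INSTRUCTION;
}
```

The result goes into the `system` parameter of the Messages API call, alongside `model`, `max_tokens`, and `messages`.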

Silence detection and knowing when the user is done

Silence detection is the unsexy part of voice integration that most tutorials skip over. Platform speech APIs stop recording after a short silence, typically around two seconds on iOS and one to three seconds on Android depending on the device. This is fine for short commands but breaks for any kind of natural speech where users pause to think.

The options are: let the platform decide (simplest, worst for conversational use cases), implement manual silence detection using audio amplitude monitoring, or give users explicit control with a hold-to-talk or tap-to-stop button.

Hold-to-talk is underrated. Users understand it intuitively from walkie-talkies and voice messages. It gives them explicit control over exactly what gets transcribed. It eliminates false positives from background noise. The pattern works well for any feature where the voice input is intentional and task-oriented rather than ambient or conversational. Tap-to-start plus tap-to-stop is a close second and works better for longer inputs where holding a button for thirty seconds is uncomfortable.

If you want automatic silence detection, use expo-av's metering API to monitor the audio power level in real time. When the level drops below a threshold and stays there for 1.5 to 2 seconds, stop the recording. This is more reliable than the platform's built-in detection and gives you control over the sensitivity.
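The threshold-and-hold logic can live in a small class that knows nothing about expo-av, which makes it easy to tune and test. This is a sketch: the class name is hypothetical and the -45 dB default is an assumed starting point you will need to adjust per device and environment. It assumes metering is enabled in the recording options (`isMeteringEnabled: true`) and that each status update carries a `metering` value in decibels, where 0 is the loudest and quieter input is more negative.

```typescript
// Decides when the user has been quiet long enough to stop recording.
class SilenceDetector {
  private silentSince: number | null = null;
  private thresholdDb: number;
  private holdMs: number;

  constructor(thresholdDb = -45, holdMs = 1500) {
    this.thresholdDb = thresholdDb;
    this.holdMs = holdMs;
  }

  // Feed one metering sample with its timestamp in milliseconds.
  // Returns true once the level has stayed below the threshold
  // for at least holdMs.
  update(levelDb: number, nowMs: number): boolean {
    if (levelDb >= this.thresholdDb) {
      this.silentSince = null; // speech resumed, reset the timer
      return false;
    }
    if (this.silentSince === null) this.silentSince = nowMs;
    return nowMs - this.silentSince >= this.holdMs;
  }
}
```

In the recording status callback you would call `detector.update(status.metering, Date.now())` whenever `status.metering` is defined, and stop the recording the first time it returns `true`.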

Add Voice AI with AI Mobile Launcher

AI Mobile Launcher's AI Pro tier includes a Claude API integration with streaming response handling that plugs directly into a voice input pipeline. The streaming setup means you can start rendering and optionally speaking the AI response as soon as the first tokens arrive, rather than waiting for the full completion. The boilerplate also includes the expo-av recording hooks, permission handling for both iOS and Android, and the message history management you need for multi-turn voice conversations. The architecture is already there. You add the voice feature on top of it rather than building the infrastructure first.