
Voice to Voice AI (Part 1)

Written by Manisha, Web Developer


Voice to LLM to Voice is a technology that allows you to have natural, spoken conversations with artificial intelligence systems. Rather than typing questions and reading responses, you simply speak to the AI as you would to another person, and it responds back using speech.


The process works like a sophisticated telephone conversation: you speak your question, the system understands what you've said, processes it through a powerful AI language model (like ChatGPT or Gemini), and then speaks the answer back to you in a natural voice.


We see the opportunities growing as this becomes faster and more natural, and as LLMs become more deeply woven into existing voice assistants:

Accessibility: For people who prefer not to type or who have visual impairments, verbal communication offers a way to interact with AI systems without barriers.

Hands-free convenience: You can ask questions where a keyboard isn’t available, such as while driving, cooking, or exercising, or when your hands are otherwise occupied.

Natural interaction: Speaking feels more intuitive than typing, making AI assistance feel like a natural, ongoing conversation rather than a computer interface.

Multitasking: Voice interaction allows you to continue other activities whilst getting information or assistance from AI.

What we’ve been playing with

The reason this is part 1 is that ultimately we want to retain the additional context carried by intonation, pitch, volume, pace and so on, but to start we wanted to build something that worked and that we could evolve. To achieve this we followed this basic approach (sketched in code after the list):

Audio Input: Start microphone and recording
Speech Recognition: Stream audio to AssemblyAI via WebSocket
Text Processing: Send transcription to Gemini/OpenAI for response
Speech Synthesis: Convert AI response to audio via ElevenLabs
Audio Output: Play generated speech while muting microphone
Loop: Return to listening state for next interaction
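To show how those steps hang together, here is a minimal orchestration sketch in TypeScript. The interface and function names are illustrative placeholders for the AssemblyAI, Gemini/OpenAI and ElevenLabs integrations listed below, not our actual implementation.

```typescript
// Illustrative interface for the pipeline stages. Concrete implementations
// would wrap AssemblyAI (speech-to-text), Gemini/OpenAI (text generation)
// and ElevenLabs (text-to-speech); these names are assumptions, not a real API.
interface VoicePipeline {
  transcribeUtterance(): Promise<string>;          // mic -> streaming STT -> final text
  generateReply(prompt: string): Promise<string>;  // text -> LLM -> reply text
  synthesizeSpeech(text: string): Promise<Buffer>; // reply text -> TTS audio
  playAudio(audio: Buffer): Promise<void>;         // play through the speakers
  setMicMuted(muted: boolean): void;               // stop the mic hearing the reply
}

// The basic listen -> think -> speak loop described above.
async function conversationLoop(pipeline: VoicePipeline): Promise<void> {
  while (true) {
    pipeline.setMicMuted(false);                              // Audio Input / Speech Recognition
    const userText = await pipeline.transcribeUtterance();
    const replyText = await pipeline.generateReply(userText); // Text Processing
    const audio = await pipeline.synthesizeSpeech(replyText); // Speech Synthesis
    pipeline.setMicMuted(true);                               // mute while the reply plays
    await pipeline.playAudio(audio);                          // Audio Output, then Loop
  }
}
```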

We selected a handful of libraries to make this a reality (a short wiring sketch follows the list):

Google Generative AI - Gemini API for text generation
OpenAI - Alternative AI provider for text generation
ElevenLabs - High-quality text-to-speech synthesis
WebSocket - Real-time communication with AssemblyAI
node-record-lpcm16 - Audio recording
play-sound - Audio playback functionality

Other interesting containerised cascade models like this are being produced, allowing modular extension of existing AI systems - Unmute.sh is one of the more interesting ones out there.

Next we want to explore how these other aspects of speech can enhance the context and, ultimately, the output from the LLMs. To do this we will consider the following (a configuration sketch follows the list):

Enable AssemblyAI's sentiment analysis - giving granular emotional context
Add disfluency detection - the identification of interruptions, hesitations, and irregularities
Implement speaker confidence scoring - how certain the speech-to-text transcription is
Integrate emotion detection - to ascertain human expression
Add speaking rate analysis - helps understand the speaker's emotional state, cognitive load, etc.
Pitch tracking - providing insights into speaking style
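Several of these signals are exposed as options on AssemblyAI's transcription API. As a hedged sketch (the parameter and response field names are our reading of the v2 REST API and should be checked against the current docs), enabling them might look like this:

```typescript
// Request a transcript with richer speech signals enabled. Parameter names
// (sentiment_analysis, disfluencies) and response fields are assumptions
// based on AssemblyAI's v2 REST API; verify against the current documentation.
const API_KEY = process.env.ASSEMBLYAI_API_KEY ?? "";

async function requestRichTranscript(audioUrl: string): Promise<string> {
  const res = await fetch("https://api.assemblyai.com/v2/transcript", {
    method: "POST",
    headers: { authorization: API_KEY, "content-type": "application/json" },
    body: JSON.stringify({
      audio_url: audioUrl,
      sentiment_analysis: true, // per-sentence emotional context
      disfluencies: true,       // keep "um", "uh" and hesitations in the transcript
    }),
  });
  const { id } = await res.json();
  return id; // poll GET /v2/transcript/{id} until its status is "completed"
}

// Once completed, each word in the transcript carries a confidence score,
// which is the starting point for the speaker confidence scoring idea above.
```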

Most importantly, we need to understand how to interpret these signals up front, and how we want to use them to inform the next layer of the voice interface to AI.
