Real-Time Voice Transcription in the Browser: Streaming Audio to Whisper over WebSockets
Master browser-native voice engineering. Process raw microphone audio in an AudioWorklet and stream chunks to Whisper over WebSockets for near-real-time transcription.

With the rise of voice-guided AI agents, real-time speech-to-text (STT) has become a crucial feature for modern web applications. Users expect instant, low-latency transcriptions that update as they speak, matching the seamless conversational experience of Siri or ChatGPT Voice.
Running speech recognition entirely in browser JavaScript is challenging, so this guide keeps the model on the server and lets the browser handle capture. We can build a low-latency streaming pipeline by:
1. Capturing microphone input using the browser's Web Audio API and an AudioWorklet.
2. Downsampling the raw float samples to 16kHz and converting them into lightweight 16-bit linear PCM buffers.
3. Streaming these binary audio chunks over a WebSocket connection as they are produced.
4. Transcribing the audio with OpenAI's Whisper model on the server and returning the text to the browser.
In this guide, we will implement this complete client-server voice streaming system step-by-step.
⚡ 1. The Real-Time Audio Pipeline
To stream microphone input continuously without freezing the user interface thread:
- We instantiate an AudioWorkletProcessor, which runs on the audio rendering thread rather than the main thread, to capture raw microphone buffers.
- The AudioWorklet downsamples from the browser's native sample rate (usually 44.1kHz or 48kHz) to 16kHz, the sample rate Whisper operates on, which also reduces the network payload.
- The main thread receives the downsampled buffers, packs them as binary arrays, and sends them over a WebSocket connection.
[Mic Input] ──> [AudioWorklet (audio rendering thread)] ──(16kHz PCM chunks)──> [Main JS Thread]
                                                                                        │
[Instant Text Transcript] <──(JSON text)── [WebSocket Server + Whisper] <───────────────┘
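To make the payload saving concrete, the arithmetic can be sketched as follows. The helper name `bytesPerSecond` is hypothetical, and the figures assume mono audio, 32-bit float samples coming out of the Web Audio API, and 16-bit PCM on the wire.

```javascript
// Hypothetical helper: raw audio bandwidth in bytes per second (mono by default)
function bytesPerSecond(sampleRate, bytesPerSample, channels = 1) {
  return sampleRate * bytesPerSample * channels;
}

const nativeFloat32 = bytesPerSecond(48000, 4); // 192000 bytes/s
const pcm16At16k = bytesPerSecond(16000, 2);    // 32000 bytes/s

console.log(nativeFloat32 / pcm16At16k); // 6
console.log(pcm16At16k * 3);             // 96000 bytes for 3 seconds of audio
```

Under these assumptions the downsampled 16-bit stream is six times smaller than the native float stream, and three seconds of it occupy 96,000 bytes.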
🏗️ 2. The AudioWorklet Downsampler (downsampler-processor.js)
An AudioWorklet runs on the audio rendering thread, separate from the main UI thread, so audio capture is far less likely to stutter or drop buffers while the main thread is busy with heavy rendering tasks.
Create a file named downsampler-processor.js:
```javascript
class DownsamplerProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.targetSampleRate = 16000;
    // Fractional read position carried across blocks, so the output rate
    // stays at exactly 16kHz even when the ratio does not divide a block evenly
    this.readPos = 0;
  }

  process(inputs, outputs, parameters) {
    const input = inputs[0];
    if (!input || !input[0]) return true;

    // Capture mono audio stream (left channel)
    const channelData = input[0];

    // Downsample input from native rate (e.g. 48kHz) to 16kHz.
    // `sampleRate` is a global provided by AudioWorkletGlobalScope.
    const downsampled = this.downsample(channelData, sampleRate, this.targetSampleRate);

    // Send downsampled Float32Array chunks to the main thread
    this.port.postMessage(downsampled);
    return true;
  }

  // Nearest-sample decimation: cheap, but it applies no low-pass filter,
  // so content above 8kHz can alias into the output.
  downsample(inputBuffer, sourceRate, targetRate) {
    if (sourceRate === targetRate) return inputBuffer;
    const step = sourceRate / targetRate;
    const picked = [];
    let pos = this.readPos;
    while (pos < inputBuffer.length) {
      picked.push(inputBuffer[Math.floor(pos)]);
      pos += step;
    }
    // Remember where the next block should start reading
    this.readPos = pos - inputBuffer.length;
    return Float32Array.from(picked);
  }
}

registerProcessor('downsampler-processor', DownsamplerProcessor);
```
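Picking single samples, as the processor above does, applies no low-pass filtering, so frequencies above the new 8kHz Nyquist limit can alias into the output. One common mitigation is to average each group of source samples. The function below is a minimal sketch of that idea: the name `downsampleAverage` is hypothetical, a box average is only a crude filter, and the function treats each buffer independently and assumes the target rate is not higher than the source rate.

```javascript
// Hypothetical alternative: average each window of source samples (a crude box filter).
// Assumes targetRate <= sourceRate.
function downsampleAverage(input, sourceRate, targetRate) {
  if (sourceRate === targetRate) return Float32Array.from(input);
  const ratio = sourceRate / targetRate;
  const length = Math.floor(input.length / ratio);
  const result = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    const start = Math.floor(i * ratio);
    const end = Math.min(input.length, Math.floor((i + 1) * ratio));
    let sum = 0;
    for (let j = start; j < end; j++) sum += input[j];
    result[i] = sum / (end - start);
  }
  return result;
}

// 48kHz -> 16kHz: every 3 input samples become 1 output sample
const out = downsampleAverage(new Float32Array([0, 3, 6, 1, 1, 1]), 48000, 16000);
console.log(Array.from(out)); // [3, 1]
```

For speech destined for a transcription model, a proper windowed filter would be cleaner still, but even this simple average removes some of the high-frequency energy that plain decimation folds back into the signal.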
💻 3. Implementing the Browser Audio Client
Now, let's write the client-side JavaScript to establish the WebSocket connection, spin up the Web Audio context, load our AudioWorklet, and stream PCM buffers.
```javascript
// Call this from a user gesture (e.g. a button click) so the browser
// allows the AudioContext to start.
async function startVoiceStreaming() {
  const socket = new WebSocket('wss://api.sachinsharma.dev/voice-stream');
  socket.binaryType = 'arraybuffer';

  socket.onopen = async () => {
    console.log("🚀 WebSocket Audio connection established!");
    await initMicrophoneStream(socket);
  };

  socket.onmessage = (event) => {
    const data = JSON.parse(event.data);
    if (data.transcript) {
      console.log("💬 Instant Transcript:", data.transcript);
      // Each message covers one audio window, so append instead of overwriting
      document.querySelector('#transcript-box').innerText += data.transcript + ' ';
    }
  };
}

async function initMicrophoneStream(socket) {
  // 1. Request microphone permissions
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: false });

  // 2. Initialize Web Audio Context (resume it if autoplay policy suspended it)
  const audioContext = new AudioContext();
  if (audioContext.state === 'suspended') await audioContext.resume();
  const source = audioContext.createMediaStreamSource(stream);

  // 3. Load the downsampler AudioWorklet file
  await audioContext.audioWorklet.addModule('downsampler-processor.js');

  // 4. Create Node instance from our registered processor
  const downsamplerNode = new AudioWorkletNode(audioContext, 'downsampler-processor');

  // 5. Connect Microphone to Downsampler. The processor writes no output,
  //    so connecting it to the destination plays silence.
  source.connect(downsamplerNode);
  downsamplerNode.connect(audioContext.destination);

  // 6. Capture downsampled chunks and stream over WebSocket
  downsamplerNode.port.onmessage = (event) => {
    const float32PCM = event.data;
    // Convert Float32Array to signed 16-bit PCM (Int16Array)
    const int16PCM = convertFloat32ToInt16(float32PCM);
    if (socket.readyState === WebSocket.OPEN) {
      socket.send(int16PCM.buffer);
    }
  };
}

function convertFloat32ToInt16(buffer) {
  const l = buffer.length;
  const buf = new Int16Array(l);
  for (let i = 0; i < l; i++) {
    // Clamp to [-1.0, 1.0] so out-of-range samples cannot overflow the 16-bit range
    const s = Math.max(-1, Math.min(1, buffer[i]));
    buf[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return buf;
}
```
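Each AudioWorklet callback handles a 128-frame render quantum, so the handler above sends a few hundred very small WebSocket messages per second. If that overhead matters, the chunks can be batched before sending. The class below is a sketch only: the name `Int16Batcher` is hypothetical, and the frame size of 1600 samples (100 ms at 16kHz) is an assumed value, not one taken from the code above.

```javascript
// Hypothetical batcher: collects Int16 chunks and emits fixed-size frames
class Int16Batcher {
  constructor(frameSize, onFrame) {
    this.frame = new Int16Array(frameSize);
    this.filled = 0;
    this.onFrame = onFrame;
  }

  push(chunk) {
    let offset = 0;
    while (offset < chunk.length) {
      // Copy as much of the chunk as fits into the current frame
      const n = Math.min(chunk.length - offset, this.frame.length - this.filled);
      this.frame.set(chunk.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.frame.length) {
        this.onFrame(this.frame.slice()); // hand out a copy, then reuse the frame
        this.filled = 0;
      }
    }
  }
}

// Usage: 1600 samples = 100 ms at 16kHz
const frames = [];
const batcher = new Int16Batcher(1600, (frame) => frames.push(frame));
for (let i = 0; i < 100; i++) batcher.push(new Int16Array(43));
console.log(frames.length); // 2
```

In the client, the `onFrame` callback would call `socket.send(frame.buffer)` rather than pushing to an array, trading up to 100 ms of extra latency for far fewer messages.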
🛡️ 4. Server-Side Integration (Node.js & Whisper)
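On the server, we receive binary Int16 PCM chunks over the WebSocket, accumulate them into an audio buffer, and send each window of roughly 3 seconds to OpenAI's Whisper API. The transcription endpoint expects an audio file in a supported container format rather than headerless PCM, so each window is wrapped in a WAV header before upload.
```javascript
import { WebSocketServer } from 'ws';
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const wss = new WebSocketServer({ port: 8080 });

const SAMPLE_RATE = 16000;
const WINDOW_BYTES = 96000; // ~3 seconds of 16kHz 16-bit mono audio

// Prepend a 44-byte WAV header to raw 16-bit mono PCM
function pcmToWav(pcm, sampleRate) {
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4);
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // audio format: PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

wss.on('connection', (ws) => {
  console.log("🎙️ Audio client connected!");
  let audioBuffer = Buffer.alloc(0);

  ws.on('message', async (message) => {
    // message is a binary buffer of Int16 PCM audio
    audioBuffer = Buffer.concat([audioBuffer, message]);

    // Send buffer to Whisper once we accumulate ~3 seconds of audio
    if (audioBuffer.length >= WINDOW_BYTES) {
      const tempBuffer = audioBuffer;
      audioBuffer = Buffer.alloc(0); // Flush buffer

      try {
        // Upload the window as an in-memory WAV file
        const transcription = await openai.audio.transcriptions.create({
          file: await toFile(pcmToWav(tempBuffer, SAMPLE_RATE), 'speech.wav'),
          model: 'whisper-1',
          language: 'en'
        });
        ws.send(JSON.stringify({ transcript: transcription.text }));
      } catch (err) {
        console.error("❌ Whisper API Error:", err);
      }
    }
  });
});
```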
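The handler above flushes the whole buffer after every request, so a word that straddles two windows can be cut in half. A sliding window that keeps a short overlap between consecutive windows is one way to soften this. The sketch below assumes a 3-second window and a 0.5-second overlap; the name `SlidingPcmWindow` and both sizes are illustrative, and merging the duplicated text that overlapping windows produce is left out.

```javascript
// Hypothetical sliding window over a raw PCM byte stream.
// overlapBytes must be smaller than windowBytes.
class SlidingPcmWindow {
  constructor(windowBytes, overlapBytes) {
    this.windowBytes = windowBytes;
    this.overlapBytes = overlapBytes;
    this.buffer = Buffer.alloc(0);
  }

  // Append a chunk; return every complete window that is now available
  push(chunk) {
    this.buffer = Buffer.concat([this.buffer, chunk]);
    const windows = [];
    while (this.buffer.length >= this.windowBytes) {
      windows.push(Buffer.from(this.buffer.subarray(0, this.windowBytes)));
      // Keep the tail of the emitted window so the next one overlaps it
      this.buffer = this.buffer.subarray(this.windowBytes - this.overlapBytes);
    }
    return windows;
  }
}

// 16kHz 16-bit mono: 96000 bytes = 3 s window, 16000 bytes = 0.5 s overlap
const pcmWindow = new SlidingPcmWindow(96000, 16000);
const first = pcmWindow.push(Buffer.alloc(100000));
console.log(first.length, pcmWindow.buffer.length); // 1 20000
```

Each returned window could then be wrapped and uploaded exactly like the fixed 3-second buffer, at the cost of transcribing the overlapping half second twice.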
🏁 5. Conclusion
By moving microphone capture onto an AudioWorklet thread, downsampling to 16kHz on the client, and streaming raw binary arrays over WebSockets, you get a responsive voice pipeline that keeps audio processing off the main UI thread. Because the server transcribes windows of about three seconds, transcripts arrive in near real time rather than word by word, which is still a solid foundation for conversational AI interfaces in the web browser.
