Real-Time Voice Transcription in the Browser: Streaming Audio to Whisper over WebSockets

Learn browser-native voice engineering: process raw microphone audio in an AudioWorklet and stream the chunks to Whisper over WebSockets for near-real-time transcription.

Sachin Sharma, Creator

Jun 1, 2026

With the rise of voice-guided AI agents, real-time speech-to-text (STT) has become a crucial feature for modern web applications. Users expect instant, low-latency transcriptions that update as they speak, matching the seamless conversational experience of Siri or ChatGPT Voice.

Running speech recognition entirely in browser JavaScript is challenging. Instead, we can build a low-latency streaming pipeline by:

1. Capturing microphone input using the browser's Web Audio API and AudioWorklet.
2. Downsampling the raw float samples to 16kHz and converting them to compact 16-bit linear PCM buffers.
3. Streaming these binary audio chunks in real time over a WebSocket connection.
4. Transcribing the audio with OpenAI's Whisper model on the server and returning the text as soon as each window is processed.

In this guide, we will implement this complete client-server voice streaming system step-by-step.

⚡ 1. The Real-Time Audio Pipeline

To stream microphone input continuously without freezing the user interface thread:

1. We instantiate an AudioWorkletProcessor, which runs on the audio rendering thread rather than the main thread, to capture raw microphone buffers.
2. The AudioWorklet downsamples the browser's native audio rate (usually 44.1kHz or 48kHz) to 16kHz (the rate Whisper works with) to reduce network payload.
3. The main thread receives the downsampled buffers, converts them to 16-bit PCM, and sends them as binary frames over a WebSocket connection.

[Mic Input] ──> [AudioWorklet (Audio Rendering Thread)] ──(16kHz PCM chunks)──> [Main JS Thread]
                                                                                │
[Instant Text Transcript] <──(JSON text)── [WebSocket Server + Whisper] <───────┘
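
To make the payload reduction concrete, here is a small arithmetic sketch. It assumes a mono stream and a 48kHz native rate; `bytesPerSecond` is a helper name introduced only for this illustration.

```javascript
// Bytes per second for an uncompressed mono PCM stream.
function bytesPerSecond(sampleRate, bytesPerSample) {
  return sampleRate * bytesPerSample;
}

// Native capture: 48kHz, 32-bit float samples (4 bytes each)
const nativeFloat32 = bytesPerSecond(48000, 4);

// After downsampling and conversion: 16kHz, 16-bit integer samples (2 bytes each)
const whisperInt16 = bytesPerSecond(16000, 2);

console.log(nativeFloat32);                // 192000
console.log(whisperInt16);                 // 32000
console.log(nativeFloat32 / whisperInt16); // 6
```

Under these assumptions the stream shrinks sixfold, to 32,000 bytes (256 kbit) per second.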

🏗️ 2. The AudioWorklet Downsampler (`downsampler-processor.js`)

An AudioWorklet runs on the audio rendering thread, separate from the main UI thread, so capture can continue without stutter or dropped buffers even while the main thread is busy with heavy rendering work.

Create a file named downsampler-processor.js:


```javascript
class DownsamplerProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.targetSampleRate = 16000;
    // `sampleRate` is a global in the worklet scope: the AudioContext's native rate
    this.ratio = sampleRate / this.targetSampleRate;
    this.readPos = 0; // fractional read position, carried across 128-frame blocks
    this.chunk = new Float32Array(1600); // batch 100 ms of 16kHz audio per message
    this.chunkLength = 0;
  }

  process(inputs, outputs, parameters) {
    const input = inputs[0];
    if (!input || !input[0]) return true;

    const channelData = input[0]; // Capture mono audio stream (first channel)

    // Downsample from the native rate (e.g. 48kHz) to 16kHz by keeping every
    // `ratio`-th sample. No low-pass filter is applied, so content above 8kHz
    // can alias; filter before decimating if audio quality matters.
    let pos = this.readPos;
    while (pos < channelData.length) {
      this.chunk[this.chunkLength++] = channelData[Math.floor(pos)];
      if (this.chunkLength === this.chunk.length) {
        // postMessage copies the array, so the chunk buffer can be reused
        this.port.postMessage(this.chunk);
        this.chunkLength = 0;
      }
      pos += this.ratio;
    }
    // Remember how far past this block the next output sample falls
    this.readPos = pos - channelData.length;
    return true;
  }
}

registerProcessor('downsampler-processor', DownsamplerProcessor);
```
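
A detail worth checking is what happens at block boundaries. `process()` receives audio in 128-frame blocks, and 128 is not a multiple of the 48kHz-to-16kHz ratio of 3, so a downsampler that rounds the output length independently for each block emits 43 samples per block, or 16,125 per second instead of 16,000. The sketch below runs outside the worklet and uses a hypothetical `createDecimator` helper (not part of the Web Audio API) to show how carrying the fractional read position between blocks keeps the output rate exact.

```javascript
// Hypothetical helper: nearest-sample decimation that remembers its position
// between calls, so block boundaries do not add or drop samples.
function createDecimator(sourceRate, targetRate) {
  const ratio = sourceRate / targetRate;
  let readPos = 0; // where the next output sample falls within the next block

  return function decimate(block) {
    const out = [];
    let pos = readPos;
    while (pos < block.length) {
      out.push(block[Math.floor(pos)]);
      pos += ratio;
    }
    readPos = pos - block.length; // carry the overshoot into the next block
    return Float32Array.from(out);
  };
}

// One second of 48kHz audio arrives as 375 blocks of 128 frames.
const decimate = createDecimator(48000, 16000);
let carried = 0;
for (let i = 0; i < 375; i++) {
  carried += decimate(new Float32Array(128)).length;
}

// Rounding each block independently, for comparison.
const perBlockRounded = 375 * Math.round(128 / 3);

console.log(carried);         // 16000
console.log(perBlockRounded); // 16125
```

The same idea applies to 44.1kHz input, where the ratio is fractional and the carried position is no longer an integer.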

💻 3. Implementing the Browser Audio Client

Now, let's write the client-side JavaScript to establish the WebSocket connection, spin up the Web Audio context, load our AudioWorklet, and stream PCM buffers.


```javascript
// Call this from a user gesture (e.g. a click handler): browsers may keep an
// AudioContext suspended until the user has interacted with the page.
async function startVoiceStreaming() {
  const socket = new WebSocket('wss://api.sachinsharma.dev/voice-stream');
  socket.binaryType = 'arraybuffer';

  socket.onopen = async () => {
    console.log("🚀 WebSocket Audio connection established!");
    try {
      await initMicrophoneStream(socket);
    } catch (err) {
      // getUserMedia rejects if the user denies access or no microphone is available
      console.error("❌ Could not start microphone capture:", err);
      socket.close();
    }
  };

  socket.onmessage = (event) => {
    // Transcripts arrive as JSON text frames
    const data = JSON.parse(event.data);
    if (data.transcript) {
      console.log("💬 Instant Transcript:", data.transcript);
      // Each message covers only the latest audio window, so append instead of overwriting
      document.querySelector('#transcript-box').textContent += data.transcript + ' ';
    }
  };
}

async function initMicrophoneStream(socket) {
  // 1. Request microphone permissions
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true, video: false });

  // 2. Initialize Web Audio Context
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);

  // 3. Load the downsampler AudioWorklet file
  await audioContext.audioWorklet.addModule('downsampler-processor.js');

  // 4. Create Node instance from our registered processor
  const downsamplerNode = new AudioWorkletNode(audioContext, 'downsampler-processor');

  // 5. Connect Microphone to Downsampler. The processor never writes to its
  //    outputs, so the connection to the destination plays only silence.
  source.connect(downsamplerNode);
  downsamplerNode.connect(audioContext.destination);

  // 6. Capture downsampled chunks and stream over WebSocket
  downsamplerNode.port.onmessage = (event) => {
    const float32PCM = event.data;

    // Convert Float32Array samples to signed 16-bit integers (16-bit linear PCM)
    const int16PCM = convertFloat32ToInt16(float32PCM);

    if (socket.readyState === WebSocket.OPEN) {
      socket.send(int16PCM.buffer);
    }
  };
}

function convertFloat32ToInt16(buffer) {
  const l = buffer.length;
  const buf = new Int16Array(l);
  for (let i = 0; i < l; i++) {
    // Clamp to [-1.0, 1.0] so out-of-range samples saturate instead of wrapping around
    const s = Math.max(-1, Math.min(1, buffer[i]));
    // Negative samples scale by 32768 and positive ones by 32767 to fill the Int16 range
    buf[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
  }
  return buf;
}
```
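
One assumption hidden in the conversion step is byte order: an `Int16Array` stores samples in the platform's native order, which is little-endian on virtually all current devices. If you prefer to make the wire format explicit, a `DataView` can write each sample as little-endian regardless of platform. The following is a minimal sketch; `floatTo16BitPCM` is a hypothetical name, not a function from the code above or from any library.

```javascript
// Hypothetical alternative converter: Float32 samples to explicitly
// little-endian 16-bit PCM, returned as an ArrayBuffer ready for socket.send().
function floatTo16BitPCM(samples) {
  const out = new ArrayBuffer(samples.length * 2);
  const view = new DataView(out);
  for (let i = 0; i < samples.length; i++) {
    // Saturate out-of-range input, then scale to the Int16 range
    const s = Math.max(-1, Math.min(1, samples[i]));
    const v = Math.round(s < 0 ? s * 0x8000 : s * 0x7FFF);
    view.setInt16(i * 2, v, true); // true = little-endian
  }
  return out;
}

const bytes = new Uint8Array(floatTo16BitPCM(Float32Array.of(0, 1, -1)));
console.log(Array.from(bytes)); // [0, 0, 255, 127, 0, 128]
```

On the receiving side, Node's `Buffer.readInt16LE` reads samples written this way.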

🛡️ 4. Server-Side Integration (Node.js & Whisper)

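On the server, we receive binary Int16 PCM chunks via WebSockets, accumulate them into an audio buffer, and send that buffer to OpenAI's Whisper API each time roughly three seconds of audio has arrived. The API transcribes uploaded audio files rather than a live stream, so transcripts come back one window at a time.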


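```javascript
import { WebSocketServer } from 'ws';
import OpenAI, { toFile } from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const wss = new WebSocketServer({ port: 8080 });

// The transcription endpoint expects an audio file format such as WAV, not
// headerless PCM, so prepend a 44-byte header for 16-bit mono audio.
function pcmToWav(pcm, sampleRate = 16000) {
  const header = Buffer.alloc(44);
  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);             // size of the fmt chunk
  header.writeUInt16LE(1, 20);              // audio format 1 = linear PCM
  header.writeUInt16LE(1, 22);              // channels: mono
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(sampleRate * 2, 28); // byte rate: 2 bytes per sample
  header.writeUInt16LE(2, 32);              // block align
  header.writeUInt16LE(16, 34);             // bits per sample
  header.write('data', 36);
  header.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([header, pcm]);
}

wss.on('connection', (ws) => {
  console.log("🎙️ Audio client connected!");
  let audioBuffer = Buffer.alloc(0);

  ws.on('message', async (message) => {
    // message is a binary buffer of Int16 PCM audio
    audioBuffer = Buffer.concat([audioBuffer, message]);

    // Send buffer to Whisper once we accumulate ~3 seconds of audio (96,000 bytes at 16kHz 16-bit mono)
    if (audioBuffer.length >= 96000) {
      const tempBuffer = audioBuffer;
      audioBuffer = Buffer.alloc(0); // Flush buffer

      try {
        // Wrap the PCM bytes as an in-memory WAV file and upload it for transcription
        const transcription = await openai.audio.transcriptions.create({
          file: await toFile(pcmToWav(tempBuffer), 'speech.wav', { type: 'audio/wav' }),
          model: 'whisper-1',
          language: 'en'
        });

        ws.send(JSON.stringify({ transcript: transcription.text }));
      } catch (err) {
        console.error("❌ Whisper API Error:", err);
      }
    }
  });
});
```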
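
The handler above cuts the audio at fixed boundaries, which can split a word between two requests. One common mitigation is a sliding window in which each request repeats the tail of the previous one. The sketch below shows only the buffering side of that idea; `createWindower` is a hypothetical helper, the 0.5-second overlap is an arbitrary choice, and merging the duplicated words in overlapping transcripts is left out.

```javascript
// Hypothetical helper: accumulates PCM bytes and emits fixed-size windows
// that each begin with the last `overlapBytes` of the previous window.
function createWindower(windowBytes, overlapBytes) {
  if (overlapBytes >= windowBytes) {
    throw new RangeError('overlapBytes must be smaller than windowBytes');
  }
  let pending = Buffer.alloc(0);

  return function push(chunk) {
    pending = Buffer.concat([pending, chunk]);
    const windows = [];
    while (pending.length >= windowBytes) {
      windows.push(pending.subarray(0, windowBytes));
      // Keep the tail so the next window repeats the overlap region
      pending = pending.subarray(windowBytes - overlapBytes);
    }
    return windows;
  };
}

// 3 s windows with 0.5 s of overlap at 16kHz, 16-bit mono (32,000 bytes per second)
const push = createWindower(96000, 16000);
const first = push(Buffer.alloc(100000)); // one full window; 20,000 bytes stay pending
const second = push(Buffer.alloc(76000)); // 20,000 + 76,000 = 96,000: one more window

console.log(first.length, second.length); // 1 1
```

Keep both sizes even so that a window never starts in the middle of a 16-bit sample.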

🏁 5. Conclusion

By moving microphone capture onto an AudioWorklet thread, downsampling to 16kHz on the client, and streaming raw binary arrays over WebSockets, you get a responsive voice pipeline. Audio processing stays off the UI thread, and transcripts arrive a few seconds after the words are spoken (the length of the audio window plus the API round trip), which is enough to build conversational AI interfaces directly in the web browser.

