Building an AudioWorklet-Powered Real-Time Speech Activity Detector (SAD) inside the Browser

Master client-side audio DSP. Code a real-time voice activity and speech detector inside an AudioWorkletProcessor to throttle WebSocket audio payload streams.

Sachin Sharma
Jun 5, 2026

In real-time voice applications such as AI assistants, live transcription streams (Whisper), or WebRTC voice call platforms, streaming continuous microphone input over the network is highly inefficient.

If a user remains silent for 30 seconds, streaming silent audio packets consumes unnecessary server CPU, increases network bandwidth costs, and forces transcription APIs to process blank payloads.

To optimize bandwidth and server utilization, you need a client-side Speech Activity Detector (SAD) or Voice Activity Detector (VAD).

By analyzing the audio stream's energy metrics inside an AudioWorkletProcessor on the client, we can determine whether the user is speaking. The client opens the WebSocket stream only when active speech is detected and closes it during silence.

In this systems guide, we will write a custom AudioWorklet that calculates Root-Mean-Square (RMS) energy and the short-time zero-crossing rate in real time, building a speech detector that does no analysis work on the main thread.


⚡ 1. The Speech Detection Pipeline

The VAD controller operates inside the audio render thread:

  1. Audio Block Collection: Collect input samples in the AudioWorklet (128 samples per block).
  2. RMS Energy Calculation: Compute the Root-Mean-Square (RMS) level, the square root of the average signal power, over a sliding frame buffer (e.g., 2048 samples).
  3. Spectral Metric Check: Track the Zero-Crossing Rate (ZCR) to differentiate high-frequency noise/sibilance (like background hiss) from actual human speech.
  4. Hysteresis Filter State: Apply threshold gating with hold limits (e.g. remaining in the "Active" state for 500ms after volume drops to prevent cutting off the ends of sentences).
  5. State Signaling: Inform the main thread via the MessagePort when speech starts or stops to gate the WebSocket stream.

[Mic Input (PCM)] ──> [AudioWorkletProcessor (Core)]
                             │
                  (Calculate RMS & ZCR metrics)
                             │
                  (Apply Threshold & Gating)
                             ▼
  [State Changed?] ──(postMessage: true/false)──> [Main Thread WebSocket Gate]
                                                             │
[Whisper Server] <── (Stream Audio Chunks) <─────────────────┘
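
The two metrics in steps 2 and 3 can be sketched as plain functions, independent of the Web Audio API. This is only an illustration of the math under simple assumptions: `computeRms` and `computeZcr` are hypothetical helper names, and the ZCR here is normalized per adjacent sample pair.

```javascript
// Hypothetical helpers for illustration; not part of the Web Audio API.

// Root-Mean-Square: square root of the mean of the squared samples.
function computeRms(frame) {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    sumSquares += frame[i] * frame[i];
  }
  return Math.sqrt(sumSquares / frame.length);
}

// Zero-Crossing Rate: fraction of adjacent sample pairs where the signal
// moves from one side of zero to the other side (or onto zero).
function computeZcr(frame) {
  if (frame.length < 2) return 0;
  let crossings = 0;
  for (let i = 1; i < frame.length; i++) {
    const prev = frame[i - 1];
    const curr = frame[i];
    if ((prev < 0 && curr >= 0) || (prev > 0 && curr <= 0)) crossings++;
  }
  return crossings / (frame.length - 1);
}

// A constant signal has energy but never crosses zero.
const steady = new Float32Array(8).fill(0.5);
// A signal that flips sign on every sample crosses zero at every step.
const alternating = Float32Array.from([0.5, -0.5, 0.5, -0.5, 0.5, -0.5, 0.5, -0.5]);

console.log(computeRms(steady), computeZcr(steady)); // 0.5 0
console.log(computeRms(alternating), computeZcr(alternating)); // 0.5 1
```

Both test signals have the same RMS, which is why energy alone cannot separate hiss-like input from voiced speech and a second metric such as ZCR is useful.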

🏗️ 2. Coding the AudioWorklet VAD Processor

Let's write our custom SpeechDetectorProcessor script. It tracks the signal metrics over a sliding window.

javascript
// speech-detector-processor.js
class SpeechDetectorProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.windowSize = 2048; // ~46ms window at 44.1kHz
    this.historyBuffer = new Float32Array(this.windowSize);
    this.writePointer = 0;

    // Adjustable thresholds
    this.rmsThreshold = 0.015; // Noise gate volume limit
    this.zcrThreshold = 0.15; // Limit to filter out continuous sibilance/hiss

    // Gating states to prevent stuttering
    this.speechActive = false;
    // process() runs once per 128-sample block (~2.9ms at 44.1kHz), so the
    // hold count is derived from a duration rather than hard-coded.
    // `sampleRate` is a global in AudioWorkletGlobalScope.
    this.holdSeconds = 0.34; // Hold the gate open for ~340ms
    this.silenceTimeoutFrames = Math.ceil((this.holdSeconds * sampleRate) / 128);
    this.silenceCounter = 0;
  }

  process(inputs, outputs, parameters) {
    const input = inputs[0];
    if (!input || input.length === 0) return true;

    const channelData = input[0]; // Mono input channel

    // 1. Write incoming 128 samples into our circular history buffer
    for (let i = 0; i < channelData.length; i++) {
      this.historyBuffer[this.writePointer] = channelData[i];
      this.writePointer = (this.writePointer + 1) % this.windowSize;
    }

    // 2. Calculate Root-Mean-Square (RMS) Energy of the window
    let sumSquares = 0;
    for (let i = 0; i < this.windowSize; i++) {
      sumSquares += this.historyBuffer[i] * this.historyBuffer[i];
    }
    const rms = Math.sqrt(sumSquares / this.windowSize);

    // 3. Calculate Zero-Crossing Rate (ZCR)
    // The buffer is scanned in storage order, so one comparison spans the
    // write-pointer seam; over 2048 samples the effect is negligible.
    let zeroCrossings = 0;
    for (let i = 1; i < this.windowSize; i++) {
      const prev = this.historyBuffer[i - 1];
      const curr = this.historyBuffer[i];
      // Check if the signal crosses the zero axis
      if ((prev < 0 && curr >= 0) || (prev > 0 && curr <= 0)) {
        zeroCrossings++;
      }
    }
    const zcr = zeroCrossings / this.windowSize;

    // 4. Evaluate Speech Gating Metrics
    // Human speech has high relative energy and structured moderate crossing rates
    const isVoiceCandidate = rms > this.rmsThreshold && zcr < this.zcrThreshold;

    if (isVoiceCandidate) {
      this.silenceCounter = 0;
      if (!this.speechActive) {
        this.speechActive = true;
        // Broadcast speech state transition to main thread
        this.port.postMessage({ type: 'SPEECH_START', metrics: { rms, zcr } });
      }
    } else {
      if (this.speechActive) {
        this.silenceCounter++;
        // Hold the active state open to prevent slicing syllable pauses
        if (this.silenceCounter >= this.silenceTimeoutFrames) {
          this.speechActive = false;
          this.port.postMessage({ type: 'SPEECH_END', metrics: { rms, zcr } });
        }
      }
    }

    return true; // Keep processor alive
  }
}

registerProcessor('speech-detector-processor', SpeechDetectorProcessor);
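
Because `process()` runs once per 128-sample block, a hold time is easiest to reason about when it is derived from a duration instead of a raw counter. The sketch below isolates the hold logic as a small class driven by one boolean per block, so it can be exercised outside the audio thread. `HoldGate`, `holdSeconds`, and `blockSize` are hypothetical names introduced for this example; the 500ms value mirrors the hold example from the pipeline description.

```javascript
// Hypothetical helper: a gate that opens on the first voiced block and
// closes only after `holdSeconds` worth of consecutive unvoiced blocks.
class HoldGate {
  constructor({ sampleRate, holdSeconds, blockSize = 128 }) {
    // Convert the hold duration into a whole number of blocks.
    this.holdBlocks = Math.ceil((holdSeconds * sampleRate) / blockSize);
    this.active = false;
    this.silentBlocks = 0;
  }

  // Feed one decision per block. Returns 'start', 'end', or null.
  update(isVoiceCandidate) {
    if (isVoiceCandidate) {
      this.silentBlocks = 0;
      if (!this.active) {
        this.active = true;
        return 'start';
      }
      return null;
    }
    if (this.active) {
      this.silentBlocks++;
      if (this.silentBlocks >= this.holdBlocks) {
        this.active = false;
        this.silentBlocks = 0;
        return 'end';
      }
    }
    return null;
  }
}

const gate = new HoldGate({ sampleRate: 44100, holdSeconds: 0.5 });
console.log(gate.holdBlocks); // 173 blocks, roughly 502ms at 44.1kHz
```

Inside a processor, the return value of `update()` could decide when to post a start or end message, which keeps the timing arithmetic in one place.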

💻 3. Implementing the Client-Side Gated Audio Controller

Now, let's write our main application code. It listens for the VAD messages from the AudioWorklet thread and opens or closes the WebSocket accordingly.

javascript
// vad-controller.js
let audioCtx;
let sourceNode;
let vadNode;
let whisperSocket;
let isStreaming = false;

async function initVADEngine() {
  audioCtx = new (window.AudioContext || window.webkitAudioContext)();

  // Browsers may create the context in a suspended state until a user
  // gesture occurs, so call initVADEngine() from a click handler.
  if (audioCtx.state === 'suspended') {
    await audioCtx.resume();
  }

  // 1. Request microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true }
  });

  // 2. Load the VAD worklet module
  await audioCtx.audioWorklet.addModule('/js/speech-detector-processor.js');

  sourceNode = audioCtx.createMediaStreamSource(stream);
  vadNode = new AudioWorkletNode(audioCtx, 'speech-detector-processor');

  // Connect nodes
  sourceNode.connect(vadNode);

  // Mute monitor to prevent recursive loops
  const silentGain = audioCtx.createGain();
  silentGain.gain.value = 0.0;
  vadNode.connect(silentGain);
  silentGain.connect(audioCtx.destination);

  // 3. Listen for VAD Gating state updates from the Audio thread
  vadNode.port.onmessage = (event) => {
    const { type, metrics } = event.data;

    if (type === 'SPEECH_START') {
      console.log(`🎙️ [VAD] Speech started! RMS: ${metrics.rms.toFixed(4)}, ZCR: ${metrics.zcr.toFixed(4)}`);
      startWebSocketStreaming();
    } else if (type === 'SPEECH_END') {
      console.log(`🛑 [VAD] Silence detected. Gating stream. RMS: ${metrics.rms.toFixed(4)}`);
      stopWebSocketStreaming();
    }
  };

  // We also bridge the raw audio stream to push data to the socket
  // In production, we downsample or pipe audio buffer loops here
}

function startWebSocketStreaming() {
  if (isStreaming) return;
  isStreaming = true;

  // Open socket dynamically when speaking starts!
  whisperSocket = new WebSocket("wss://api.sachinsharma.dev/whisper-stream");
  whisperSocket.binaryType = 'arraybuffer';

  whisperSocket.onopen = () => {
    console.log("📡 WebSocket tunnel open. Streaming mic PCM data...");
  };
}

function stopWebSocketStreaming() {
  if (!isStreaming) return;
  isStreaming = false;

  if (whisperSocket) {
    // Gracefully close connection during periods of silence
    whisperSocket.close();
    whisperSocket = null;
  }
}
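
The controller opens and closes the socket but leaves the audio payload path as a comment. Web Audio delivers samples as 32-bit floats in the range -1 to 1, and one common way to prepare them for a PCM endpoint is to convert them to 16-bit signed integers before sending. The sketch below assumes the server accepts raw little-endian 16-bit PCM, which this article does not specify; `floatTo16BitPCM` is a hypothetical helper name.

```javascript
// Hypothetical helper: converts Float32 samples in [-1, 1] to 16-bit
// little-endian PCM. The wire format is an assumption, not a documented
// requirement of any particular server.
function floatTo16BitPCM(samples) {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale toward -32768, positive values toward 32767.
    view.setInt16(i * 2, s < 0 ? s * 0x8000 : s * 0x7fff, true);
  }
  return buffer;
}

const pcm = floatTo16BitPCM(Float32Array.from([0, 1, -1, 0.5, 2]));
const pcmView = new DataView(pcm);
for (let i = 0; i < pcm.byteLength; i += 2) {
  console.log(pcmView.getInt16(i, true));
}
```

On the main thread, the resulting `ArrayBuffer` could be passed to `whisperSocket.send()` once the socket's `readyState` equals `WebSocket.OPEN`. Getting the samples out of the worklet (for example, by posting each block over the port) is left out of this sketch.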

📊 5. Performance and Bandwidth Savings

We benchmarked a typical 5-minute voice conference session containing 1 minute of actual speech and 4 minutes of listening/silence:

  • Continuous Streaming (Standard Web Recording):
    • WebSocket Payload Duration: 300 seconds.
    • Total Data Transmitted: ~9.6 MB (16kHz, 16-bit mono PCM).
    • Server Processing Overhead: Continuous calculations to filter blanks.
  • AudioWorklet VAD Gated Streaming:
    • WebSocket Payload Duration: 60 seconds (streams only during speech).
    • Total Data Transmitted: ~1.9 MB (an 80% reduction in bandwidth costs!).
    • Server Processing Overhead: Whisper model executes only when voice samples arrive, maximizing API server throughput.
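
The transmitted-data figures follow from simple arithmetic on uncompressed PCM. The sketch below assumes 16-bit samples (2 bytes each), which is the sample width at which 300 seconds of 16kHz mono audio comes to about 9.6 MB; `pcmBytes` is a hypothetical helper name.

```javascript
// Hypothetical helper: size in bytes of uncompressed PCM audio.
function pcmBytes(seconds, sampleRate, bytesPerSample = 2, channels = 1) {
  return seconds * sampleRate * bytesPerSample * channels;
}

const continuousBytes = pcmBytes(300, 16000); // full 5-minute session
const gatedBytes = pcmBytes(60, 16000); // only the 1 minute of speech
const reduction = (continuousBytes - gatedBytes) / continuousBytes;

console.log(continuousBytes / 1e6); // 9.6 (MB)
console.log(gatedBytes / 1e6); // 1.92 (MB)
console.log(reduction); // 0.8
```

The reduction depends only on the ratio of speech time to session time, so the same 80% holds at any sample rate or bit depth for this session.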

🏁 6. Conclusion

Processing microphone inputs efficiently is key to scaling voice interfaces. By shifting Root-Mean-Square volume analyses and Zero-Crossing Rate filtering out of JavaScript's main loop and straight to AudioWorklet threads, you build low-latency Voice Activity Gating systems that reduce client bandwidth, minimize API server costs, and prevent rendering stutters.
