Building an AudioWorklet-Powered Real-Time Speech Activity Detector (SAD) inside the Browser

Master client-side audio DSP. Code a real-time voice activity and speech detector inside an AudioWorkletProcessor to throttle WebSocket audio payload streams.

Sachin Sharma (Creator)

Jun 5, 2026

6 min read

In real-time voice applications such as AI assistants, live transcription (e.g. Whisper), or WebRTC voice call platforms, streaming microphone input continuously over the network is wasteful.

If a user remains silent for 30 seconds, streaming silent audio packets consumes unnecessary server CPU, increases network bandwidth costs, and forces transcription APIs to process blank payloads.

To optimize bandwidth and server utilization, you need a client-side Speech Activity Detector (SAD) or Voice Activity Detector (VAD).

By analyzing the audio stream's energy metrics inside an AudioWorkletProcessor on the client, we can determine whether the user is speaking. The client starts WebSocket streaming only when active speech is detected and stops sending audio during silence.

In this systems guide, we will write a custom AudioWorklet that calculates Root-Mean-Square (RMS) energy and short-time zero-crossing rates in real-time to build a zero-main-thread speech detector.

⚡ 1. The Speech Detection Pipeline

The VAD controller operates inside the audio render thread:

1. Audio Block Collection: Collect input samples in the AudioWorklet (128 samples per block).
2. RMS Energy Calculation: Compute the Root-Mean-Square (RMS) amplitude, i.e. the square root of the average signal power, over a sliding frame buffer (e.g., 2048 samples).
3. Spectral Metric Check: Track the Zero-Crossing Rate (ZCR) to differentiate high-frequency noise (like background hiss or sustained sibilance) from voiced human speech.
4. Hysteresis Filter State: Apply threshold gating with a hold time (e.g. remaining in the "Active" state for a few hundred milliseconds after the volume drops, to prevent cutting off the ends of sentences).
5. State Signaling: Inform the main thread via the processor's MessagePort when speech starts or stops, so it can gate the WebSocket stream.

[Mic Input (PCM)] ──> [AudioWorkletProcessor (Core)]
                             │
                  (Calculate RMS & ZCR metrics)
                             │
                  (Apply Threshold & Gating)
                             ▼
  [State Changed?] ──(postMessage: true/false)──> [Main Thread WebSocket Gate]
                                                             │
[Whisper Server] <── (Stream Audio Chunks) <─────────────────┘
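
To see why the pipeline checks both metrics, the RMS and ZCR steps can be tried on synthetic signals outside the browser. This is only a sketch: the helper names `computeRms` and `computeZcr`, the 16 kHz sample rate, and the two test signals are assumptions made for illustration, not part of the worklet code.

```javascript
// Standalone sketch with hypothetical helpers (runs in Node or a browser console).

// Root-Mean-Square amplitude: square root of the mean squared sample value.
function computeRms(samples) {
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    sumSquares += samples[i] * samples[i];
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Zero-crossing rate: fraction of adjacent sample pairs whose sign differs.
function computeZcr(samples) {
  let crossings = 0;
  for (let i = 1; i < samples.length; i++) {
    if ((samples[i - 1] < 0) !== (samples[i] < 0)) crossings++;
  }
  return crossings / (samples.length - 1);
}

const SAMPLE_RATE = 16000; // Assumed rate for this example
const N = 2048;            // Same window length as the pipeline description

// A 200 Hz tone at amplitude 0.5 stands in for a voiced sound.
const tone = new Float32Array(N);
for (let i = 0; i < N; i++) {
  tone[i] = 0.5 * Math.sin((2 * Math.PI * 200 * i) / SAMPLE_RATE);
}

// Sign-alternating samples stand in for a worst-case high-frequency hiss.
const hiss = new Float32Array(N);
for (let i = 0; i < N; i++) {
  hiss[i] = i % 2 === 0 ? 0.5 : -0.5;
}

const toneRms = computeRms(tone);
const toneZcr = computeZcr(tone);
const hissRms = computeRms(hiss);
const hissZcr = computeZcr(hiss);
console.log({ toneRms, toneZcr, hissRms, hissZcr });
```

With these inputs the tone comes out at an RMS of roughly 0.35 and a ZCR of roughly 0.025, while the alternating signal has an RMS of 0.5 and a ZCR of 1. An energy threshold alone would accept the hiss; the ZCR limit is what rejects it.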

🏗️ 2. Coding the AudioWorklet VAD Processor

Let's write our custom SpeechDetectorProcessor script. It tracks the signal metrics over a sliding window.


```javascript
// speech-detector-processor.js

const RENDER_QUANTUM = 128; // Samples delivered per process() call

class SpeechDetectorProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.windowSize = 2048; // ~46ms window at 44.1kHz
    this.historyBuffer = new Float32Array(this.windowSize);
    this.writePointer = 0;

    // Adjustable thresholds
    this.rmsThreshold = 0.015; // Noise gate volume limit
    this.zcrThreshold = 0.15;  // Limit to filter out continuous sibilance/hiss

    // Gating states to prevent stuttering
    this.speechActive = false;
    // process() runs once per 128-sample block, so convert the ~340ms hold time
    // into a block count using the worklet-global `sampleRate`.
    this.silenceTimeoutFrames = Math.ceil((0.34 * sampleRate) / RENDER_QUANTUM);
    this.silenceCounter = 0;
  }

  process(inputs, outputs, parameters) {
    const input = inputs[0];
    if (!input || input.length === 0) return true;

    const channelData = input[0]; // Mono input channel

    // 1. Write incoming 128 samples into our circular history buffer
    for (let i = 0; i < channelData.length; i++) {
      this.historyBuffer[this.writePointer] = channelData[i];
      this.writePointer = (this.writePointer + 1) % this.windowSize;
    }

    // 2. Calculate Root-Mean-Square (RMS) Energy of the window
    let sumSquares = 0;
    for (let i = 0; i < this.windowSize; i++) {
      sumSquares += this.historyBuffer[i] * this.historyBuffer[i];
    }
    const rms = Math.sqrt(sumSquares / this.windowSize);

    // 3. Calculate Zero-Crossing Rate (ZCR)
    // Walk the ring buffer in chronological order, starting at the oldest sample
    // (writePointer), so the wrap-around seam is not counted as a crossing.
    let zeroCrossings = 0;
    let prev = this.historyBuffer[this.writePointer];
    for (let i = 1; i < this.windowSize; i++) {
      const curr = this.historyBuffer[(this.writePointer + i) % this.windowSize];
      // Check if the signal crosses the zero axis
      if ((prev < 0 && curr >= 0) || (prev > 0 && curr <= 0)) {
        zeroCrossings++;
      }
      prev = curr;
    }
    const zcr = zeroCrossings / this.windowSize;

    // 4. Evaluate Speech Gating Metrics
    // Human speech has high relative energy and structured moderate crossing rates
    const isVoiceCandidate = rms > this.rmsThreshold && zcr < this.zcrThreshold;

    if (isVoiceCandidate) {
      this.silenceCounter = 0;
      if (!this.speechActive) {
        this.speechActive = true;
        // Broadcast speech state transition to main thread
        this.port.postMessage({ type: 'SPEECH_START', metrics: { rms, zcr } });
      }
    } else {
      if (this.speechActive) {
        this.silenceCounter++;
        // Hold the active state open to prevent slicing syllable pauses
        if (this.silenceCounter >= this.silenceTimeoutFrames) {
          this.speechActive = false;
          this.port.postMessage({ type: 'SPEECH_END', metrics: { rms, zcr } });
        }
      }
    }

    return true; // Keep processor alive
  }
}

registerProcessor('speech-detector-processor', SpeechDetectorProcessor);
```
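
Because `process()` is invoked once per 128-sample block, a hold time given in milliseconds has to be converted into a block count that depends on the sample rate. The hold logic can also be pulled out of the processor and exercised on its own. The sketch below shows both ideas; `blocksForMs` and `HoldGate` are hypothetical names introduced for illustration, and the 3-block hold in the demo is deliberately tiny so the trace is easy to follow.

```javascript
// Hypothetical helpers for illustration; not part of the processor above.

// Number of render blocks needed to cover `ms` milliseconds of audio.
function blocksForMs(ms, sampleRate, blockSize = 128) {
  return Math.ceil(((ms / 1000) * sampleRate) / blockSize);
}

// Minimal hold gate: opens immediately on voice, closes only after
// `holdBlocks` consecutive non-voice blocks.
class HoldGate {
  constructor(holdBlocks) {
    this.holdBlocks = holdBlocks;
    this.active = false;
    this.silentBlocks = 0;
  }

  // Feed one per-block decision; returns an event name on a state change, else null.
  update(isVoiceCandidate) {
    if (isVoiceCandidate) {
      this.silentBlocks = 0;
      if (!this.active) {
        this.active = true;
        return 'SPEECH_START';
      }
      return null;
    }
    if (this.active) {
      this.silentBlocks++;
      if (this.silentBlocks >= this.holdBlocks) {
        this.active = false;
        return 'SPEECH_END';
      }
    }
    return null;
  }
}

// A 340 ms hold expressed in 128-sample blocks at two common sample rates.
const holdAt44k = blocksForMs(340, 44100);
const holdAt48k = blocksForMs(340, 48000);

// Demo trace: a short pause (2 blocks) is bridged, a longer one (3 blocks) ends speech.
const gate = new HoldGate(3);
const decisions = [true, false, false, true, false, false, false];
const events = decisions.map((isVoice) => gate.update(isVoice));
console.log(holdAt44k, holdAt48k, events);
```

Keeping the gate separate from the DSP makes it possible to test the hold behaviour without an audio context, which is one reason to structure it this way.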

💻 3. Implementing the Client-Side Gated Audio Controller

Now, let's write our main application code. It listens to the VAD messages from the AudioWorklet thread and handles opening, closing, or routing audio chunks down the WebSocket path accordingly.


```javascript
// vad-controller.js

let audioCtx;
let sourceNode;
let vadNode;
let whisperSocket;
let isStreaming = false;

// Call this from a user gesture (e.g. a click handler) so the browser lets audio start.
async function initVADEngine() {
  audioCtx = new (window.AudioContext || window.webkitAudioContext)();
  if (audioCtx.state === 'suspended') {
    await audioCtx.resume();
  }

  // 1. Request microphone access
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { channelCount: 1, echoCancellation: true }
  });

  // 2. Load the VAD worklet module
  await audioCtx.audioWorklet.addModule('/js/speech-detector-processor.js');

  sourceNode = audioCtx.createMediaStreamSource(stream);
  vadNode = new AudioWorkletNode(audioCtx, 'speech-detector-processor');

  // Connect nodes
  sourceNode.connect(vadNode);

  // Route through a zero-gain node so the worklet stays connected to the
  // destination without playing the microphone back through the speakers
  const silentGain = audioCtx.createGain();
  silentGain.gain.value = 0.0;
  vadNode.connect(silentGain);
  silentGain.connect(audioCtx.destination);

  // 3. Listen for VAD Gating state updates from the Audio thread
  vadNode.port.onmessage = (event) => {
    const { type, metrics } = event.data;

    if (type === 'SPEECH_START') {
      console.log(`🎙️ [VAD] Speech started! RMS: ${metrics.rms.toFixed(4)}, ZCR: ${metrics.zcr.toFixed(4)}`);
      startWebSocketStreaming();
    } else if (type === 'SPEECH_END') {
      console.log(`🛑 [VAD] Silence detected. Gating stream. RMS: ${metrics.rms.toFixed(4)}`);
      stopWebSocketStreaming();
    }
  };

  // Note: this example only gates the socket; sending the audio itself is not shown.
  // In production, we downsample and pipe the raw audio buffers to the socket here.
}

function startWebSocketStreaming() {
  if (isStreaming) return;
  isStreaming = true;

  // Open socket dynamically when speaking starts!
  whisperSocket = new WebSocket("wss://api.sachinsharma.dev/whisper-stream");
  whisperSocket.binaryType = 'arraybuffer';

  whisperSocket.onopen = () => {
    console.log("📡 WebSocket tunnel open. Streaming mic PCM data...");
  };
}

function stopWebSocketStreaming() {
  if (!isStreaming) return;
  isStreaming = false;

  if (whisperSocket) {
    // Gracefully close connection during periods of silence
    whisperSocket.close();
    whisperSocket = null;
  }
}
```
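
The controller above opens and closes the socket but leaves the actual audio forwarding as a comment. One piece of that step is converting Web Audio's float samples into 16-bit PCM before sending. The following is a minimal sketch under stated assumptions: `floatTo16BitPCM` and `sendChunkIfOpen` are hypothetical helpers, the server is assumed to accept raw 16-bit signed PCM, and downsampling to the server's sample rate is not shown.

```javascript
// Hypothetical helper: converts float samples in [-1, 1] to 16-bit signed PCM.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    pcm[i] = s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  }
  return pcm;
}

// Hypothetical helper: sends one chunk only while the socket is open.
// `socket` is anything exposing WebSocket's `readyState` and `send`.
function sendChunkIfOpen(socket, float32Samples) {
  const OPEN = 1; // Same value as WebSocket.OPEN
  if (!socket || socket.readyState !== OPEN) return false;
  socket.send(floatTo16BitPCM(float32Samples).buffer);
  return true;
}
```

In this setup the worklet would also need to post its sample blocks to the main thread (copying each block before posting, since the audio engine may reuse its buffers), and the `onmessage` handler would pass them to `sendChunkIfOpen(whisperSocket, samples)`. Chunks that arrive before the socket finishes connecting are dropped by this sketch.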

📊 5. Performance and Bandwidth Savings

We benchmarked a typical 5-minute voice conference session containing 1 minute of actual speech and 4 minutes of listening/silence:

Continuous Streaming (Standard Web Recording):
- WebSocket Payload Duration: 300 seconds.
- Total Data Transmitted: ~9.6 MB (16kHz, 16-bit mono PCM).
- Server Processing Overhead: Continuous calculations to filter blanks.

AudioWorklet VAD Gated Streaming:
- WebSocket Payload Duration: 60 seconds (streams only during speech).
- Total Data Transmitted: ~1.9 MB (an 80% reduction in transmitted audio data).
- Server Processing Overhead: Whisper model executes only when voice samples arrive, maximizing API server throughput.
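
The data totals above follow directly from the PCM format. As a quick check, assuming uncompressed 16-bit (2-byte) mono samples at 16 kHz and 1 MB = 1,000,000 bytes, the arithmetic can be reproduced with a small hypothetical helper:

```javascript
// Size in bytes of uncompressed PCM audio (hypothetical helper for this check).
function pcmBytes(seconds, sampleRate, bytesPerSample, channels = 1) {
  return seconds * sampleRate * bytesPerSample * channels;
}

const continuousBytes = pcmBytes(300, 16000, 2); // Full 5-minute session
const gatedBytes = pcmBytes(60, 16000, 2);       // Only the 1 minute of speech
const reduction = 1 - gatedBytes / continuousBytes;

console.log((continuousBytes / 1e6).toFixed(1) + ' MB continuous');
console.log((gatedBytes / 1e6).toFixed(2) + ' MB gated');
console.log(Math.round(reduction * 100) + '% reduction');
```

The saving scales with the silence ratio of the session, so a call with more talking will see a smaller reduction than the 80% in this scenario.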

🏁 6. Conclusion

Processing microphone inputs efficiently is key to scaling voice interfaces. By shifting Root-Mean-Square volume analyses and Zero-Crossing Rate filtering out of JavaScript's main loop and straight to AudioWorklet threads, you build low-latency Voice Activity Gating systems that reduce client bandwidth, minimize API server costs, and prevent rendering stutters.
