
Deploying Whisper on the Edge: Real-Time Transcription with WebSockets and Sub-200ms Latency

Learn how to deploy OpenAI Whisper on the edge using Cloudflare Workers AI and fly.io GPU instances with WebSocket streaming for sub-200ms real-time transcription latency.

Sachin Sharma, Creator

Jun 7, 2026

18 min read

The cloud round-trip has always been the silent killer of voice UX. A user speaks, audio travels 150ms to a datacenter, queues behind other requests, runs inference, serializes JSON, and travels 150ms back. By then, you're already at 400–600ms — and that's on a good day. For live captioning, voice assistants, or meeting transcription tools, that's the difference between a feature that feels magical and one that feels broken.

In 2026, there are now three serious deployment paths for Whisper-based transcription: Cloudflare Workers AI (serverless, GPU-backed, 200+ edge PoPs), fly.io with GPU machines (persistent, faster-whisper, fully self-hosted), and Groq's LPU API (fastest cloud inference for Whisper-large-v3). I've built and benchmarked all three in production, and this post is the comprehensive guide I wish I had.

We will cover model variant selection, WebSocket protocol design, Voice Activity Detection (VAD) to avoid transcribing silence, speaker diarization basics, error handling for reconnecting clients, and a real cost analysis. Client and Worker code examples are in TypeScript, targeting the browser and the Cloudflare Workers runtime; the self-hosted server is in Python.

🎯 Why Edge Deployment Changes the Equation

Traditional Whisper API usage (OpenAI's endpoint) routes every request through a central US datacenter. If your users are in Mumbai, Berlin, or São Paulo, you are adding 120–200ms of pure network latency before a single GPU cycle runs.

Edge deployment solves this by running the inference closer to the user. Cloudflare's network spans 300+ data centers globally, and Workers AI serves inference from the GPU-equipped subset of them. A user in Singapore hits a PoP that is roughly 5ms away. Instead of paying 300ms in RTT, you pay about 5ms when a nearby GPU location serves the request. The total end-to-end latency for a streaming chunk drops from around 600ms to under 200ms.

The second benefit is throughput isolation. On a shared API, a spike in traffic from another customer means your requests queue. On a self-hosted fly.io instance, you control the queue, the concurrency, and the GPU reservation.

CENTRALIZED API FLOW (OpenAI)
┌────────┐    300ms RTT    ┌───────────────┐
│ Client │ ──────────────► │ US Datacenter │
│(Mumbai)│ ◄────────────── │  Whisper API  │
└────────┘                 └───────────────┘
Total latency: ~500–700ms per audio chunk

EDGE FLOW (Cloudflare Workers AI)
┌────────┐     5ms RTT     ┌────────────────────┐
│ Client │ ──────────────► │ CF PoP (Mumbai)    │
│(Mumbai)│ ◄────────────── │  Whisper Worker    │
└────────┘                 └────────────────────┘
Total latency: ~80–180ms per audio chunk
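
The two totals above are sums of a few components. As a rough sketch, the budget can be written out explicitly. Every per-component figure here is an illustrative assumption chosen to land inside the ranges quoted above, not a measurement; only the 300ms and 5ms round trips come from the diagrams.

```typescript
// Illustrative latency budget. Queue, inference, and serialization values
// are assumptions for the sake of the arithmetic, not benchmark results.
interface LatencyBudget {
  networkRttMs: number;    // client <-> inference endpoint round trip
  queueMs: number;         // time spent waiting behind other requests
  inferenceMs: number;     // model execution time for one chunk
  serializationMs: number; // encoding and framing the response
}

function totalLatencyMs(b: LatencyBudget): number {
  return b.networkRttMs + b.queueMs + b.inferenceMs + b.serializationMs;
}

const centralized: LatencyBudget = {
  networkRttMs: 300,
  queueMs: 100,
  inferenceMs: 150,
  serializationMs: 10,
};

const edge: LatencyBudget = {
  networkRttMs: 5,
  queueMs: 20,
  inferenceMs: 100,
  serializationMs: 10,
};

console.log(totalLatencyMs(centralized)); // 560
console.log(totalLatencyMs(edge));        // 135
```

Writing the budget this way makes it clear that the network term dominates the centralized path, while inference dominates the edge path.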

📦 Choosing the Right Whisper Variant for Edge Constraints

OpenAI released Whisper with five model sizes: tiny, base, small, medium, and large. For edge deployment, the choice is not purely about accuracy — it is about the latency/accuracy tradeoff under memory and compute constraints.

Model	Parameters	VRAM	WER (en)	RTF*	Best For
tiny	39M	~1GB	~14%	0.05x	Low-power edge, keyword spotting
base	74M	~1.5GB	~9%	0.09x	Browser WASM, Cloudflare Workers AI
small	244M	~2.3GB	~6%	0.25x	Cloudflare Workers AI (default)
medium	769M	~5GB	~4%	0.55x	fly.io GPU instances
large-v3	1.5B	~10GB	~2.7%	1.1x	High-accuracy self-hosted / Groq

*RTF = Real-Time Factor. 0.09x means the model processes audio 11x faster than real-time.
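
To make the RTF column concrete, here is a small sketch of the arithmetic. The function names are hypothetical helpers for illustration, not part of any Whisper library.

```typescript
// Real-Time Factor = processing time / audio duration.
// An RTF below 1 means the model finishes faster than the audio plays.

function processingSeconds(audioSeconds: number, rtf: number): number {
  return audioSeconds * rtf;
}

function speedupOverRealTime(rtf: number): number {
  return 1 / rtf;
}

function keepsUpWithLiveAudio(rtf: number): boolean {
  return rtf < 1;
}

// base at 0.09x is roughly 11x faster than real time
console.log(speedupOverRealTime(0.09).toFixed(1)); // "11.1"

// At 1.1x (the large-v3 row) the model falls behind a live stream
console.log(keepsUpWithLiveAudio(1.1)); // false
```

For streaming, the practical rule is that RTF must stay well below 1 on your hardware, otherwise the buffer of untranscribed audio grows without bound.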

For most production applications, whisper-small on Cloudflare Workers AI gives the best balance. For transcription where accuracy is paramount (legal, medical), use whisper-large-v3 on a fly.io L40S GPU.

faster-whisper deserves a special mention. It is a CTranslate2-based reimplementation that runs up to 4x faster than the original PyTorch implementation and uses roughly half the memory, with the same accuracy. On a fly.io A10 GPU, faster-whisper with large-v3 achieves 0.2x RTF, meaning it processes 30 seconds of audio in 6 seconds, all while streaming partial results back over WebSocket.

⚡ Cloudflare Workers AI + Whisper: Setup and Streaming

Cloudflare Workers AI exposes Whisper via @cf/openai/whisper (base) and @cf/openai/whisper-large-v3-turbo (recommended). The turbo variant is a pruned version of large-v3 (4 decoder layers instead of 32) and runs in under 100ms for sub-30s audio chunks on Cloudflare's GPU fleet.

Project Setup


bash
npm create cloudflare@latest whisper-edge-worker -- --type worker
cd whisper-edge-worker
npm install hono

Update wrangler.toml:


toml
name = "whisper-edge-worker"
main = "src/index.ts"
compatibility_date = "2026-01-01"
compatibility_flags = ["nodejs_compat"]

[ai]
binding = "AI"

[[durable_objects.bindings]]
name = "TRANSCRIPTION_SESSION"
class_name = "TranscriptionSession"

[[migrations]]
tag = "v1"
new_classes = ["TranscriptionSession"]

The WebSocket-Enabled Durable Object

The critical insight here is that Cloudflare Workers are stateless: isolates may be reused, but nothing guarantees that consecutive requests reach the same one or that in-memory state survives between them. For a WebSocket session, you need a Durable Object to hold the connection for its lifetime and buffer audio chunks in sequence.


typescript
// src/session.ts
import { DurableObject } from "cloudflare:workers";

interface Env {
  AI: Ai;
  TRANSCRIPTION_SESSION: DurableObjectNamespace;
}

interface AudioChunk {
  sequenceId: number;
  audioData: ArrayBuffer;
  sampleRate: number;
  timestamp: number;
}

export class TranscriptionSession extends DurableObject<Env> {
  private ws: WebSocket | null = null;
  private audioBuffer: Float32Array[] = [];
  private bufferDurationMs = 0;
  private readonly CHUNK_THRESHOLD_MS = 500; // send to Whisper every 500ms
  private readonly SAMPLE_RATE = 16000;
  private processingLock = false;

  async fetch(request: Request): Promise<Response> {
    const upgradeHeader = request.headers.get("Upgrade");
    if (!upgradeHeader || upgradeHeader !== "websocket") {
      return new Response("Expected WebSocket", { status: 426 });
    }

    const [client, server] = Object.values(new WebSocketPair());
    this.ws = server;

    server.accept();

    server.addEventListener("message", async (event) => {
      if (typeof event.data === "string") {
        const msg = JSON.parse(event.data);
        await this.handleControlMessage(msg);
      } else if (event.data instanceof ArrayBuffer) {
        await this.handleAudioChunk(event.data);
      }
    });

    server.addEventListener("close", () => {
      this.audioBuffer = [];
      this.bufferDurationMs = 0;
    });

    return new Response(null, {
      status: 101,
      webSocket: client,
    });
  }

  private async handleControlMessage(msg: { type: string; config?: Record<string, unknown> }) {
    if (msg.type === "start") {
      this.send({ type: "ready", sessionId: this.ctx.id.toString() });
    } else if (msg.type === "flush") {
      await this.flushBuffer();
    }
  }

  private async handleAudioChunk(data: ArrayBuffer) {
    // Binary protocol: first 4 bytes = sequenceId (uint32),
    // next 4 bytes = sampleRate (uint32), rest = PCM Float32 audio
    const view = new DataView(data);
    const sequenceId = view.getUint32(0);
    const sampleRate = view.getUint32(4);
    const pcmData = new Float32Array(data, 8);

    // Resample to 16kHz if needed (Whisper expects 16kHz)
    const resampled = sampleRate === this.SAMPLE_RATE
      ? pcmData
      : this.resample(pcmData, sampleRate, this.SAMPLE_RATE);

    this.audioBuffer.push(resampled);
    this.bufferDurationMs += (resampled.length / this.SAMPLE_RATE) * 1000;

    // Trigger inference when buffer is full enough
    if (this.bufferDurationMs >= this.CHUNK_THRESHOLD_MS && !this.processingLock) {
      await this.flushBuffer();
    }
  }

  private async flushBuffer() {
    if (this.audioBuffer.length === 0 || this.processingLock) return;

    this.processingLock = true;
    const chunks = [...this.audioBuffer];
    this.audioBuffer = [];
    this.bufferDurationMs = 0;

    // Concatenate all buffered Float32 chunks
    const totalLength = chunks.reduce((sum, c) => sum + c.length, 0);
    const combined = new Float32Array(totalLength);
    let offset = 0;
    for (const chunk of chunks) {
      combined.set(chunk, offset);
      offset += chunk.length;
    }

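    try {
      const startTime = Date.now();

      // whisper-large-v3-turbo takes a base64-encoded audio file rather than
      // raw Float32 samples, so wrap the PCM in a WAV container first.
      const result = await this.env.AI.run("@cf/openai/whisper-large-v3-turbo", {
        audio: this.encodeWavBase64(combined),
      }) as { text: string; segments?: Array<{ start: number; end: number; text: string }> };

      const inferenceMs = Date.now() - startTime;

      this.send({
        type: "transcript",
        text: result.text,
        segments: result.segments ?? [],
        inferenceMs,
        isFinal: false,
      });
    } catch (err) {
      this.send({ type: "error", message: String(err) });
    } finally {
      this.processingLock = false;
    }
  }

  // Encode mono Float32 PCM as a 16-bit WAV file and return it as base64.
  private encodeWavBase64(pcm: Float32Array): string {
    const dataBytes = pcm.length * 2;
    const view = new DataView(new ArrayBuffer(44 + dataBytes));
    const writeTag = (offset: number, tag: string) => {
      for (let i = 0; i < tag.length; i++) view.setUint8(offset + i, tag.charCodeAt(i));
    };
    writeTag(0, "RIFF");
    view.setUint32(4, 36 + dataBytes, true);
    writeTag(8, "WAVE");
    writeTag(12, "fmt ");
    view.setUint32(16, 16, true);                   // fmt chunk size
    view.setUint16(20, 1, true);                    // format: PCM
    view.setUint16(22, 1, true);                    // channels: mono
    view.setUint32(24, this.SAMPLE_RATE, true);     // sample rate
    view.setUint32(28, this.SAMPLE_RATE * 2, true); // byte rate
    view.setUint16(32, 2, true);                    // block align
    view.setUint16(34, 16, true);                   // bits per sample
    writeTag(36, "data");
    view.setUint32(40, dataBytes, true);
    for (let i = 0; i < pcm.length; i++) {
      const s = Math.max(-1, Math.min(1, pcm[i]));
      view.setInt16(44 + i * 2, Math.round(s * 0x7fff), true);
    }
    // btoa needs a binary string; build it in slices to keep argument lists small
    const bytes = new Uint8Array(view.buffer);
    let binary = "";
    for (let i = 0; i < bytes.length; i += 0x8000) {
      binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
    }
    return btoa(binary);
  }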

  private resample(input: Float32Array, fromRate: number, toRate: number): Float32Array {
    const ratio = fromRate / toRate;
    const outputLength = Math.round(input.length / ratio);
    const output = new Float32Array(outputLength);
    for (let i = 0; i < outputLength; i++) {
      const srcIdx = i * ratio;
      const srcIdxFloor = Math.floor(srcIdx);
      const frac = srcIdx - srcIdxFloor;
      output[i] = input[srcIdxFloor] * (1 - frac) + (input[srcIdxFloor + 1] ?? 0) * frac;
    }
    return output;
  }

  private send(data: unknown) {
    this.ws?.send(JSON.stringify(data));
  }
}
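
The binary frame layout used by handleAudioChunk (4-byte sequence ID, 4-byte sample rate, then Float32 PCM) is easy to get subtly wrong, so it helps to isolate it in a pair of pure functions that can be unit-tested. This is a minimal sketch; encodeFrame, decodeFrame, and the AudioFrame type are hypothetical names introduced here, not part of the Worker above.

```typescript
const HEADER_BYTES = 8;

interface AudioFrame {
  sequenceId: number;
  sampleRate: number;
  pcm: Float32Array;
}

// Header fields are big-endian (the DataView default). The PCM payload is
// copied in platform byte order, which is little-endian on practically all
// browsers and servers.
function encodeFrame(frame: AudioFrame): ArrayBuffer {
  const buffer = new ArrayBuffer(HEADER_BYTES + frame.pcm.byteLength);
  const view = new DataView(buffer);
  view.setUint32(0, frame.sequenceId);
  view.setUint32(4, frame.sampleRate);
  new Float32Array(buffer, HEADER_BYTES).set(frame.pcm);
  return buffer;
}

function decodeFrame(buffer: ArrayBuffer): AudioFrame {
  // Reject frames that are too short or whose payload is not whole floats
  if (buffer.byteLength < HEADER_BYTES || (buffer.byteLength - HEADER_BYTES) % 4 !== 0) {
    throw new RangeError("malformed audio frame");
  }
  const view = new DataView(buffer);
  return {
    sequenceId: view.getUint32(0),
    sampleRate: view.getUint32(4),
    pcm: new Float32Array(buffer, HEADER_BYTES),
  };
}

const roundTrip = decodeFrame(
  encodeFrame({ sequenceId: 7, sampleRate: 16000, pcm: new Float32Array([0.5, -0.25]) })
);
console.log(roundTrip.sequenceId, roundTrip.sampleRate, Array.from(roundTrip.pcm));
// 7 16000 [ 0.5, -0.25 ]
```

Validating the length before building the Float32Array view turns a malformed frame into a clear error instead of an exception from the typed-array constructor.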

The Main Worker Entry Point


typescript
// src/index.ts
import { Hono } from "hono";
import { cors } from "hono/cors";
import { TranscriptionSession } from "./session";

export { TranscriptionSession };

interface Env {
  AI: Ai;
  TRANSCRIPTION_SESSION: DurableObjectNamespace;
}

const app = new Hono<{ Bindings: Env }>();

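// CORS does not apply to WebSocket handshakes, so scope it to plain HTTP routes
// and leave the 101 upgrade response from the Durable Object untouched.
app.use("/health", cors({ origin: "*" }));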

app.get("/transcribe", async (c) => {
  const sessionId = c.req.query("session") ?? crypto.randomUUID();
  const id = c.env.TRANSCRIPTION_SESSION.idFromName(sessionId);
  const stub = c.env.TRANSCRIPTION_SESSION.get(id);
  return stub.fetch(c.req.raw);
});

app.get("/health", (c) => c.json({ ok: true, timestamp: Date.now() }));

export default app;

🔇 Voice Activity Detection: Stop Sending Silence

Sending silent audio to Whisper wastes compute and money. Silence-only chunks produce hallucinated output (Whisper will invent text for silence), increase costs, and pollute your transcription. The solution is Voice Activity Detection (VAD) on the client side.

Silero VAD is a 1.8MB ONNX model that runs in the browser using ONNX Runtime Web. For each ~30ms audio frame it outputs a speech probability between 0.0 (silence) and 1.0 (speech), with ~95% accuracy, and runs at under 1ms per frame on a mid-range laptop.


typescript
// client/vad.ts
import { InferenceSession, Tensor } from "onnxruntime-web";

export class SileroVAD {
  private session: InferenceSession | null = null;
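  // LSTM state tensors (the h/c inputs match the Silero VAD v4 graph);
  // both are assigned in initialize()
  private h!: Tensor;
  private c!: Tensor;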
  private readonly THRESHOLD = 0.5;
  private readonly FRAME_SIZE = 512; // 32ms at 16kHz

  async initialize() {
    this.session = await InferenceSession.create("/models/silero_vad.onnx", {
      executionProviders: ["wasm"],
    });
    // Reset LSTM state
    this.h = new Tensor("float32", new Float32Array(2 * 64), [2, 1, 64]);
    this.c = new Tensor("float32", new Float32Array(2 * 64), [2, 1, 64]);
  }

  async isSpeech(frame: Float32Array): Promise<boolean> {
    if (!this.session) throw new Error("VAD not initialized");

    const input = new Tensor("float32", frame, [1, frame.length]);
    const srTensor = new Tensor("int64", BigInt64Array.from([16000n]), [1]);

    const { output, hn, cn } = await this.session.run({
      input,
      sr: srTensor,
      h: this.h,
      c: this.c,
    });

    // Update LSTM state for next frame
    this.h = hn;
    this.c = cn;

    return (output.data[0] as number) > this.THRESHOLD;
  }

  reset() {
    this.h = new Tensor("float32", new Float32Array(2 * 64), [2, 1, 64]);
    this.c = new Tensor("float32", new Float32Array(2 * 64), [2, 1, 64]);
  }
}

Integrating VAD into the AudioWorklet Pipeline


typescript
// client/transcription-client.ts
import { SileroVAD } from "./vad";

export class TranscriptionClient {
  private ws: WebSocket | null = null;
  private vad: SileroVAD;
  private audioCtx: AudioContext | null = null;
  private workletNode: AudioWorkletNode | null = null;
  private silenceFrames = 0;
  private readonly SILENCE_FLUSH_THRESHOLD = 10; // 10 silent frames = 320ms → flush
  private sessionId: string;

  constructor(private readonly serverUrl: string) {
    this.vad = new SileroVAD();
    this.sessionId = crypto.randomUUID();
  }

  async start() {
    await this.vad.initialize();
    await this.connectWebSocket();

    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        channelCount: 1,
        sampleRate: 16000,
        echoCancellation: true,
        noiseSuppression: true,
      },
    });

    this.audioCtx = new AudioContext({ sampleRate: 16000 });
    await this.audioCtx.audioWorklet.addModule("/audio-processor.js");

    const source = this.audioCtx.createMediaStreamSource(stream);
    this.workletNode = new AudioWorkletNode(this.audioCtx, "audio-processor");

    this.workletNode.port.onmessage = async (e) => {
      const frame: Float32Array = e.data;
      const isSpeech = await this.vad.isSpeech(frame);

      if (isSpeech) {
        this.silenceFrames = 0;
        this.sendAudioChunk(frame);
      } else {
        this.silenceFrames++;
        if (this.silenceFrames === this.SILENCE_FLUSH_THRESHOLD) {
          // End of utterance — flush server buffer
          this.ws?.send(JSON.stringify({ type: "flush" }));
          this.vad.reset();
        }
      }
    };

    source.connect(this.workletNode);
  }

  private sendAudioChunk(pcm: Float32Array) {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) return;

    // Binary frame: [sequenceId: uint32][sampleRate: uint32][pcm: float32[]]
    const buffer = new ArrayBuffer(8 + pcm.byteLength);
    const view = new DataView(buffer);
    view.setUint32(0, this.sequenceCounter++);
    view.setUint32(4, 16000);
    new Float32Array(buffer, 8).set(pcm);
    this.ws.send(buffer);
  }

  private sequenceCounter = 0;

  private async connectWebSocket() {
    const url = `${this.serverUrl}/transcribe?session=${this.sessionId}`;
    this.ws = new WebSocket(url);

    this.ws.binaryType = "arraybuffer";

    this.ws.onopen = () => {
      this.ws!.send(JSON.stringify({ type: "start" }));
    };

    this.ws.onmessage = (e) => {
      const msg = JSON.parse(e.data);
      if (msg.type === "transcript") {
        this.onTranscript?.(msg.text, msg.segments, msg.inferenceMs);
      } else if (msg.type === "error") {
        console.error("Transcription error:", msg.message);
      }
    };

    this.ws.onclose = (e) => {
      if (!e.wasClean) {
        // Retry after a fixed 1s delay by opening a fresh socket
        setTimeout(() => void this.connectWebSocket(), 1000);
      }
    };
  }

  onTranscript?: (text: string, segments: unknown[], latencyMs: number) => void;
}
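
The client above loads /audio-processor.js but its contents are not shown. An AudioWorkletProcessor receives audio in 128-sample render quanta, while the VAD expects 512-sample frames, so the processor has to re-frame the stream. The sketch below shows only that buffering logic as a plain class; FrameAccumulator is a hypothetical name, and wiring it into a processor is left as an assumption described in the note after the code.

```typescript
// Collects arbitrarily sized sample blocks and emits fixed-size frames.
class FrameAccumulator {
  private readonly frameSize: number;
  private pending: Float32Array;
  private filled = 0;

  constructor(frameSize = 512) {
    this.frameSize = frameSize;
    this.pending = new Float32Array(frameSize);
  }

  // Append one block of samples; returns every full frame completed by it.
  push(block: Float32Array): Float32Array[] {
    const frames: Float32Array[] = [];
    let offset = 0;
    while (offset < block.length) {
      const take = Math.min(this.frameSize - this.filled, block.length - offset);
      this.pending.set(block.subarray(offset, offset + take), this.filled);
      this.filled += take;
      offset += take;
      if (this.filled === this.frameSize) {
        frames.push(this.pending);
        // Allocate a fresh buffer so emitted frames are never overwritten
        this.pending = new Float32Array(this.frameSize);
        this.filled = 0;
      }
    }
    return frames;
  }
}

const acc = new FrameAccumulator(512);
let emitted = 0;
for (let i = 0; i < 4; i++) {
  emitted += acc.push(new Float32Array(128)).length;
}
console.log(emitted); // 1
```

Inside a processor registered as "audio-processor", process() would call push(inputs[0][0]) for each quantum and post every returned frame to the main thread with this.port.postMessage(frame).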

🏗️ Self-Hosting faster-whisper on fly.io with GPU Instances

When Cloudflare Workers AI doesn't meet your accuracy or customization needs (e.g., custom vocabulary, speaker diarization, language-specific fine-tuning), fly.io's GPU machines are the next stop. The A10 instance ($1.95/hr) with 24GB VRAM can run whisper-large-v3 with faster-whisper at comfortable concurrency.

The Python WebSocket Server (faster-whisper)


python
# server.py
import asyncio
import json
import struct
import numpy as np
import websockets
from faster_whisper import WhisperModel

model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16",  # Halves VRAM, negligible accuracy impact
    num_workers=4,
)

BUFFER_DURATION_S = 0.5  # 500ms chunks
SAMPLE_RATE = 16000

async def handle_session(websocket):
    audio_buffer = []
    buffer_samples = 0
    threshold_samples = int(BUFFER_DURATION_S * SAMPLE_RATE)

    async for message in websocket:
        if isinstance(message, str):
            msg = json.loads(message)
            if msg["type"] == "start":
                await websocket.send(json.dumps({"type": "ready"}))
            elif msg["type"] == "flush" and audio_buffer:
                await transcribe_and_send(websocket, audio_buffer)
                audio_buffer = []
                buffer_samples = 0
        else:
            # Binary: [sequenceId: uint32][sampleRate: uint32][pcm: float32[]]
            seq_id, sample_rate = struct.unpack_from(">II", message, 0)
            pcm = np.frombuffer(message[8:], dtype=np.float32).copy()

            if sample_rate != SAMPLE_RATE:
                # Resample with scipy if needed; keep float32 for faster-whisper
                from scipy import signal
                pcm = signal.resample(pcm, int(len(pcm) * SAMPLE_RATE / sample_rate)).astype(np.float32)

            audio_buffer.append(pcm)
            buffer_samples += len(pcm)

            if buffer_samples >= threshold_samples:
                await transcribe_and_send(websocket, audio_buffer)
                audio_buffer = []
                buffer_samples = 0

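def transcribe_sync(audio):
    # faster-whisper is blocking and yields segments lazily, so both the call
    # and the iteration happen here, off the asyncio event loop.
    segments_gen, info = model.transcribe(
        audio,
        beam_size=5,
        language="en",
        vad_filter=True,  # Built-in VAD!
        word_timestamps=True,
    )

    full_text = ""
    segments = []
    for seg in segments_gen:
        full_text += seg.text
        segments.append({
            "start": seg.start,
            "end": seg.end,
            "text": seg.text,
            "words": [{"word": w.word, "start": w.start, "end": w.end} for w in (seg.words or [])],
        })
    return full_text.strip(), segments, info.language

async def transcribe_and_send(websocket, chunks):
    audio = np.concatenate(chunks)
    # Run inference in a worker thread so other sessions keep being served
    text, segments, language = await asyncio.to_thread(transcribe_sync, audio)

    await websocket.send(json.dumps({
        "type": "transcript",
        "text": text,
        "segments": segments,
        "language": language,
    }))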

async def main():
    async with websockets.serve(handle_session, "0.0.0.0", 8080, max_size=10_000_000):
        await asyncio.Future()  # run forever

asyncio.run(main())

fly.toml for GPU Deployment


toml
app = "whisper-transcription-server"
primary_region = "sjc"  # San Jose — close to AWS us-west-2 if you need hybrid

[build]
  dockerfile = "Dockerfile"

[http_service]
  internal_port = 8080
  force_https = true
  auto_stop_machines = false  # Keep warm for latency
  auto_start_machines = true

[[vm]]
  size = "a10"         # 24GB VRAM NVIDIA A10
  memory = "32gb"
  cpu_kind = "performance"
  cpus = 8


dockerfile
# faster-whisper (CTranslate2) needs cuDNN at runtime, so use a cudnn image
FROM nvidia/cuda:12.3.2-cudnn9-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install faster-whisper websockets scipy numpy
COPY server.py .
CMD ["python3", "server.py"]

🗣️ Speaker Diarization: Who Said What

For meeting transcription, you need to know which speaker said which phrase. pyannote.audio 3.x is a widely used open-source toolkit for this. Its diarization pipeline runs as a separate model alongside faster-whisper, and speaker labels are assigned to each Whisper segment by matching timestamps.


python
import numpy as np
import torch
from pyannote.audio import Pipeline as DiarizationPipeline

diarization_pipeline = DiarizationPipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # pipeline published for pyannote.audio 3.x
    use_auth_token="YOUR_HF_TOKEN",
).to(torch.device("cuda"))

async def diarize(audio_np: np.ndarray, whisper_segments: list) -> list:
    # pyannote accepts an in-memory dict with a (channel, time) waveform tensor.
    # Note: this call is blocking; run it in a thread if it shares an event loop.
    waveform = torch.tensor(audio_np).unsqueeze(0)
    diarization = diarization_pipeline({"waveform": waveform, "sample_rate": 16000})

    # Build speaker map: (start, end) -> speaker
    speaker_map = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        speaker_map.append((turn.start, turn.end, speaker))

    # Assign speakers to Whisper segments
    for seg in whisper_segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = "UNKNOWN"
        for s_start, s_end, s_label in speaker_map:
            if s_start <= mid <= s_end:
                speaker = s_label
                break
        seg["speaker"] = speaker

    return whisper_segments

The combined output looks like:


json
{
  "type": "transcript",
  "segments": [
    { "start": 0.0, "end": 3.2, "text": " Hello everyone.", "speaker": "SPEAKER_00" },
    { "start": 3.5, "end": 7.1, "text": " Thanks for joining.", "speaker": "SPEAKER_01" }
  ]
}
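
Midpoint matching, as used above, can mislabel a segment whose midpoint falls in a gap between speaker turns or right at a speaker change. One alternative, sketched here in TypeScript for a client or Worker that receives both lists, is to pick the speaker whose turn overlaps the segment the longest. The Turn and Segment shapes and the assignSpeakers name are assumptions for illustration; this is not how pyannote itself merges results.

```typescript
interface Turn {
  start: number;
  end: number;
  speaker: string;
}

interface Segment {
  start: number;
  end: number;
  text: string;
  speaker?: string;
}

// Length in seconds of the intersection of two time intervals (0 if disjoint)
function overlapSeconds(
  a: { start: number; end: number },
  b: { start: number; end: number }
): number {
  return Math.max(0, Math.min(a.end, b.end) - Math.max(a.start, b.start));
}

// Label each segment with the speaker whose turn overlaps it the most
function assignSpeakers(segments: Segment[], turns: Turn[]): Segment[] {
  return segments.map((seg) => {
    let best = "UNKNOWN";
    let bestOverlap = 0;
    for (const turn of turns) {
      const overlap = overlapSeconds(seg, turn);
      if (overlap > bestOverlap) {
        bestOverlap = overlap;
        best = turn.speaker;
      }
    }
    return { ...seg, speaker: best };
  });
}

const labeled = assignSpeakers(
  [
    { start: 0.0, end: 3.2, text: " Hello everyone." },
    { start: 3.5, end: 7.1, text: " Thanks for joining." },
  ],
  [
    { start: 0.0, end: 3.3, speaker: "SPEAKER_00" },
    { start: 3.4, end: 8.0, speaker: "SPEAKER_01" },
  ]
);
console.log(labeled.map((s) => s.speaker)); // [ "SPEAKER_00", "SPEAKER_01" ]
```

Segments that overlap no turn at all keep the "UNKNOWN" label, matching the fallback in the Python version.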

🔌 WebSocket Error Handling and Reconnection Logic

Production WebSocket clients must handle disconnections gracefully. The following client-side TypeScript implements exponential backoff with jitter and queues up to 100 outgoing messages while disconnected (about 3 seconds of audio at 32ms per frame), so short outages do not drop speech.


typescript
// client/reconnecting-ws.ts
export class ReconnectingWebSocket {
  private ws: WebSocket | null = null;
  private reconnectAttempts = 0;
  private readonly MAX_RECONNECT_ATTEMPTS = 10;
  private readonly BASE_DELAY_MS = 500;
  private pendingMessages: (string | ArrayBuffer)[] = [];
  private isManualClose = false;

  constructor(
    private url: string,
    private onMessage: (data: string) => void,
    private onStateChange?: (state: "connected" | "disconnected" | "reconnecting") => void
  ) {}

  connect() {
    this.isManualClose = false;
    this.createSocket();
  }

  private createSocket() {
    this.ws = new WebSocket(this.url);
    this.ws.binaryType = "arraybuffer";

    this.ws.onopen = () => {
      this.reconnectAttempts = 0;
      this.onStateChange?.("connected");

      // Drain any buffered messages
      for (const msg of this.pendingMessages) {
        this.ws!.send(msg);
      }
      this.pendingMessages = [];
    };

    this.ws.onmessage = (e) => {
      if (typeof e.data === "string") {
        this.onMessage(e.data);
      }
    };

    this.ws.onerror = (e) => {
      console.warn("WebSocket error", e);
    };

    this.ws.onclose = (e) => {
      if (this.isManualClose) return;
      this.onStateChange?.("disconnected");

      if (this.reconnectAttempts < this.MAX_RECONNECT_ATTEMPTS) {
        this.scheduleReconnect();
      } else {
        console.error("Max reconnect attempts reached. Giving up.");
      }
    };
  }

  private scheduleReconnect() {
    this.onStateChange?.("reconnecting");
    const delay = Math.min(
      this.BASE_DELAY_MS * Math.pow(2, this.reconnectAttempts) + Math.random() * 200,
      30000 // cap at 30s
    );
    this.reconnectAttempts++;
    setTimeout(() => this.createSocket(), delay);
  }

  send(data: string | ArrayBuffer) {
    if (this.ws?.readyState === WebSocket.OPEN) {
      this.ws.send(data);
    } else {
      // Buffer up to 100 messages during disconnection
      if (this.pendingMessages.length < 100) {
        this.pendingMessages.push(data);
      }
    }
  }

  close() {
    this.isManualClose = true;
    this.ws?.close(1000, "Client closed");
  }
}
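
To see what schedule scheduleReconnect actually produces, the delay formula can be pulled out into a pure function. This is a sketch that mirrors the constants in the class above (500ms base, up to 200ms jitter, 30s cap); backoffDelayMs is a hypothetical helper name, and jitter is passed in so the result is deterministic in tests.

```typescript
// Delay before reconnect attempt number `attempt` (0-based).
function backoffDelayMs(
  attempt: number,
  jitterMs: number = Math.random() * 200,
  baseMs = 500,
  capMs = 30_000
): number {
  return Math.min(baseMs * 2 ** attempt + jitterMs, capMs);
}

// Schedule without jitter for the 10 attempts the class allows
const schedule = Array.from({ length: 10 }, (_, attempt) => backoffDelayMs(attempt, 0));
console.log(schedule);
// [ 500, 1000, 2000, 4000, 8000, 16000, 30000, 30000, 30000, 30000 ]

const totalWaitMs = schedule.reduce((sum, d) => sum + d, 0);
console.log(totalWaitMs); // 151500
```

With these constants the cap is reached on the seventh attempt, and a client waits roughly two and a half minutes in total before giving up. If that is too long for a live captioning UI, lowering MAX_RECONNECT_ATTEMPTS or the cap is the knob to turn.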

💰 Cost Analysis: Cloudflare AI vs. OpenAI API vs. Self-Hosted

For a 1-hour meeting producing 3600 seconds of audio, chunked into 500ms segments (7200 inference calls):

Provider	Pricing Model	Cost/Hour	Latency	Accuracy
OpenAI Whisper API	$0.006/minute	$0.36	400–700ms	High (large-v3 equiv)
Cloudflare Workers AI	$0.0001/request	$0.72	80–180ms	Good (large-v3-turbo)
fly.io A10 (self-hosted)	$1.95/hr flat	$1.95	40–120ms	Highest (configurable)
Groq Whisper API	$0.111/hour audio	$0.111	20–80ms	High (large-v3)

Key insight: At low volume (< 100 active sessions/day), Cloudflare Workers AI is optimal — no infrastructure to manage, globally distributed, pay-per-use. At high volume (> 500 concurrent sessions), a reserved fly.io A10 is cheaper and faster. Groq offers the best latency but has rate limits.
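
The per-request and flat-rate rows in the table can be compared with a few lines of arithmetic. The prices below are the ones quoted in the table; the function names are hypothetical. The model deliberately ignores how many sessions one GPU can actually serve, idle hours, and operational overhead.

```typescript
const CF_PRICE_PER_REQUEST_USD = 0.0001; // from the table above
const FLY_A10_PER_HOUR_USD = 1.95;       // from the table above

// Number of inference calls needed for one hour of audio
function requestsPerAudioHour(chunkSeconds: number): number {
  return 3600 / chunkSeconds;
}

// Per-request pricing cost for one hour of audio
function workersAiCostPerAudioHour(chunkSeconds: number): number {
  return requestsPerAudioHour(chunkSeconds) * CF_PRICE_PER_REQUEST_USD;
}

// Number of continuously active sessions at which per-request pricing
// costs the same as one flat-rate GPU hour
function breakEvenSessions(chunkSeconds: number): number {
  return FLY_A10_PER_HOUR_USD / workersAiCostPerAudioHour(chunkSeconds);
}

console.log(requestsPerAudioHour(0.5));                 // 7200
console.log(workersAiCostPerAudioHour(0.5).toFixed(2)); // "0.72"
console.log(breakEvenSessions(0.5).toFixed(1));         // "2.7"
```

On price alone, one always-on A10 matches per-request billing at about 2.7 continuously active sessions. The practical threshold is higher: sessions are rarely active around the clock, a single GPU has a concurrency ceiling, and self-hosting adds operational cost that this sketch does not model.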

For hybrid deployments: route to Cloudflare Workers AI by default, fail over to Groq when CF has regional issues, and use self-hosted fly.io for premium/enterprise users who need custom models.

HYBRID ROUTING ARCHITECTURE

Client ──► Edge Router (Cloudflare Worker)
              │
              ├── [Default] ──► CF Workers AI (whisper-large-v3-turbo)
              ├── [Premium] ──► fly.io GPU (faster-whisper + diarization)
              └── [Fallback] ──► Groq API
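
The routing decision in the diagram reduces to a small pure function. The sketch below assumes the router already knows the caller's tier and has some health signal for Workers AI; how that signal is produced (error rates, a health probe) is not covered here, and all names are hypothetical.

```typescript
type Backend = "workers-ai" | "fly-gpu" | "groq";

interface RouteInput {
  tier: "default" | "premium";
  workersAiHealthy: boolean; // assumed to come from your own health checks
}

function pickBackend(input: RouteInput): Backend {
  // Premium users always get the self-hosted stack (custom models, diarization)
  if (input.tier === "premium") return "fly-gpu";
  // Everyone else uses Workers AI, falling back to Groq during regional issues
  return input.workersAiHealthy ? "workers-ai" : "groq";
}

console.log(pickBackend({ tier: "default", workersAiHealthy: true }));  // "workers-ai"
console.log(pickBackend({ tier: "default", workersAiHealthy: false })); // "groq"
console.log(pickBackend({ tier: "premium", workersAiHealthy: false })); // "fly-gpu"
```

Keeping this logic free of I/O makes it trivial to test and lets the Worker log the chosen backend alongside each session.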

🚀 Full Production Architecture

Here is the complete architecture for a production transcription service handling 10,000 concurrent users:

┌─────────────────────────────────────────────────────────────────┐
│                         CLIENT BROWSER                          │
│                                                                 │
│  MediaStream ──► AudioWorklet ──► SileroVAD ──► ReconnectingWS  │
│                    (16kHz PCM)    (filter silence)  (binary)    │
└─────────────────────────────┬───────────────────────────────────┘
                              │ WebSocket (binary PCM frames)
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                   CLOUDFLARE EDGE (300+ PoPs)                   │
│                                                                 │
│  Worker Entry Point                                             │
│  ├── Auth (JWT validation)                                      │
│  ├── Rate Limiting (Cloudflare KV)                              │
│  └── Route to Durable Object (by sessionId)                     │
│                                                                 │
│  TranscriptionSession (Durable Object)                          │
│  ├── WebSocket lifetime management                              │
│  ├── Audio buffer (500ms windows)                               │
│  ├── AI.run("@cf/openai/whisper-large-v3-turbo")               │
│  └── Streaming transcript responses                             │
└─────────────────────────────┬───────────────────────────────────┘
                              │ (for premium tier)
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                 fly.io GPU CLUSTER (sjc region)                  │
│                                                                 │
│  faster-whisper (large-v3, float16)                             │
│  + pyannote speaker diarization v3.3                            │
│  + Custom vocabulary injection                                  │
└─────────────────────────────────────────────────────────────────┘

Key Production Considerations

1. Auth and Rate Limiting: Generate short-lived JWTs (5 min TTL) for WebSocket URLs. Validate in the Worker before routing to the Durable Object. Store per-user rate limits in Cloudflare KV.

2. Audio Chunk Size Tuning: 500ms chunks are the sweet spot. Smaller chunks (100ms) improve real-time feel but create more inference calls and more hallucinations from short context. Larger chunks (2000ms) improve accuracy but hurt latency.

3. Language Detection: Pass language: null to Whisper on the first chunk to auto-detect, then lock the detected language for subsequent chunks. This avoids re-detection overhead on every chunk.

4. Monitoring: Emit structured logs from every inference call — chunk size, inference latency, detected language, token count, VAD decision. Push to Cloudflare Logpush for analysis in Datadog or Grafana.


typescript
// Structured telemetry per inference
const telemetry = {
  sessionId,
  region: request.cf?.colo,
  chunkDurationMs: bufferDurationMs,
  inferenceMs,
  wordCount: result.text.trim().split(/\s+/).length, // whitespace-split words, a rough proxy for tokens
  vadFiltered: false,
  timestamp: Date.now(),
};
console.log(JSON.stringify(telemetry));
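
The language-detection point above (auto-detect on the first chunk, then lock) needs a small piece of per-session state. A minimal sketch follows; LanguageLock is a hypothetical class, and it assumes the backend reports the detected language with each result, as the faster-whisper server earlier in this post does via its "language" field.

```typescript
class LanguageLock {
  private locked: string | null = null;

  // Language to request for the next chunk; null means "auto-detect"
  next(): string | null {
    return this.locked;
  }

  // Record the language detected for a chunk; only the first detection sticks
  observe(detected: string | null | undefined): void {
    if (this.locked === null && detected) {
      this.locked = detected;
    }
  }
}

const lock = new LanguageLock();
console.log(lock.next()); // null  (first chunk: auto-detect)
lock.observe("en");
console.log(lock.next()); // "en"  (subsequent chunks: locked)
lock.observe("de");
console.log(lock.next()); // "en"  (later detections are ignored)
```

Locking on the very first chunk is fragile if that chunk is short or noisy. A variation worth considering is to lock only after two or three consecutive chunks agree.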

🎯 Key Takeaways

After building and benchmarking all three deployment paths, here is what I would recommend:

1. Start with Cloudflare Workers AI: zero infra, globally fast, and $0.72/hour at 7200 chunks/session is cheap for early-stage products.

2. Add SileroVAD on the client from day one. It cuts your Whisper inference calls by 30–50% (most audio is silence or filler), improving both cost and accuracy.

3. Use the binary WebSocket protocol described above; do not send base64-encoded audio. Base64 inflates payloads by about 33% and adds encode/decode work on both ends.

4. Buffer 500ms of audio per inference call. This is the balance point for latency vs. accuracy based on Whisper's attention window.

5. Implement exponential backoff reconnection with pending message buffering. Mobile networks are flaky; your WebSocket will disconnect.

6. Migrate to self-hosted fly.io when you need speaker diarization, custom vocabulary, or when you hit > 500 concurrent sessions and Cloudflare becomes more expensive than a reserved GPU.

The combination of edge computing, smart VAD filtering, and streaming WebSocket design brings Whisper-quality transcription down to 80–180ms end-to-end — fast enough that users stop noticing the latency entirely.
