Modern Web

Writing Custom WebGPU Compute Shaders: High-Performance Matrix Multiplication (MatMul) in WGSL

Master WebGPU GPGPU programming. Walk through writing a high-performance Matrix Multiplication shader in WebGPU Shading Language (WGSL) with shared memory tile optimization.

Sachin SharmaCreator

Jun 4, 2026

6 min read

Writing Custom WebGPU Compute Shaders: High-Performance Matrix Multiplication (MatMul) in WGSL

Featured Resource

Quick Overview

Master WebGPU GPGPU programming. Walk through writing a high-performance Matrix Multiplication shader in WebGPU Shading Language (WGSL) with shared memory tile optimization.

Writing Custom WebGPU Compute Shaders: High-Performance Matrix Multiplication (MatMul) in WGSL

In modern web development, browser-native machine learning and graphics are transforming user experiences. Under the hood, operations like Large Language Model (LLM) inference, physics simulations, and image processing boil down to a single mathematical operation: Matrix Multiplication (MatMul).

While APIs like WebGL were abused for compute, WebGPU brings direct support for Compute Shaders and GPGPU (General-Purpose computing on GPUs).

In this systems-level guide, we will write a custom Matrix Multiplication compute shader in WGSL (WebGPU Shading Language) and build the JavaScript runner to compile and execute it.

⚡ 1. The Compute Pipeline Architecture

Unlike rendering graphics where we deal with vertices and fragments, a compute pipeline performs arbitrary calculations over structured buffers.

2.
Host Memory: JavaScript allocates flat Float32Arrays for Matrix A, Matrix B, and the Output Matrix.
4.
GPU Buffers: We copy these arrays into dedicated GPU-side memory buffers (Storage Buffers).
6.
WGSL Compute Shader: A program written in WGSL that defines how parallel processing threads (workgroups) read from input buffers, execute matrix math, and write to the output buffer.
8.
Command Encoder: Records commands to submit the compute pass to the GPU queue.

[JS Matrix Arrays] ──(Map Write)──> [GPU Storage Buffers]
                                            │
                             [WebGPU Compute Pass Encoder]
                                            │
                             [WGSL Compute Shader (GPU)]
                                            │
[JS Read Buffer] <──(Command Copy) ── [GPU Output Buffer]

🏗️ 2. Writing the WGSL Compute Shader

WGSL defines execution threads in grids. We organize our threads into Workgroups (e.g. 16x16 threads).

Here is the WGSL matrix multiplication shader. It uses a simple, clean algorithm where each thread computes a single cell of the output matrix.


rust
// matmul.wgsl

// Structs representing matrix dimensions
struct MatrixInfo {
  widthA: u32,
  heightA: u32,
  widthB: u32,
  heightB: u32,
}

@group(0) @binding(0) var<storage, read> matrixA : array<f32>;
@group(0) @binding(1) var<storage, read> matrixB : array<f32>;
@group(0) @binding(2) var<storage, read_write> matrixOut : array<f32>;
@group(0) @binding(3) var<uniform> info : MatrixInfo;

@compute @workgroup_size(16, 16)
fn main(
  @builtin(global_invocation_id) global_id : vec3<u32>
) {
  let row = global_id.y;
  let col = global_id.x;

  // Boundary checks to ensure we do not write outside the output matrix bounds
  if (row >= info.heightA || col >= info.widthB) {
    return;
  }

  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < info.widthA; k = k + 1u) {
    let indexA = row * info.widthA + k;
    let indexB = k * info.widthB + col;
    sum = sum + matrixA[indexA] * matrixB[indexB];
  }

  let indexOut = row * info.widthB + col;
  matrixOut[indexOut] = sum;
}

💻 3. Building the JavaScript WebGPU Runner

Let's write the JavaScript code to request the GPU device, build the buffers, compile our WGSL code, and run the compute pass.


javascript
async function executeWebGPUMatMul() {
  // 1. Request GPU Adapter and Device
  const adapter = await navigator.gpu?.requestAdapter();
  const device = await adapter?.requestDevice();
  if (!device) {
    console.error("WebGPU is not supported on this browser.");
    return;
  }

  // Define Matrix dimensions (e.g. 512 x 512)
  const sizeX = 512;
  const sizeY = 512;

  const arrayA = new Float32Array(sizeX * sizeY).fill(1.5);
  const arrayB = new Float32Array(sizeX * sizeY).fill(2.0);
  const arrayOut = new Float32Array(sizeX * sizeY);

  // 2. Allocate storage buffers on the GPU
  const gpuBufferA = device.createBuffer({
    size: arrayA.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(gpuBufferA, 0, arrayA);

  const gpuBufferB = device.createBuffer({
    size: arrayB.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(gpuBufferB, 0, arrayB);

  const gpuBufferOut = device.createBuffer({
    size: arrayOut.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });

  // 3. Allocate a uniform buffer for Matrix dimensions
  const infoArray = new Uint32Array([sizeX, sizeY, sizeX, sizeY]);
  const gpuBufferInfo = device.createBuffer({
    size: infoArray.byteLength,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(gpuBufferInfo, 0, infoArray);

  // 4. Load and Compile WGSL Code
  const shaderModule = device.createShaderModule({
    code: `
      struct MatrixInfo {
        widthA: u32,
        heightA: u32,
        widthB: u32,
        heightB: u32,
      }
      @group(0) @binding(0) var<storage, read> matrixA : array<f32>;
      @group(0) @binding(1) var<storage, read> matrixB : array<f32>;
      @group(0) @binding(2) var<storage, read_write> matrixOut : array<f32>;
      @group(0) @binding(3) var<uniform> info : MatrixInfo;

      @compute @workgroup_size(16, 16)
      fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
        let row = global_id.y;
        let col = global_id.x;
        if (row >= info.heightA || col >= info.widthB) { return; }
        var sum = 0.0;
        for (var k = 0u; k < info.widthA; k = k + 1u) {
          sum = sum + matrixA[row * info.widthA + k] * matrixB[k * info.widthB + col];
        }
        matrixOut[row * info.widthB + col] = sum;
      }
    `,
  });

  // 5. Create Bind Group Layout and Bind Group
  const bindGroup = device.createBindGroup({
    layout: device.createComputePipeline({
      layout: 'auto',
      compute: { module: shaderModule, entryPoint: 'main' }
    }).getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: gpuBufferA } },
      { binding: 1, resource: { buffer: gpuBufferB } },
      { binding: 2, resource: { buffer: gpuBufferOut } },
      { binding: 3, resource: { buffer: gpuBufferInfo } },
    ],
  });

  // 6. Define Compute Pipeline
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
      module: shaderModule,
      entryPoint: 'main',
    },
  });

  // 7. Record & Execute Commands
  const commandEncoder = device.createCommandEncoder();
  const passEncoder = commandEncoder.beginComputePass();
  passEncoder.setPipeline(pipeline);
  passEncoder.setBindGroup(0, bindGroup);
  
  // Calculate dispatch size based on workgroup size (16, 16)
  const workgroupCountX = Math.ceil(sizeX / 16);
  const workgroupCountY = Math.ceil(sizeY / 16);
  passEncoder.dispatchWorkgroups(workgroupCountX, workgroupCountY);
  passEncoder.end();

  // 8. Copy GPU buffer back to Host Staging buffer for reading
  const gpuReadBuffer = device.createBuffer({
    size: arrayOut.byteLength,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
  commandEncoder.copyBufferToBuffer(gpuBufferOut, 0, gpuReadBuffer, 0, arrayOut.byteLength);

  // Submit to GPU Queue
  device.queue.submit([commandEncoder.finish()]);

  // Map GPU memory to JavaScript space
  await gpuReadBuffer.mapAsync(GPUMapMode.READ);
  const resultData = new Float32Array(gpuReadBuffer.getMappedRange());
  console.log(`📊 Computation completed. First value: \${resultData[0]}`); // Output: 1536 (512 * 1.5 * 2.0)
  
  gpuReadBuffer.unmap();
}

🚀 4. Optimization: Shared Memory Tiling

The naive shader above is memory-bandwidth bound. For each multiply-accumulate operation, the GPU must fetch data from slow global memory.

To optimize this, we load sub-tiles of Matrix A and Matrix B into high-speed Workgroup Shared Memory (var<workgroup>) once, allowing threads within the workgroup to share and reuse the data, reducing global memory fetches by a factor of 16.

🏁 5. Conclusion

Writing custom compute shaders in WGSL allows web developers to unlock raw GPU power directly in the browser tab. By moving from graphics hacks (like rendering 2D fragments to trigger WebGL calculations) to pure Compute pipelines with WebGPU, you gain maximum compute throughput and access to low-level GPU hardware design.

WebGPU WGSL Compute Shaders Matrix Multiplication GPGPU Performance WebAssembly

Sachin Sharma

Software Developer

Building digital experiences at the intersection of design and code. Sharing weekly insights on engineering, productivity, and the future of tech.

Designing a Multi-Region Postgres Topology: Read Replicas, Logical Replication, and Safe Failover

A production-grade guide to designing highly available, low-latency multi-region PostgreSQL databases using logical replication, proxy geo-routing, and automated failover mechanics.

Building a Collaborative Whiteboard with WebRTC Mesh and Yjs CRDTs: Zero-Server Real-Time Vector Drawing

Learn how to build a fully decentralized real-time collaborative whiteboard. Synchronize dynamic freehand vectors and cursors using WebRTC and Yjs CRDTs.