Writing Custom WebGPU Compute Shaders: High-Performance Matrix Multiplication (MatMul) in WGSL
Master WebGPU GPGPU programming. Walk through writing a high-performance Matrix Multiplication shader in WebGPU Shading Language (WGSL) with shared memory tile optimization.

In modern web development, browser-native machine learning and graphics are transforming user experiences. Under the hood, workloads like Large Language Model (LLM) inference, physics simulations, and image processing spend much of their time in a single mathematical operation: Matrix Multiplication (MatMul).
WebGL could only be coaxed into general-purpose computation by encoding data in textures and running fragment shaders over it. WebGPU instead brings direct support for Compute Shaders and GPGPU (General-Purpose computing on GPUs).
In this systems-level guide, we will write a custom Matrix Multiplication compute shader in WGSL (WebGPU Shading Language) and build the JavaScript runner to compile and execute it.
⚡ 1. The Compute Pipeline Architecture
Unlike rendering graphics where we deal with vertices and fragments, a compute pipeline performs arbitrary calculations over structured buffers.
1. Host Memory: JavaScript allocates flat Float32Arrays for Matrix A, Matrix B, and the Output Matrix.
2. GPU Buffers: We copy the input arrays into dedicated GPU-side memory buffers (Storage Buffers).
3. WGSL Compute Shader: A program written in WGSL that defines how parallel invocations (threads), grouped into workgroups, read from the input buffers, execute the matrix math, and write to the output buffer.
4. Command Encoder: Records the compute pass and the copy commands, which are then submitted to the GPU queue.
[JS Matrix Arrays] ──(queue.writeBuffer)──> [GPU Storage Buffers]
                                                     │
                                     [WebGPU Compute Pass Encoder]
                                                     │
                                      [WGSL Compute Shader (GPU)]
                                                     │
[JS Read Buffer] <──(copyBufferToBuffer)── [GPU Output Buffer]
🏗️ 2. Writing the WGSL Compute Shader
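A compute dispatch launches a grid of invocations (threads). The shader groups them into Workgroups of a fixed size declared with @workgroup_size, for example 16x16 = 256 threads per workgroup, and the host decides how many workgroups to dispatch along each axis.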
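To make the grid layout concrete, here is a small host-side sketch of how a thread's global ID relates to its workgroup and local IDs, and how many workgroups are needed to cover a matrix. The helper names `globalInvocationId` and `workgroupCount` are made up for this illustration; they are not WebGPU APIs.

```javascript
// Hypothetical helper mirroring how global_invocation_id is derived:
// per axis, global = workgroup_id * workgroup_size + local_invocation_id.
function globalInvocationId(workgroupId, localId, workgroupSize) {
  return workgroupId.map(
    (wg, axis) => wg * workgroupSize[axis] + localId[axis]
  );
}

// Number of workgroups needed to cover `extent` elements along one axis.
// Rounding up means the last workgroup may contain threads past the edge,
// which is why the shader needs a bounds check.
function workgroupCount(extent, workgroupSize) {
  return Math.ceil(extent / workgroupSize);
}

const size = [16, 16, 1];

// Thread (3, 5) inside workgroup (2, 1) handles column 35, row 21.
console.log(globalInvocationId([2, 1, 0], [3, 5, 0], size)); // [ 35, 21, 0 ]

// A 500-wide matrix needs 32 workgroups of 16, i.e. 512 threads per row,
// so 12 threads per row fall outside the matrix and must return early.
console.log(workgroupCount(500, 16)); // 32
```

In the shader, `global_id.x` is used as the column and `global_id.y` as the row, so the X dispatch count should cover the output width and the Y count its height.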
Here is the WGSL matrix multiplication shader. It uses a simple, clean algorithm where each thread computes a single cell of the output matrix.
```wgsl
// matmul.wgsl

// Struct representing matrix dimensions
struct MatrixInfo {
  widthA: u32,
  heightA: u32,
  widthB: u32,
  heightB: u32,
}

@group(0) @binding(0) var<storage, read> matrixA : array<f32>;
@group(0) @binding(1) var<storage, read> matrixB : array<f32>;
@group(0) @binding(2) var<storage, read_write> matrixOut : array<f32>;
@group(0) @binding(3) var<uniform> info : MatrixInfo;

@compute @workgroup_size(16, 16)
fn main(
  @builtin(global_invocation_id) global_id : vec3<u32>
) {
  let row = global_id.y;
  let col = global_id.x;

  // Boundary checks to ensure we do not write outside the output matrix bounds
  if (row >= info.heightA || col >= info.widthB) {
    return;
  }

  var sum: f32 = 0.0;
  for (var k: u32 = 0u; k < info.widthA; k = k + 1u) {
    let indexA = row * info.widthA + k;
    let indexB = k * info.widthB + col;
    sum = sum + matrixA[indexA] * matrixB[indexB];
  }

  let indexOut = row * info.widthB + col;
  matrixOut[indexOut] = sum;
}
```
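When checking GPU output, it helps to have a CPU version that uses the same row-major indexing as the shader. The following is a minimal reference sketch; `matmulReference` and `maxAbsDiff` are names introduced here for illustration. It accumulates in double precision, so compare against the GPU's f32 results with a small tolerance rather than for exact equality.

```javascript
// CPU reference with the same row-major indexing as the WGSL shader:
// A is heightA x widthA, B is widthA x widthB, the output is heightA x widthB.
function matmulReference(a, b, heightA, widthA, widthB) {
  const out = new Float32Array(heightA * widthB);
  for (let row = 0; row < heightA; row++) {
    for (let col = 0; col < widthB; col++) {
      let sum = 0;
      for (let k = 0; k < widthA; k++) {
        sum += a[row * widthA + k] * b[k * widthB + col];
      }
      out[row * widthB + col] = sum;
    }
  }
  return out;
}

// Largest element-wise difference between two equally sized arrays.
function maxAbsDiff(x, y) {
  let max = 0;
  for (let i = 0; i < x.length; i++) {
    max = Math.max(max, Math.abs(x[i] - y[i]));
  }
  return max;
}

// 2x3 times 3x2 example.
const a = new Float32Array([1, 2, 3, 4, 5, 6]);
const b = new Float32Array([7, 8, 9, 10, 11, 12]);
console.log(Array.from(matmulReference(a, b, 2, 3, 2))); // [ 58, 64, 139, 154 ]
```

With the GPU result mapped into a Float32Array, a check such as `maxAbsDiff(gpuResult, cpuResult) < 1e-3` is one possible acceptance test; the right tolerance depends on the matrix size and value range.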
💻 3. Building the JavaScript WebGPU Runner
Let's write the JavaScript code to request the GPU device, build the buffers, compile our WGSL code, and run the compute pass.
```javascript
async function executeWebGPUMatMul() {
  // 1. Request GPU Adapter and Device
  const adapter = await navigator.gpu?.requestAdapter();
  const device = await adapter?.requestDevice();
  if (!device) {
    console.error("WebGPU is not supported on this browser.");
    return;
  }

  // Define Matrix dimensions (e.g. 512 x 512)
  const sizeX = 512;
  const sizeY = 512;
  const arrayA = new Float32Array(sizeX * sizeY).fill(1.5);
  const arrayB = new Float32Array(sizeX * sizeY).fill(2.0);
  const arrayOut = new Float32Array(sizeX * sizeY);

  // 2. Allocate storage buffers on the GPU
  const gpuBufferA = device.createBuffer({
    size: arrayA.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(gpuBufferA, 0, arrayA);

  const gpuBufferB = device.createBuffer({
    size: arrayB.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(gpuBufferB, 0, arrayB);

  const gpuBufferOut = device.createBuffer({
    size: arrayOut.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });

  // 3. Allocate a uniform buffer for Matrix dimensions
  //    (order matches MatrixInfo: widthA, heightA, widthB, heightB)
  const infoArray = new Uint32Array([sizeX, sizeY, sizeX, sizeY]);
  const gpuBufferInfo = device.createBuffer({
    size: infoArray.byteLength,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(gpuBufferInfo, 0, infoArray);

  // 4. Load and Compile WGSL Code
  const shaderModule = device.createShaderModule({
    code: `
      struct MatrixInfo {
        widthA: u32,
        heightA: u32,
        widthB: u32,
        heightB: u32,
      }

      @group(0) @binding(0) var<storage, read> matrixA : array<f32>;
      @group(0) @binding(1) var<storage, read> matrixB : array<f32>;
      @group(0) @binding(2) var<storage, read_write> matrixOut : array<f32>;
      @group(0) @binding(3) var<uniform> info : MatrixInfo;

      @compute @workgroup_size(16, 16)
      fn main(@builtin(global_invocation_id) global_id : vec3<u32>) {
        let row = global_id.y;
        let col = global_id.x;
        if (row >= info.heightA || col >= info.widthB) {
          return;
        }
        var sum = 0.0;
        for (var k = 0u; k < info.widthA; k = k + 1u) {
          sum = sum + matrixA[row * info.widthA + k] * matrixB[k * info.widthB + col];
        }
        matrixOut[row * info.widthB + col] = sum;
      }
    `,
  });

  // 5. Define Compute Pipeline (created once; 'auto' derives the layout from the shader)
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: {
      module: shaderModule,
      entryPoint: 'main',
    },
  });

  // 6. Create the Bind Group from this pipeline's own layout.
  //    A bind group built from another pipeline's 'auto' layout is not compatible.
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: gpuBufferA } },
      { binding: 1, resource: { buffer: gpuBufferB } },
      { binding: 2, resource: { buffer: gpuBufferOut } },
      { binding: 3, resource: { buffer: gpuBufferInfo } },
    ],
  });

  // 7. Record & Execute Commands
  const commandEncoder = device.createCommandEncoder();
  const passEncoder = commandEncoder.beginComputePass();
  passEncoder.setPipeline(pipeline);
  passEncoder.setBindGroup(0, bindGroup);

  // Calculate dispatch size based on workgroup size (16, 16)
  const workgroupCountX = Math.ceil(sizeX / 16);
  const workgroupCountY = Math.ceil(sizeY / 16);
  passEncoder.dispatchWorkgroups(workgroupCountX, workgroupCountY);
  passEncoder.end();

  // 8. Copy GPU buffer back to Host Staging buffer for reading
  const gpuReadBuffer = device.createBuffer({
    size: arrayOut.byteLength,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
  commandEncoder.copyBufferToBuffer(gpuBufferOut, 0, gpuReadBuffer, 0, arrayOut.byteLength);

  // Submit to GPU Queue
  device.queue.submit([commandEncoder.finish()]);

  // Map GPU memory to JavaScript space
  await gpuReadBuffer.mapAsync(GPUMapMode.READ);
  const resultData = new Float32Array(gpuReadBuffer.getMappedRange());
  console.log(`📊 Computation completed. First value: ${resultData[0]}`);
  // Output: 1536 (512 * 1.5 * 2.0)

  // resultData is detached after unmap(), so read or copy it before this call
  gpuReadBuffer.unmap();
}
```
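The shader never reads heightB, so a mismatch between widthA and heightB would pass silently on the GPU and produce meaningless results. One way to guard against this is to validate the dimensions on the host before filling the uniform buffer. The helper below is a sketch under that assumption; `makeMatrixInfo` is a hypothetical name, not part of WebGPU.

```javascript
// Hypothetical helper: validates dimensions and packs them in the same
// field order as the WGSL struct MatrixInfo (widthA, heightA, widthB, heightB).
function makeMatrixInfo(widthA, heightA, widthB, heightB) {
  for (const value of [widthA, heightA, widthB, heightB]) {
    if (!Number.isInteger(value) || value <= 0) {
      throw new RangeError(`Matrix dimension must be a positive integer, got ${value}`);
    }
  }
  // A (heightA x widthA) times B (heightB x widthB) requires widthA === heightB.
  if (widthA !== heightB) {
    throw new RangeError(
      `Inner dimensions differ: widthA=${widthA}, heightB=${heightB}`
    );
  }
  return new Uint32Array([widthA, heightA, widthB, heightB]);
}

// Four u32 fields occupy 16 bytes, which is the size of the uniform buffer.
const info = makeMatrixInfo(512, 512, 512, 512);
console.log(info.byteLength); // 16
```

The returned array can be passed to `device.queue.writeBuffer` in place of the hand-built `infoArray`.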
🚀 4. Optimization: Shared Memory Tiling
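The naive shader above is memory-bandwidth bound. Every multiply-accumulate reads one value of Matrix A and one value of Matrix B from slow global memory, even though neighboring threads in the same workgroup read many of the same values.
To optimize this, each workgroup loads a sub-tile of Matrix A and a sub-tile of Matrix B into high-speed Workgroup Shared Memory (var<workgroup>) once, and every thread in the workgroup reuses that data. With 16x16 tiles, a thread issues two global loads per 16 multiply-accumulates instead of two per multiply-accumulate, reducing global memory fetches by a factor of 16.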
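The tiling idea can be sketched as a WGSL shader string that drops into the same `createShaderModule` call and bindings as the naive version. This is an illustrative sketch, not a tuned kernel: the names `TILE`, `tileA`, `tileB`, and `globalLoadsPerThread` are introduced here, and out-of-range tile entries are zero-padded so that every thread reaches both barriers.

```javascript
// Tiled MatMul in WGSL, kept as a JS string like the naive shader.
const tiledShaderCode = `
  struct MatrixInfo { widthA: u32, heightA: u32, widthB: u32, heightB: u32, }

  @group(0) @binding(0) var<storage, read> matrixA : array<f32>;
  @group(0) @binding(1) var<storage, read> matrixB : array<f32>;
  @group(0) @binding(2) var<storage, read_write> matrixOut : array<f32>;
  @group(0) @binding(3) var<uniform> info : MatrixInfo;

  const TILE : u32 = 16u;
  var<workgroup> tileA : array<array<f32, 16>, 16>;
  var<workgroup> tileB : array<array<f32, 16>, 16>;

  @compute @workgroup_size(16, 16)
  fn main(
    @builtin(global_invocation_id) global_id : vec3<u32>,
    @builtin(local_invocation_id) local_id : vec3<u32>
  ) {
    let row = global_id.y;
    let col = global_id.x;
    let numTiles = (info.widthA + TILE - 1u) / TILE;
    var sum : f32 = 0.0;

    for (var t : u32 = 0u; t < numTiles; t = t + 1u) {
      let aCol = t * TILE + local_id.x;
      let bRow = t * TILE + local_id.y;

      // Each thread loads one element of each tile; pad with zeros at the edges.
      if (row < info.heightA && aCol < info.widthA) {
        tileA[local_id.y][local_id.x] = matrixA[row * info.widthA + aCol];
      } else {
        tileA[local_id.y][local_id.x] = 0.0;
      }
      if (bRow < info.widthA && col < info.widthB) {
        tileB[local_id.y][local_id.x] = matrixB[bRow * info.widthB + col];
      } else {
        tileB[local_id.y][local_id.x] = 0.0;
      }
      workgroupBarrier(); // wait until both tiles are fully loaded

      for (var k : u32 = 0u; k < TILE; k = k + 1u) {
        sum = sum + tileA[local_id.y][k] * tileB[k][local_id.x];
      }
      workgroupBarrier(); // wait before the next iteration overwrites the tiles
    }

    // No early return above: the bounds check happens only at the final write.
    if (row < info.heightA && col < info.widthB) {
      matrixOut[row * info.widthB + col] = sum;
    }
  }
`;

// Global loads issued by one thread for an inner dimension k.
// tile = 1 models the naive shader (two loads per multiply-accumulate).
function globalLoadsPerThread(k, tile) {
  return 2 * Math.ceil(k / tile);
}

console.log(globalLoadsPerThread(512, 1));  // 1024
console.log(globalLoadsPerThread(512, 16)); // 64
```

Note that the bounds check moved from an early return to the final write. `workgroupBarrier()` must be reached by all threads of a workgroup, so a thread outside the matrix still takes part in loading and synchronizing and only skips the store.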
🏁 5. Conclusion
Writing custom compute shaders in WGSL allows web developers to unlock raw GPU power directly in the browser tab. By moving from graphics hacks (like rendering 2D fragments to trigger WebGL calculations) to pure Compute pipelines with WebGPU, you gain explicit control over workgroups, shared memory, and buffer layout, which is what makes optimizations such as tiling possible.
