Writing high-performance matrix multiplication kernels for Blackwell — JAX documentation
Writing high-performance matrix multiplication kernels for Blackwell
In this guide, we’ll progressively iterate on a matrix multiplication kernel.
The first implementation will be very simple, but also quite slow.
However, in just a few simple steps it can be modified into a state-of-the-art
kernel, matching or exceeding highly optimized implementations such as cuBLAS
and CUTLASS.
Warning
The utilization shown in the table below might be different than what you see online,
but the differences c...