Highly efficient matrix transpose in Mojo 🔥
06 Jun, 2025
In this blog post I will show you, step by step, how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo.
The best kernel achieves a bandwidth of 2775.49 GB/s, i.e. 84.1056%. The optimisations are the same ones I applied to achieve a bandwidth of 2771.35 GB/s using pure CUDA on the same H100 that I use here. This shows that Mojo can achieve CUDA-like performance on exactly the same task. You may compare the kernels with the previous kernels I wrote ...
Read more at veitner.bearblog.dev