Mojo Achieves 84% Bandwidth Efficiency in Matrix Transpose on Hopper Architecture, Rivaling CUDA Performance

Highly efficient matrix transpose in Mojo 🔥

06 Jun, 2025

In this blog post I will show you step by step how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo. The best kernel achieves a bandwidth of 2775.49 GB/s, i.e. a bandwidth efficiency of 84.1056%. The optimisations are the same ones I applied to achieve a bandwidth of 2771.35 GB/s using pure CUDA on the same H100 that I use here. This shows that Mojo can achieve CUDA-like performance on exactly the same task. You may compare the kernels with the previous kernels I wrote ...
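To make the quoted numbers concrete, here is a minimal Python sketch of how such figures are typically derived. It assumes that the percentage is simply achieved bandwidth divided by a peak reference bandwidth, and that a transpose moves every element twice (one read, one write). Under that assumption the two quoted numbers imply a peak reference of roughly 3300 GB/s. The helper names and the matrix shape in the usage note are hypothetical and not taken from the post.

```python
# Figures quoted in the post.
MOJO_GBPS = 2775.49        # best Mojo kernel
CUDA_GBPS = 2771.35        # best pure CUDA kernel on the same H100
MOJO_EFFICIENCY = 0.841056 # 84.1056%

# Assumption: efficiency = achieved / peak, so the peak reference
# bandwidth follows from the two quoted numbers.
implied_peak_gbps = MOJO_GBPS / MOJO_EFFICIENCY


def effective_bandwidth_gbps(rows, cols, bytes_per_element, seconds):
    """Effective bandwidth of a transpose in GB/s.

    Assumption: each element is read once and written once,
    so the kernel moves 2 * rows * cols * bytes_per_element bytes.
    """
    bytes_moved = 2 * rows * cols * bytes_per_element
    return bytes_moved / seconds / 1e9


def efficiency(achieved_gbps, peak_gbps):
    """Fraction of the peak reference bandwidth that was achieved."""
    return achieved_gbps / peak_gbps


if __name__ == "__main__":
    print(f"implied peak:    {implied_peak_gbps:.1f} GB/s")
    print(f"Mojo efficiency: {efficiency(MOJO_GBPS, implied_peak_gbps):.4%}")
    print(f"CUDA efficiency: {efficiency(CUDA_GBPS, implied_peak_gbps):.4%}")
    print(f"Mojo vs CUDA:    {MOJO_GBPS / CUDA_GBPS - 1:+.3%}")
```

With a measured kernel time in hand, `effective_bandwidth_gbps(rows, cols, 4, seconds)` would give the GB/s figure for a hypothetical 4-byte element type. On the quoted numbers, the Mojo and CUDA results differ by only about 0.15%.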