News Score: Score the News, Sort the News, Rewrite the Headlines

Highly efficient matrix transpose in Mojo 🔥

06 Jun, 2025

In this blog post I will show you step by step how to implement a highly efficient transpose kernel for the Hopper architecture using Mojo. The best kernel achieves a bandwidth of 2775.49 GB/s, i.e. 84.1056% of peak. The optimisations are the same ones I applied to achieve a bandwidth of 2771.35 GB/s using pure CUDA on the same H100 that I use here. That shows that Mojo can achieve CUDA-like performance on exactly the same task. You may compare the kernels with the previous kernels I wrote ...
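
As a quick sanity check on the figures quoted above, the percentage can be related back to the measured bandwidth. The sketch below assumes the percentage is relative to a peak memory bandwidth and infers that peak from the two numbers; the peak value and the helper name `utilisation` are assumptions for illustration, not something the excerpt states.

```python
# Figures quoted in the excerpt.
mojo_gbps = 2775.49        # best Mojo kernel, GB/s
cuda_gbps = 2771.35        # pure CUDA kernel on the same H100, GB/s
quoted_fraction = 0.841056  # 84.1056%

# Assumption: the quoted percentage is achieved / peak, so the peak
# can be inferred by dividing the achieved bandwidth by the fraction.
implied_peak_gbps = mojo_gbps / quoted_fraction


def utilisation(achieved_gbps: float, peak_gbps: float) -> float:
    """Return achieved bandwidth as a percentage of the given peak."""
    return 100.0 * achieved_gbps / peak_gbps


print(f"implied peak: {implied_peak_gbps:.1f} GB/s")
print(f"CUDA utilisation: {utilisation(cuda_gbps, implied_peak_gbps):.2f}%")
print(f"Mojo vs CUDA: {mojo_gbps / cuda_gbps:.4f}x")
```

Under that assumption the two kernels land within a fraction of a percent of each other, which is the sense in which the excerpt calls the Mojo result CUDA-like.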

Read more at veitner.bearblog.dev
