Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS
Introduction
Hi everyone!
In this post, I will share all the steps to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done iteratively across 8 different kernels.
Figure 1: sneak peek of the performance results
I primarily intended to work on this to deepen my understanding of RDNA3 and try out HIP, and I felt like I needed to share what I learned do...
Read more at seb-v.github.io