AMD RDNA3 GPU Matrix Multiplication Optimization Yields 50 TFlops, Outperforms rocBLAS by 60%

Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS

Introduction

Hi everyone! In this post, I will walk through all the steps to write an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain all the optimizations I have implemented. This will be done iteratively, across 8 different kernels.

Figure 1: sneak peek of the performance results

I primarily intended to work on this to deepen my understanding of RDNA3 and try out HIP, and I felt like I needed to share what I learned do...