News Score: Score the News, Sort the News, Rewrite the Headlines

Optimizing Matrix Multiplication on RDNA3: 50 TFlops and 60% Faster Than rocBLAS

Introduction

Hi everyone! In this post, I will walk through all the steps for writing an optimized FP32 matrix multiplication on an AMD RDNA3 GPU that outperforms rocBLAS by 60%. I will cover some basics and explain every optimization I implemented, proceeding iteratively through 8 different kernels.

Figure 1: Sneak peek of the performance results

I primarily intended to work on this to deepen my understanding of RDNA3 and try out HIP, and I felt like I needed to share what I learned do...

Read more at seb-v.github.io
