"High-Performance Matrix Multiplication C Code Outperforms NumPy; Optimized for AMD Ryzen 7700, Achieves Over 1 TFLOPS in Wide Matrix Range"

Beating NumPy’s matrix multiplication in 150 lines of C code

TL;DR: The code from the tutorial is available at matmul.c. This blog post is the result of my attempt to implement high-performance matrix multiplication on a CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (backed by OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes. By efficiently parallelizing the code wit...
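To make the TFLOPS figure concrete, here is a minimal sketch of the unoptimized baseline and of how throughput is usually counted for matrix multiplication. It is not the code from matmul.c: the function names `matmul_naive` and `matmul_gflops` are hypothetical, and the row-major layout is an assumption made for illustration.

```c
#include <stddef.h>

/* Reference kernel (hypothetical helper, not taken from matmul.c):
   C = A * B, with A (m x k), B (k x n), C (m x n).
   Assumes all matrices are row-major, contiguous float arrays. */
void matmul_naive(const float *a, const float *b, float *c,
                  size_t m, size_t n, size_t k) {
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/* Each (i, j, p) triple costs one multiply and one add,
   so a full product performs 2 * m * n * k floating-point operations. */
double matmul_gflops(size_t m, size_t n, size_t k, double seconds) {
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

Under this counting convention, 1 TFLOPS is 1000 GFLOPS. A product with m = n = k = 4096 needs 2 * 4096^3, or roughly 1.37e11, operations, so it would have to finish in about 0.137 seconds to sustain that rate.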