"GPU Engineer Optimizes CUDA Matrix Multiplication: Achieves 93.7% of cuBLAS Performance Through Iterative Improvements"

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog

In this post, I’ll iteratively optimize an implementation of matrix multiplication written in CUDA. My goal is not to build a cuBLAS replacement, but to deeply understand the most important performance characteristics of the GPUs that are used for modern deep learning. This includes coalescing global memory accesses, shared memory caching, and occupancy optimizations, among others. You can download the code for all kernels from GitHub. Also check out wangzyon’s repo from which I copied the benchmar...