GPU Expert Reveals Step-by-Step Guide to Optimize Matrix Multiplication with Tensor Cores on NVIDIA Tesla T4

How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores

Introduction
Background
The memory wall
Roofline charts
Rooflines for the NVIDIA Tesla T4
Tensor Core vs. FFMA
Shared memory vs. L2 cache vs. global memory
Theoretical arithmetic intensity
Matrix Multiplication vs Matrix Addition
Achievable arithmetic intensity on a simple computer
  worst case
  best case
  realistic case
In Summary
Parallelized matrix multiplication on a GPU
Hierarchical Tiling (simple gpu)
Hierarchical Tiling (real gpu)
Performance considerations on a real GPU
Arithmetic intensity ...