How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores
Introduction
Background
The memory wall
Roofline charts
Rooflines for the NVIDIA Tesla T4
Tensor Core vs. FFMA
Shared memory vs. L2 cache vs. global memory
Theoretical arithmetic intensity
Matrix Multiplication vs. Matrix Addition
Achievable arithmetic intensity on a simple computer
worst case
best case
realistic case
In Summary
Parallelized matrix multiplication on a GPU
Hierarchical Tiling (simple GPU)
Hierarchical Tiling (real GPU)
Performance considerations on a real GPU
Arithmetic intensity ...
Read more at alexarmbr.github.io