News Score: Score the News, Sort the News, Rewrite the Headlines

How To Write A Fast Matrix Multiplication From Scratch With Tensor Cores

Introduction; Background; The memory wall; Roofline charts; Rooflines for the NVIDIA Tesla T4; Tensor Core vs. FFMA; Shared memory vs. L2 cache vs. global memory; Theoretical arithmetic intensity; Matrix Multiplication vs Matrix Addition; Achievable arithmetic intensity on a simple computer; worst case; best case; realistic case; In Summary; Parallelized matrix multiplication on a GPU; Hierarchical Tiling (simple gpu); Hierarchical Tiling (real gpu); Performance considerations on a real GPU; Arithmetic intensity ...

Read more at alexarmbr.github.io
