A Gentle Introduction to CUDA PTX | Philip Fabianek
Introduction
As a CUDA developer, you might not interact with Parallel Thread Execution (PTX) every day, but it is the intermediate instruction set that sits between your CUDA code and the hardware. Understanding it is essential for deep performance analysis and for accessing the latest hardware features, sometimes long before they are exposed in C++. For example, the wgmma instructions, which perform warpgroup-level matrix multiply-accumulate operations and are used in some of the fastest GEMM kernels, are available only throu...
Read more at philipfabianek.com