flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
A minimal re-implementation of Flash Attention with CUDA and PyTorch.
The official implementation can be quite daunting for a CUDA beginner
(like myself), so this repo tries to be small and educational.
The entire forward pass is written in ~100 lines in flash.cu.
The variable names follow the notation of the original paper.
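For orientation, here is a minimal PyTorch sketch of the tiled, online-softmax forward pass that flash.cu implements in CUDA. The block size, tensor layout, and function name are illustrative only; the kernel's actual tiling is dictated by its shared-memory budget, so treat this as a reference of the algorithm rather than the repo's implementation.

```python
import math
import torch

def flash_forward_reference(Q, K, V, block_size=32):
    """Tiled attention forward with online softmax, looping over K/V blocks.

    Q, K, V: (batch, heads, seq_len, head_dim) tensors. Names and block
    size are illustrative, not taken from flash.cu.
    """
    B, H, N, d = Q.shape
    scale = 1.0 / math.sqrt(d)

    O = torch.zeros_like(Q)                                        # unnormalized output accumulator
    l = torch.zeros(B, H, N, 1, device=Q.device)                   # running softmax denominator
    m = torch.full((B, H, N, 1), float('-inf'), device=Q.device)   # running row max

    for start in range(0, N, block_size):
        Kj = K[:, :, start:start + block_size]                     # current K block
        Vj = V[:, :, start:start + block_size]                     # current V block

        S = (Q @ Kj.transpose(-2, -1)) * scale                     # scores against this block
        m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)

        # Rescale the old accumulators to the new running max, then fold in this block.
        P = torch.exp(S - m_new)
        alpha = torch.exp(m - m_new)
        l = alpha * l + P.sum(dim=-1, keepdim=True)
        O = alpha * O + P @ Vj
        m = m_new

    return O / l                                                   # normalize at the end
```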
Usage
Prerequisites
PyTorch (with CUDA)
Ninja, for JIT-compiling and loading the C++/CUDA extension (see the sketch below)
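As a rough sketch of how a CUDA source like flash.cu can be compiled and loaded at runtime with torch.utils.cpp_extension.load (this is where Ninja comes in): the module name, any extra binding sources, and the forward(q, k, v) entry point below are assumptions, not necessarily what the repo uses.

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile the kernel; Ninja drives the build. The name and source list
# are assumptions for illustration.
minimal_attn = load(
    name='minimal_attn',
    sources=['flash.cu'],        # plus any C++ binding file the repo provides
    extra_cuda_cflags=['-O2'],
)

q = torch.randn(16, 12, 64, 64, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)
out = minimal_attn.forward(q, k, v)  # assumes a forward(Q, K, V) binding
```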
Benchmark
Compare the wall-clock time between manual attention and minimal flash attention.
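A hedged sketch of what such a comparison might look like: it times a naive attention that materializes the full N x N score matrix against the loaded kernel. This is not the repo's benchmark script; minimal_attn and its forward() binding are carried over from the loading sketch above.

```python
import math
import torch

def manual_attn(q, k, v):
    # Naive attention: materializes the full (N x N) score matrix.
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    return torch.softmax(att, dim=-1) @ v

def time_cuda(fn, *args, iters=10):
    # Average wall-clock time in milliseconds using CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(*args)                      # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

q = torch.randn(16, 12, 64, 64, device='cuda')
k, v = torch.randn_like(q), torch.randn_like(q)

print('manual attention:', time_cuda(manual_attn, q, k, v), 'ms')
# minimal_attn: the module loaded in the previous sketch (an assumption).
print('minimal flash   :', time_cuda(minimal_attn.forward, q, k, v), 'ms')
```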