flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
A minimal re-implementation of Flash Attention with CUDA and PyTorch.
The official implementation can be quite daunting for a CUDA beginner
(like myself), so this repo tries to be small and educational.
The entire forward pass is written in ~100 lines in flash.cu.
The variable names follow the notation of the original paper.
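For orientation, here is a minimal PyTorch sketch of the tiled, online-softmax forward pass that flash.cu implements in CUDA. The block size, tensor layout, and function name are illustrative only; the kernel's actual tiling is dictated by its shared-memory budget, so treat this as a reference of the algorithm rather than the repo's implementation.

```python
import math
import torch

def flash_forward_reference(Q, K, V, block_size=32):
    """Tiled attention forward with online softmax, looping over K/V blocks.

    Q, K, V: (batch, heads, seq_len, head_dim) tensors. Names and block
    size are illustrative, not taken from flash.cu.
    """
    B, H, N, d = Q.shape
    scale = 1.0 / math.sqrt(d)

    O = torch.zeros_like(Q)                                        # unnormalized output accumulator
    l = torch.zeros(B, H, N, 1, device=Q.device)                   # running softmax denominator
    m = torch.full((B, H, N, 1), float('-inf'), device=Q.device)   # running row max

    for start in range(0, N, block_size):
        Kj = K[:, :, start:start + block_size]                     # current K block
        Vj = V[:, :, start:start + block_size]                     # current V block

        S = (Q @ Kj.transpose(-2, -1)) * scale                     # scores against this block
        m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)

        # Rescale the old accumulators to the new running max, then fold in this block.
        P = torch.exp(S - m_new)
        alpha = torch.exp(m - m_new)
        l = alpha * l + P.sum(dim=-1, keepdim=True)
        O = alpha * O + P @ Vj
        m = m_new

    return O / l                                                   # normalize at the end
```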
Usage
Prerequisites
PyTorch (with CUDA)
Ninja, for JIT-compiling and loading the C++/CUDA extension (see the sketch below)
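As a rough sketch of how a CUDA source like flash.cu can be compiled and loaded at runtime with torch.utils.cpp_extension.load (this is where Ninja comes in): the module name, any extra binding sources, and the forward(q, k, v) entry point below are assumptions, not necessarily what the repo uses.

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile the kernel; Ninja drives the build. The name and source list
# are assumptions for illustration.
minimal_attn = load(
    name='minimal_attn',
    sources=['flash.cu'],        # plus any C++ binding file the repo provides
    extra_cuda_cflags=['-O2'],
)

q = torch.randn(16, 12, 64, 64, device='cuda')
k = torch.randn_like(q)
v = torch.randn_like(q)
out = minimal_attn.forward(q, k, v)  # assumes a forward(Q, K, V) binding
```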
Benchmark
Compare the wall-clock time between manual attention and minimal flash attention.
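A hedged sketch of what such a comparison might look like: it times a naive attention that materializes the full N x N score matrix against the loaded kernel. This is not the repo's benchmark script; minimal_attn and its forward() binding are carried over from the loading sketch above.

```python
import math
import torch

def manual_attn(q, k, v):
    # Naive attention: materializes the full (N x N) score matrix.
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    return torch.softmax(att, dim=-1) @ v

def time_cuda(fn, *args, iters=10):
    # Average wall-clock time in milliseconds using CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    fn(*args)                      # warm-up
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

q = torch.randn(16, 12, 64, 64, device='cuda')
k, v = torch.randn_like(q), torch.randn_like(q)

print('manual attention:', time_cuda(manual_attn, q, k, v), 'ms')
# minimal_attn: the module loaded in the previous sketch (an assumption).
print('minimal flash   :', time_cuda(minimal_attn.forward, q, k, v), 'ms')
```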