News Score: Score the News, Sort the News, Rewrite the Headlines

Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B

There are some applications that benefit from running LLMs really, really fast. This low-latency regime encompasses applications like chatbots and human-in-the-loop workflows, where users care a lot about seeing responses come back immediately. Given the importance of these low-latency workloads, we wanted to explore just how fast we can run open-source models on modern GPUs. To really stress-test existing systems, we consider an aggressive low-latency scenario where we generate a single sequenc...

Read more at hazyresearch.stanford.edu
