Deep learning performance optimization requires understanding three bottlenecks—compute, memory bandwidth, overhead—with focus on maximizing GPU compute utilization as FLOPS growth outpaces memory bandwidth improvements.

Making Deep Learning go Brrrr From First Principles

So, you want to improve the performance of your deep learning model. How might you approach such a task? Often, folk fall back to a grab-bag of tricks that might've worked before or saw on a tweet. "Use in-place operations! Set gradients to None! Install PyTorch 1.10.0 but not 1.10.1!" It's understandable why users often take such an ad-hoc approach performance on modern systems (particularly deep learning) often feels as much like alchemy as it does science. That being said, reasoning from firs...