Long-Sequence Attention with ⚡FlashAttention⚡

Angelina Yang
Jan 30, 2023

The new year opens with lots of discussion of interesting new papers, following the high tide of ChatGPT. The introduction of FlashAttention is among the best of them. The main problem it addresses is an important one for Transformer architectures: speeding up self-attention and reducing its memory consumption.

Why is it interesting?

One way to recognize a good paper or a new method is to see how quickly it is adopted and adapted by the open-source community and by industry.

According to the paper's author, Tri Dao, it had already been implemented in NVIDIA's Megatron-LM and OpenAI's Triton as of two weeks ago. The implementation is planned to land in the upcoming PyTorch 2.0 release in March, and more collaborators are joining, including Hugging Face (integrated in Diffusers), Microsoft's DeepSpeed, Meta's AITemplate, and so on.

What is FlashAttention?

In short,

FlashAttention is a fast and memory-efficient algorithm to compute exact attention. It speeds up model training and reduces memory requirements.
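To give a concrete sense of what calling it looks like, here is a minimal sketch using the scaled_dot_product_attention function exposed in the PyTorch 2.0 release mentioned above. It assumes a CUDA GPU with half-precision support, and the backend-selection context manager shown here is the PyTorch 2.0 API, which may change in later versions; treat the snippet as illustrative rather than canonical.

```python
import torch
import torch.nn.functional as F

# Toy tensors: batch of 4 sequences, 8 heads, 1024 tokens, head dim 64.
# The flash backend expects half precision on a CUDA device.
batch, heads, seq_len, head_dim = 4, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# PyTorch 2.0 can route this call to a fused FlashAttention kernel when the
# shapes and dtypes allow it; the result is exact attention, not an approximation.
with torch.backends.cuda.sdp_kernel(enable_flash=True,
                                    enable_math=False,
                                    enable_mem_efficient=False):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # torch.Size([4, 8, 1024, 64])
```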

The motivation for this is as follows:

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length.
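To see where that quadratic cost comes from, here is a plain (non-Flash) attention sketch of my own: the intermediate score matrix has one entry per pair of positions, so its size grows as the square of the sequence length.

```python
import math
import torch

def naive_attention(q, k, v):
    """Standard attention: materializes the full (seq_len x seq_len) score matrix."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # shape (..., seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)          # the quadratic memory lives here
    return weights @ v

# Doubling the sequence length quadruples the score-matrix memory:
for seq_len in (1024, 2048, 4096):
    print(f"seq_len={seq_len:5d} -> score matrix holds {seq_len * seq_len:,} entries")
```

FlashAttention avoids materializing that full matrix by computing attention in blocks, which is why it can handle long sequences with far less memory.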
