“Reversible Residual Layers” For Transformer’s Memory Issues?
There are plenty of in-depth explanations of Reformers and reversible residual layers elsewhere; here I'd like to share some example questions in an interview setting.
How do “Reversible Residual Layers” solve the memory issues of training Transformers?
Here are some example answers for readers’ reference:
When training large, deep models, you will often run out of memory because each layer allocates memory to store its activations for use in backpropagation. To save this resource, you need to be able to recompute those activations during the backward pass instead of storing them during the forward pass. See the left diagram above.
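To get a sense of the scale of the problem, here is a rough back-of-the-envelope calculation in Python. The batch size, sequence length, model width, and layer count are illustrative assumptions rather than numbers from the article.

```python
# Rough estimate of activation memory for a deep Transformer.
# All numbers below are illustrative assumptions.
batch_size = 8
seq_len = 4096       # tokens per sequence
d_model = 1024       # hidden width
n_layers = 24        # Transformer layers
bytes_per_float = 4  # float32

# One residual-stream activation per layer kept for backprop
# (real models store several tensors per layer, so this is a lower bound).
per_layer = batch_size * seq_len * d_model * bytes_per_float
total = per_layer * n_layers

print(f"per layer:  {per_layer / 1e9:.2f} GB")  # ~0.13 GB
print(f"all layers: {total / 1e9:.2f} GB")      # ~3.2 GB, growing linearly with depth
```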
The left diagram shows how residual connections are implemented in the standard Transformer. Given that F() is Attention and G() is Feed-Forward (FF), the standard block computes

y_a = x + F(x)
y_b = y_a + G(y_a)

This requires that x and y_a be saved so they can be used during backpropagation. We want to avoid this to conserve memory, and this is where reversible residual connections come in. They are shown in the middle and rightmost diagrams above. The key idea is that we…
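As a companion to the diagrams, here is a minimal NumPy sketch of the two residual schemes. The toy F and G functions, the shapes, and the function names are my own illustrative choices, not code from the article or from any Reformer library; the point is only that the reversible block can reconstruct its inputs exactly from its outputs, so they do not need to be stored.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16  # toy model width

# Stand-ins for the real sublayers: F() plays the role of Attention and
# G() the role of the Feed-Forward network. Any deterministic functions
# work for demonstrating reversibility.
W_f = rng.normal(size=(d, d)) / np.sqrt(d)
W_g = rng.normal(size=(d, d)) / np.sqrt(d)

def F(x):
    return np.tanh(x @ W_f)

def G(x):
    return np.tanh(x @ W_g)

# --- Standard residual block (left diagram) ---------------------------
# y_a = x + F(x); y_b = y_a + G(y_a)
# Backprop needs x and y_a, so both must be kept in memory.
def standard_block(x):
    y_a = x + F(x)
    y_b = y_a + G(y_a)
    return y_b, (x, y_a)  # saved activations

# --- Reversible residual block (middle/right diagrams) ----------------
# The input is split into two streams (x1, x2):
#   y1 = x1 + F(x2)
#   y2 = x2 + G(y1)
# Nothing needs to be saved, because (x1, x2) can be recomputed
# exactly from (y1, y2) during the backward pass.
def reversible_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def reversible_inverse(y1, y2):
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

x1 = rng.normal(size=(4, d))
x2 = rng.normal(size=(4, d))
y1, y2 = reversible_forward(x1, x2)
r1, r2 = reversible_inverse(y1, y2)
print(np.allclose(x1, r1), np.allclose(x2, r2))  # True True
```

In practice, this is what lets reversible layers trade a little extra computation in the backward pass for activation memory that no longer grows with the number of layers.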