(nvidia-smi output, truncated)

We get the same number as before, and you can also see that we are using a V100 GPU with 16GB of memory. So now we can start training the model and see how the GPU memory consumption changes. First, we set up a few standard training arguments that we will use across all our experiments.

We see that already a relatively small batch size almost fills up our GPU's entire memory. However, a larger batch size can often result in faster model convergence or better end performance, so ideally we want to tune the batch size to our model's needs and not to the GPU's limitations. What's interesting is that we use much more memory than the size of the model. To understand a bit better why this is the case, let's have a look at a model's operations and memory needs.

Anatomy of Model's Operations

The Transformers architecture includes 3 main groups of operations, grouped below by compute intensity:

1. Linear layers and components of Multi-Head Attention all do batched matrix-matrix multiplications. These operations are the most compute-intensive part of training a transformer.
2. Softmax and layer normalization are less compute-intensive than tensor contractions, and involve one or more reduction operations, the result of which is then applied via a map.
3. These are the remaining operators: biases, dropout, activations, and residual connections. These are the least compute-intensive operations.

This knowledge can be helpful to know when analyzing performance bottlenecks. This summary is derived from Data Movement Is All You Need: A Case Study on Optimizing Transformers (2020).

Anatomy of Model's Memory

We've seen that training the model uses much more memory than just putting the model on the GPU. This is because there are many components during training that use GPU memory. The components on GPU memory are the following:

1. model weights
2. optimizer states
3. gradients
4. forward activations saved for gradient computation
5. temporary buffers
6. functionality-specific memory

A typical model trained in mixed precision with AdamW requires 18 bytes per model parameter plus activation memory. For inference there are no optimizer states and gradients, so we can subtract those, and thus we end up with 6 bytes per model parameter for mixed precision inference, plus activation memory.

Model weights:
- 4 bytes * number of parameters for fp32 training
- 6 bytes * number of parameters for mixed precision training (maintains a model in fp32 and one in fp16 in memory)

Optimizer states:
- 8 bytes * number of parameters for normal AdamW (maintains 2 states)
- 2 bytes * number of parameters for 8-bit AdamW optimizers like bitsandbytes
- 4 bytes * number of parameters for optimizers like SGD with momentum (maintains only 1 state)

Gradients:
- 4 bytes * number of parameters for either fp32 or mixed precision training (gradients are always kept in fp32)

Forward activations:
- size depends on many factors, the key ones being sequence length, hidden size and batch size. There are the input and output that are being passed and returned by the forward and the backward functions, and the forward activations saved for gradient computation.

Temporary buffers: additionally there are all kinds of temporary variables which get released once the calculation is done, but in the moment these could require additional memory and could push to OOM. Therefore when coding it's crucial to think strategically about such temporary variables and sometimes to explicitly free them as soon as they are no longer needed.

Functionality-specific memory: your software could also have special memory needs. For example, when generating text using beam search, the software needs to maintain multiple copies of inputs and outputs.

For convolutions and linear layers there are 2x flops in the backward pass compared to the forward, which generally translates into ~2x slower execution (sometimes more, because sizes in the backward tend to be more awkward). Activations are usually bandwidth-limited, and it's typical for an activation to have to read more data in the backward than in the forward (e.g. the activation forward reads once and writes once, while the activation backward reads twice, gradOutput and the output of the forward, and writes once, gradInput).

So there are potentially a few places where we could save GPU memory or speed up operations. Let's start with a simple optimization: choosing the right batch size. One gets the most efficient performance when batch sizes and input/output neuron counts are divisible by a certain number, which typically starts at 8 but can be much higher as well.
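The per-parameter byte counts discussed above can be turned into a quick back-of-the-envelope estimator. This is an illustrative sketch (the function name and structure are my own, not from any library), covering only model state, i.e. excluding activation memory, which depends on sequence length, hidden size and batch size:

```python
def model_state_bytes(num_params: int, training: bool = True) -> int:
    """Rough model-state memory for mixed-precision AdamW training.

    Training: 4 (fp32 weights) + 2 (fp16 weights) + 8 (AdamW: 2 fp32 states)
              + 4 (fp32 gradients) = 18 bytes per parameter.
    Inference: only the 6 bytes of weights remain.
    Activation memory is NOT included.
    """
    weights = 4 + 2            # fp32 master copy + fp16 working copy
    if not training:
        return num_params * weights
    optimizer = 8              # AdamW keeps two fp32 states per parameter
    gradients = 4              # gradients are always kept in fp32
    return num_params * (weights + optimizer + gradients)


# A 1B-parameter model needs roughly 18 GB for model state alone:
print(model_state_bytes(1_000_000_000) / 2**30)
```

Swapping in an 8-bit AdamW (2 bytes) or SGD with momentum (4 bytes) for the `optimizer` term reproduces the other totals listed above.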
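The divisibility advice for batch sizes and neuron counts can be sketched as a small padding helper. This is a hypothetical utility of my own, not library code; 8 is the usual starting multiple for fp16 Tensor Core math, but the ideal multiple can be higher depending on hardware:

```python
def round_up(n: int, multiple: int = 8) -> int:
    """Round n up to the nearest multiple (e.g. to pad a dimension
    so matrix shapes stay divisible by 8 for Tensor Cores)."""
    return ((n + multiple - 1) // multiple) * multiple


# e.g. padding an awkward vocabulary size before sizing the output layer:
print(round_up(50257))   # 50264
```

The same helper applies to batch sizes: padding a batch of 30 samples to 32 often runs faster than the "smaller" batch.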