<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Training Optimization | Mahyar's world 🌏</title><link>https://mahyar-osanlouy.com/tag/training-optimization/</link><atom:link href="https://mahyar-osanlouy.com/tag/training-optimization/index.xml" rel="self" type="application/rss+xml"/><description>Training Optimization</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 06 Jan 2025 00:00:00 +0000</lastBuildDate><image><url>https://mahyar-osanlouy.com/media/icon_hu35e4e9c9135f02752aab27d124db531b_75212_512x512_fill_lanczos_center_3.png</url><title>Training Optimization</title><link>https://mahyar-osanlouy.com/tag/training-optimization/</link></image><item><title>Supercharge Your PyTorch Training with Gradient Accumulation</title><link>https://mahyar-osanlouy.com/post/gradient-accumulation-pytorch/</link><pubDate>Mon, 06 Jan 2025 00:00:00 +0000</pubDate><guid>https://mahyar-osanlouy.com/post/gradient-accumulation-pytorch/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When training large deep learning models, you often face a fundamental limitation: GPU memory. Larger batch sizes generally lead to more stable training and sometimes better convergence, but what if your GPU simply can&amp;rsquo;t handle the memory requirements of your ideal batch size?&lt;/p>
&lt;p>Enter gradient accumulation - a simple yet powerful technique that allows you to effectively increase your batch size without increasing memory usage. In this post, I&amp;rsquo;ll show you how to implement this technique in PyTorch and explain why it might be exactly what your training pipeline needs.&lt;/p>
&lt;h2 id="what-is-gradient-accumulation">What is Gradient Accumulation?&lt;/h2>
&lt;p>Gradient accumulation is a technique where you:&lt;/p>
&lt;ul>
&lt;li>Process smaller mini-batches sequentially&lt;/li>
&lt;li>Accumulate (add up) their gradients&lt;/li>
&lt;li>Update your model weights only after processing several mini-batches&lt;/li>
&lt;/ul>
&lt;p>This simulates training on a larger batch size without the memory requirements of loading that entire batch at once. It&amp;rsquo;s particularly useful when:&lt;/p>
&lt;ul>
&lt;li>You&amp;rsquo;re training very large models&lt;/li>
&lt;li>You&amp;rsquo;re working with limited GPU resources&lt;/li>
&lt;li>You need the stability of larger batch sizes&lt;/li>
&lt;/ul>
&lt;h2 id="implementing-gradient-accumulation-in-pytorch">Implementing Gradient Accumulation in PyTorch&lt;/h2>
&lt;p>The implementation is surprisingly straightforward. Here&amp;rsquo;s a complete working example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> torch
&lt;span style="color:#f92672">import&lt;/span> torch.nn &lt;span style="color:#66d9ef">as&lt;/span> nn
&lt;span style="color:#f92672">import&lt;/span> torch.optim &lt;span style="color:#66d9ef">as&lt;/span> optim
&lt;span style="color:#f92672">from&lt;/span> torch.utils.data &lt;span style="color:#f92672">import&lt;/span> DataLoader, TensorDataset
&lt;span style="color:#75715e"># Create a simple dataset&lt;/span>
features, targets &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>randn(&lt;span style="color:#ae81ff">1200&lt;/span>, &lt;span style="color:#ae81ff">8&lt;/span>), torch&lt;span style="color:#f92672">.&lt;/span>randn(&lt;span style="color:#ae81ff">1200&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>)
dataset &lt;span style="color:#f92672">=&lt;/span> TensorDataset(features, targets)
data_loader &lt;span style="color:#f92672">=&lt;/span> DataLoader(dataset, batch_size&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">40&lt;/span>, shuffle&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;span style="color:#75715e"># Define a basic neural network&lt;/span>
model &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Sequential(
    nn&lt;span style="color:#f92672">.&lt;/span>Linear(&lt;span style="color:#ae81ff">8&lt;/span>, &lt;span style="color:#ae81ff">16&lt;/span>),
    nn&lt;span style="color:#f92672">.&lt;/span>ReLU(),
    nn&lt;span style="color:#f92672">.&lt;/span>Linear(&lt;span style="color:#ae81ff">16&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>)
)
loss_fn &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>MSELoss()
optimizer &lt;span style="color:#f92672">=&lt;/span> optim&lt;span style="color:#f92672">.&lt;/span>SGD(model&lt;span style="color:#f92672">.&lt;/span>parameters(), lr&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.01&lt;/span>)
accumulation_steps &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>
num_epochs &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;span style="color:#66d9ef">for&lt;/span> epoch &lt;span style="color:#f92672">in&lt;/span> range(num_epochs):
&lt;span style="color:#66d9ef">for&lt;/span> batch_idx, (inputs, labels) &lt;span style="color:#f92672">in&lt;/span> enumerate(data_loader):
outputs &lt;span style="color:#f92672">=&lt;/span> model(inputs)
loss &lt;span style="color:#f92672">=&lt;/span> loss_fn(outputs, labels) &lt;span style="color:#f92672">/&lt;/span> accumulation_steps
loss&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;span style="color:#66d9ef">if&lt;/span> (batch_idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">%&lt;/span> accumulation_steps &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
optimizer&lt;span style="color:#f92672">.&lt;/span>step()
optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Epoch &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>epoch &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>num_epochs&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">, Loss: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>loss&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">*&lt;/span> accumulation_steps&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">.4f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
print(&lt;span style="color:#e6db74">&amp;#34;Training finished&amp;#34;&lt;/span>)
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Let&amp;rsquo;s break down the key components:&lt;/p>
&lt;h3 id="1-set-your-accumulation-steps">1. Set your accumulation steps&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">accumulation_steps &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This defines how many mini-batches to process before updating model weights.&lt;/p>
&lt;h3 id="2-adjust-your-loss-calculation">2. Adjust your loss calculation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">loss &lt;span style="color:#f92672">=&lt;/span> loss_fn(outputs, labels) &lt;span style="color:#f92672">/&lt;/span> accumulation_steps
&lt;/code>&lt;/pre>&lt;/div>&lt;p>We divide the loss by the number of accumulation steps to ensure the gradients are properly scaled.&lt;/p>
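&lt;p>To see why this division is the right scaling, here is a quick sanity check (on a small hypothetical model and tensors, separate from the example above) comparing the accumulated scaled gradients against a single full-batch backward pass:&lt;/p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)
full_x, full_y = torch.randn(6, 4), torch.randn(6, 1)
loss_fn = nn.MSELoss()

# Gradient from one backward pass over the full batch of 6
loss_fn(model(full_x), full_y).backward()
full_grad = model.weight.grad.clone()
model.zero_grad()

# Three micro-batches of 2, each loss divided by the step count
accumulation_steps = 3
for i in range(accumulation_steps):
    x, y = full_x[2 * i: 2 * i + 2], full_y[2 * i: 2 * i + 2]
    (loss_fn(model(x), y) / accumulation_steps).backward()

# The accumulated gradient matches the full-batch gradient
print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True
```

&lt;p>Because MSELoss averages over each micro-batch of 2, dividing by 3 turns each contribution into one sixth of the total sum, so the three accumulated gradients add up to exactly the full-batch gradient.&lt;/p>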
&lt;h3 id="3-accumulate-gradients-but-delay-the-optimizer-step">3. Accumulate gradients but delay the optimizer step&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">loss&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Call backward() as usual, but don&amp;rsquo;t immediately call optimizer.step().&lt;/p>
&lt;h3 id="4-update-weights-after-accumulation">4. Update weights after accumulation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#66d9ef">if&lt;/span> (batch_idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">%&lt;/span> accumulation_steps &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
    optimizer&lt;span style="color:#f92672">.&lt;/span>step()
    optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Only after processing accumulation_steps batches do we update the weights and zero the gradients.&lt;/p>
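&lt;p>One subtlety: if the number of mini-batches per epoch is not divisible by accumulation_steps, the gradients from the trailing batches are accumulated but never applied. The full example above happens to work out (30 batches, 3 steps), but a more defensive loop flushes any remainder at the end of each epoch. A sketch, reusing the same setup but with a dataset size deliberately chosen so the batch count (25) does not divide evenly:&lt;/p>

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# 1000 samples / batch_size 40 = 25 batches, which 3 does not divide
dataset = TensorDataset(torch.randn(1000, 8), torch.randn(1000, 1))
data_loader = DataLoader(dataset, batch_size=40, shuffle=True)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 3

for epoch in range(2):
    for batch_idx, (inputs, labels) in enumerate(data_loader):
        loss = loss_fn(model(inputs), labels) / accumulation_steps
        loss.backward()
        if (batch_idx + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
    # Flush gradients from a trailing partial group, if any
    if len(data_loader) % accumulation_steps != 0:
        optimizer.step()
        optimizer.zero_grad()
```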
&lt;h2 id="benefits-of-gradient-accumulation">Benefits of Gradient Accumulation&lt;/h2>
&lt;h3 id="1-train-with-virtually-larger-batch-sizes">1. Train with &amp;ldquo;Virtually&amp;rdquo; Larger Batch Sizes&lt;/h3>
&lt;p>With accumulation_steps set to 3 and a batch_size of 40 (as in our example), you&amp;rsquo;re effectively training with a batch size of 120, while holding only 40 examples in memory at a time.&lt;/p>
&lt;h3 id="2-improved-training-stability">2. Improved Training Stability&lt;/h3>
&lt;p>Larger effective batch sizes often lead to more stable gradients and smoother loss curves, especially for complex models.&lt;/p>
&lt;h3 id="3-better-hardware-utilization">3. Better Hardware Utilization&lt;/h3>
&lt;p>This technique allows you to fully utilize limited GPU resources while still benefiting from large-batch training dynamics.&lt;/p>
&lt;h2 id="practical-considerations">Practical Considerations&lt;/h2>
&lt;p>When implementing gradient accumulation, keep these points in mind:&lt;/p>
&lt;ul>
&lt;li>Batch Normalization: batch norm statistics are computed per mini-batch (40 examples here), not across the accumulated group (120), so normalization and running statistics behave slightly differently than they would with a true large batch. For some applications this can affect performance.&lt;/li>
&lt;li>Learning Rate Scaling: With larger effective batch sizes, you might need to adjust your learning rate. A common heuristic is to scale the learning rate linearly with the effective batch size.&lt;/li>
&lt;li>Mixed Precision Training: Gradient accumulation works well with mixed precision training, giving you even more memory efficiency.&lt;/li>
&lt;/ul>
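&lt;p>On the last point, here is a minimal sketch of combining gradient accumulation with autocast from the torch.amp API. It uses bfloat16 on CPU so it runs anywhere; on CUDA you would use device_type=&amp;#34;cuda&amp;#34; with float16 and pair it with a torch.amp.GradScaler (scaling the loss before backward and stepping through the scaler):&lt;/p>

```python
import torch
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
accumulation_steps = 3

# bfloat16 on CPU needs no GradScaler, which keeps this sketch runnable anywhere
batches = [(torch.randn(40, 8), torch.randn(40, 1)) for _ in range(6)]
for batch_idx, (inputs, labels) in enumerate(batches):
    # Run the forward pass and loss in reduced precision
    with torch.amp.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = loss_fn(model(inputs), labels) / accumulation_steps
    # backward() stays outside the autocast region, as usual
    loss.backward()
    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```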
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Gradient accumulation is one of those techniques that should be in every deep learning practitioner&amp;rsquo;s toolkit. It&amp;rsquo;s easy to implement, has almost no downside, and can dramatically improve your ability to train large models on limited hardware.&lt;/p>
&lt;p>Give the provided code example a try in your next PyTorch project - you might be surprised at how much it improves your training process!&lt;/p>
&lt;p>Happy training! 🚀&lt;/p>
&lt;h2 id="further-resources">Further Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://pytorch.org/docs/stable/notes/amp_examples.html" target="_blank" rel="noopener">Deep Learning with Limited GPU Memory&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.deeplearningbook.org/contents/optimization.html" target="_blank" rel="noopener">Optimization for Deep Learning&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>