<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Deep Learning | Mahyar's world 🌏</title><link>https://mahyar-osanlouy.com/tag/deep-learning/</link><atom:link href="https://mahyar-osanlouy.com/tag/deep-learning/index.xml" rel="self" type="application/rss+xml"/><description>Deep Learning</description><generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Mon, 06 Jan 2025 00:00:00 +0000</lastBuildDate><image><url>https://mahyar-osanlouy.com/media/icon_hu35e4e9c9135f02752aab27d124db531b_75212_512x512_fill_lanczos_center_3.png</url><title>Deep Learning</title><link>https://mahyar-osanlouy.com/tag/deep-learning/</link></image><item><title>Supercharge Your PyTorch Training with Gradient Accumulation</title><link>https://mahyar-osanlouy.com/post/gradient-accumulation-pytorch/</link><pubDate>Mon, 06 Jan 2025 00:00:00 +0000</pubDate><guid>https://mahyar-osanlouy.com/post/gradient-accumulation-pytorch/</guid><description>&lt;h2 id="introduction">Introduction&lt;/h2>
&lt;p>When training large deep learning models, you often face a fundamental limitation: GPU memory. Larger batch sizes generally lead to more stable training and sometimes better convergence, but what if your GPU simply can&amp;rsquo;t handle the memory requirements of your ideal batch size?&lt;/p>
&lt;p>Enter gradient accumulation - a simple yet powerful technique that allows you to effectively increase your batch size without increasing memory usage. In this post, I&amp;rsquo;ll show you how to implement this technique in PyTorch and explain why it might be exactly what your training pipeline needs.&lt;/p>
&lt;h2 id="what-is-gradient-accumulation">What is Gradient Accumulation?&lt;/h2>
&lt;p>Gradient accumulation is a technique where you:&lt;/p>
&lt;ul>
&lt;li>Process smaller mini-batches sequentially&lt;/li>
&lt;li>Accumulate (add up) their gradients&lt;/li>
&lt;li>Update your model weights only after processing several mini-batches&lt;/li>
&lt;/ul>
&lt;p>This simulates training on a larger batch size without the memory requirements of loading that entire batch at once. It&amp;rsquo;s particularly useful when:&lt;/p>
&lt;ul>
&lt;li>You&amp;rsquo;re training very large models&lt;/li>
&lt;li>You&amp;rsquo;re working with limited GPU resources&lt;/li>
&lt;li>You need the stability of larger batch sizes&lt;/li>
&lt;/ul>
&lt;h2 id="implementing-gradient-accumulation-in-pytorch">Implementing Gradient Accumulation in PyTorch&lt;/h2>
&lt;p>The implementation is surprisingly straightforward. Here&amp;rsquo;s a complete working example:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#f92672">import&lt;/span> torch
&lt;span style="color:#f92672">import&lt;/span> torch.nn &lt;span style="color:#66d9ef">as&lt;/span> nn
&lt;span style="color:#f92672">import&lt;/span> torch.optim &lt;span style="color:#66d9ef">as&lt;/span> optim
&lt;span style="color:#f92672">from&lt;/span> torch.utils.data &lt;span style="color:#f92672">import&lt;/span> DataLoader, TensorDataset
&lt;span style="color:#75715e"># Create a simple dataset&lt;/span>
features, targets &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>randn(&lt;span style="color:#ae81ff">1200&lt;/span>, &lt;span style="color:#ae81ff">8&lt;/span>), torch&lt;span style="color:#f92672">.&lt;/span>randn(&lt;span style="color:#ae81ff">1200&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>)
dataset &lt;span style="color:#f92672">=&lt;/span> TensorDataset(features, targets)
data_loader &lt;span style="color:#f92672">=&lt;/span> DataLoader(dataset, batch_size&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">40&lt;/span>, shuffle&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;span style="color:#75715e"># Define a basic neural network&lt;/span>
model &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>Sequential(
    nn&lt;span style="color:#f92672">.&lt;/span>Linear(&lt;span style="color:#ae81ff">8&lt;/span>, &lt;span style="color:#ae81ff">16&lt;/span>),
    nn&lt;span style="color:#f92672">.&lt;/span>ReLU(),
    nn&lt;span style="color:#f92672">.&lt;/span>Linear(&lt;span style="color:#ae81ff">16&lt;/span>, &lt;span style="color:#ae81ff">1&lt;/span>)
)
loss_fn &lt;span style="color:#f92672">=&lt;/span> nn&lt;span style="color:#f92672">.&lt;/span>MSELoss()
optimizer &lt;span style="color:#f92672">=&lt;/span> optim&lt;span style="color:#f92672">.&lt;/span>SGD(model&lt;span style="color:#f92672">.&lt;/span>parameters(), lr&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">0.01&lt;/span>)
accumulation_steps &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>
num_epochs &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;span style="color:#66d9ef">for&lt;/span> epoch &lt;span style="color:#f92672">in&lt;/span> range(num_epochs):
&lt;span style="color:#66d9ef">for&lt;/span> batch_idx, (inputs, labels) &lt;span style="color:#f92672">in&lt;/span> enumerate(data_loader):
outputs &lt;span style="color:#f92672">=&lt;/span> model(inputs)
loss &lt;span style="color:#f92672">=&lt;/span> loss_fn(outputs, labels) &lt;span style="color:#f92672">/&lt;/span> accumulation_steps
loss&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;span style="color:#66d9ef">if&lt;/span> (batch_idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">%&lt;/span> accumulation_steps &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
optimizer&lt;span style="color:#f92672">.&lt;/span>step()
optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Epoch &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>epoch &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>num_epochs&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">, Loss: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>loss&lt;span style="color:#f92672">.&lt;/span>item() &lt;span style="color:#f92672">*&lt;/span> accumulation_steps&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">.4f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
print(&lt;span style="color:#e6db74">&amp;#34;Training finished&amp;#34;&lt;/span>)
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Let&amp;rsquo;s break down the key components:&lt;/p>
&lt;h3 id="1-set-your-accumulation-steps">1. Set your accumulation steps&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">accumulation_steps &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3&lt;/span>
&lt;/code>&lt;/pre>&lt;/div>&lt;p>This defines how many mini-batches to process before updating model weights.&lt;/p>
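&lt;p>As a quick sanity check on these settings (using the numbers from the full example above), you can work out how many optimizer updates actually happen per epoch:&lt;/p>

```python
num_samples, batch_size, accumulation_steps = 1200, 40, 3

batches_per_epoch = num_samples // batch_size                # 30 mini-batches
updates_per_epoch = batches_per_epoch // accumulation_steps  # 10 weight updates per epoch

print(batches_per_epoch, updates_per_epoch)  # 30 10
```

&lt;p>Every weight update now reflects gradient information from three mini-batches instead of one.&lt;/p>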
&lt;h3 id="2-adjust-your-loss-calculation">2. Adjust your loss calculation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">loss &lt;span style="color:#f92672">=&lt;/span> loss_fn(outputs, labels) &lt;span style="color:#f92672">/&lt;/span> accumulation_steps
&lt;/code>&lt;/pre>&lt;/div>&lt;p>We divide each mini-batch loss by the number of accumulation steps so that the accumulated gradient equals the gradient of the mean loss over the whole effective batch, rather than the sum of the mini-batch gradients.&lt;/p>
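&lt;p>A small arithmetic sketch shows why this division matters: the mean over the full effective batch equals the sum of the per-mini-batch means once each is divided by the number of accumulation steps.&lt;/p>

```python
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
accumulation_steps = 3

# Mean loss over the full "effective batch" of 6 samples
full_batch_mean = sum(values) / len(values)

# Sum of per-mini-batch means, each divided by accumulation_steps
chunks = [values[0:2], values[2:4], values[4:6]]
accumulated = sum((sum(c) / len(c)) / accumulation_steps for c in chunks)

print(full_batch_mean, accumulated)  # 3.5 3.5
```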
&lt;h3 id="3-accumulate-gradients-but-delay-the-optimizer-step">3. Accumulate gradients but delay the optimizer step&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">loss&lt;span style="color:#f92672">.&lt;/span>backward()
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Call backward() as usual, but don&amp;rsquo;t immediately call optimizer.step(). PyTorch adds each new gradient into the existing .grad buffers rather than overwriting them, so successive backward() calls accumulate gradients automatically.&lt;/p>
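&lt;p>If you want to convince yourself that backward() really adds gradients up rather than overwriting them, a minimal experiment does the trick:&lt;/p>

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(2 * w).sum().backward()  # d(2w)/dw = 2
(3 * w).sum().backward()  # d(3w)/dw = 3, added into the existing grad

print(w.grad)  # tensor([5.]) because gradients accumulate across backward() calls
```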
&lt;h3 id="4-update-weights-after-accumulation">4. Update weights after accumulation&lt;/h3>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4">&lt;code class="language-python" data-lang="python">&lt;span style="color:#66d9ef">if&lt;/span> (batch_idx &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>) &lt;span style="color:#f92672">%&lt;/span> accumulation_steps &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
    optimizer&lt;span style="color:#f92672">.&lt;/span>step()
    optimizer&lt;span style="color:#f92672">.&lt;/span>zero_grad()
&lt;/code>&lt;/pre>&lt;/div>&lt;p>Only after processing accumulation_steps batches do we update the weights and zero the gradients.&lt;/p>
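&lt;p>One practical caveat: if the number of batches per epoch is not divisible by accumulation_steps, the gradients from the final few batches are never applied. A minimal sketch of an end-of-epoch flush (this flush is an extra step, not shown in the example above):&lt;/p>

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()
accumulation_steps = 3

# 7 batches: the last one would be silently dropped without a flush
batches = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(7)]

for batch_idx, (inputs, labels) in enumerate(batches):
    loss = loss_fn(model(inputs), labels) / accumulation_steps
    loss.backward()
    if (batch_idx + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()

# Apply the gradients from the final incomplete group of batches
if len(batches) % accumulation_steps != 0:
    optimizer.step()
    optimizer.zero_grad()
```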
&lt;h2 id="benefits-of-gradient-accumulation">Benefits of Gradient Accumulation&lt;/h2>
&lt;h3 id="1-train-with-virtually-larger-batch-sizes">1. Train with &amp;ldquo;Virtually&amp;rdquo; Larger Batch Sizes&lt;/h3>
&lt;p>With an accumulation_steps of 3 and a batch_size of 40 (as in our example), you&amp;rsquo;re effectively training with a batch size of 120, but with the memory footprint of just 40 examples at once.&lt;/p>
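&lt;p>You can verify this equivalence numerically with a toy model: the accumulated gradients of the scaled mini-batch losses match the gradient of one large batch, provided nothing in the model (such as batch normalization) depends on the batch size.&lt;/p>

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)
loss_fn = nn.MSELoss()
x, y = torch.randn(12, 8), torch.randn(12, 1)

# Gradient from one full batch of 12
model.zero_grad()
loss_fn(model(x), y).backward()
full_grad = model.weight.grad.clone()

# Accumulated gradient from 3 mini-batches of 4, each loss divided by 3
model.zero_grad()
for xb, yb in zip(x.chunk(3), y.chunk(3)):
    (loss_fn(model(xb), yb) / 3).backward()

print(torch.allclose(full_grad, model.weight.grad, atol=1e-6))  # True
```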
&lt;h3 id="2-improved-training-stability">2. Improved Training Stability&lt;/h3>
&lt;p>Larger effective batch sizes often lead to more stable gradients and smoother loss curves, especially for complex models.&lt;/p>
&lt;h3 id="3-better-hardware-utilization">3. Better Hardware Utilization&lt;/h3>
&lt;p>This technique allows you to fully utilize limited GPU resources while still benefiting from large-batch training dynamics.&lt;/p>
&lt;h2 id="practical-considerations">Practical Considerations&lt;/h2>
&lt;p>When implementing gradient accumulation, keep these points in mind:&lt;/p>
&lt;ul>
&lt;li>Batch Normalization: If your model uses batch normalization layers, be aware that statistics are calculated per mini-batch, not across the accumulated batches. For some applications, this might affect performance.&lt;/li>
&lt;li>Learning Rate Scaling: With larger effective batch sizes, you might need to adjust your learning rate. A common heuristic is to scale the learning rate linearly with the effective batch size.&lt;/li>
&lt;li>Mixed Precision Training: Gradient accumulation works well with mixed precision training, giving you even more memory efficiency.&lt;/li>
&lt;/ul>
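&lt;p>As a concrete sketch of the linear learning-rate scaling heuristic (the baseline values here are illustrative, matching the earlier example):&lt;/p>

```python
base_lr = 0.01          # learning rate tuned for a batch size of 40
base_batch_size = 40
accumulation_steps = 3

effective_batch_size = base_batch_size * accumulation_steps   # 120
scaled_lr = round(base_lr * effective_batch_size / base_batch_size, 6)

print(scaled_lr)  # 0.03 under linear scaling
```

&lt;p>Treat this as a starting point rather than a rule: very large effective batches often also benefit from learning-rate warmup.&lt;/p>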
&lt;h2 id="conclusion">Conclusion&lt;/h2>
&lt;p>Gradient accumulation is one of those techniques that should be in every deep learning practitioner&amp;rsquo;s toolkit. It&amp;rsquo;s easy to implement, its main cost is simply that weight updates happen less frequently (it saves memory, not compute time), and it can dramatically improve your ability to train large models on limited hardware.&lt;/p>
&lt;p>Give the provided code example a try in your next PyTorch project - you might be surprised at how much it improves your training process!&lt;/p>
&lt;p>Happy training! 🚀&lt;/p>
&lt;h2 id="further-resources">Further Resources&lt;/h2>
&lt;ul>
&lt;li>&lt;a href="https://pytorch.org/docs/stable/notes/amp_examples.html" target="_blank" rel="noopener">Deep Learning with Limited GPU Memory&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.deeplearningbook.org/contents/optimization.html" target="_blank" rel="noopener">Optimization for Deep Learning&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>