Stochastic Gradient Descent (SGD) is one of the most widely used optimisation methods for training machine learning models, especially neural networks. Its popularity comes from a practical trade-off: instead of computing the exact gradient of the loss function over the entire dataset (which can be slow and memory-heavy), SGD estimates the gradient using a subset of data. This estimate is faster, but it introduces randomness—often described as gradient noise or variance.
Batch size sits at the centre of this trade-off. Small batches make training noisy but often faster per step, while large batches make the gradient estimate smoother but more expensive. Understanding how batch size impacts variance helps you pick training settings that converge reliably and generalise well. This topic also appears frequently in applied learning paths, including a data scientist course in Delhi, because it connects theory, compute constraints, and real-world model behaviour.
What “SGD Variance” Actually Means
Consider a dataset of N examples and a loss function L(θ) defined as the average loss over all examples. The true gradient is the gradient of this full objective, usually called the population or full-batch gradient.
SGD uses a mini-batch of size B to estimate that gradient. The mini-batch gradient is an unbiased estimate under common sampling assumptions, meaning its expected value equals the true gradient. However, the estimate fluctuates from one batch to another. This fluctuation is the “variance” of SGD:
- High variance: the estimated gradient direction changes sharply across steps.
- Low variance: the estimated gradient is closer to the true gradient and changes more smoothly.
In practical terms, variance influences how stable training feels: loss curves, accuracy curves, and even whether training diverges for a given learning rate.
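To make this concrete, here is a minimal NumPy sketch on hypothetical linear-regression data (all names and sizes are illustrative choices, not from any particular library). The average of many mini-batch gradients matches the full-batch gradient, while individual estimates fluctuate around it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data (hypothetical setup for illustration).
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(d)  # current parameters

def grad(idx, w):
    """Mean-squared-error gradient over the rows in idx."""
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

full_grad = grad(np.arange(N), w)  # the "true" full-batch gradient

# Average many independent mini-batch gradients: the mean approaches
# the full gradient (unbiasedness), while each estimate fluctuates.
B = 32
estimates = np.stack([grad(rng.choice(N, B, replace=False), w)
                      for _ in range(2_000)])
print(np.linalg.norm(estimates.mean(axis=0) - full_grad))  # small
print(estimates.std(axis=0))  # per-coordinate fluctuation: the SGD "variance"
```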
How Batch Size Changes the Gradient Approximation
Batch size controls how much averaging happens in each gradient estimate. With a larger batch, more samples contribute to the gradient, so random errors tend to cancel out. A useful rule of thumb is:
- Variance decreases roughly in proportion to 1/B when samples are independent and identically distributed (a simplified but helpful assumption).
- Standard deviation decreases roughly in proportion to 1/√B.
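The 1/B rule is easy to check empirically. The sketch below treats i.i.d. scalar draws as stand-ins for per-example gradients and compares the variance of mini-batch means at B = 8 and B = 64; the ratio should land near 64/8 = 8:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-example "gradients": i.i.d. draws whose mean plays
# the role of the true gradient (scalar case for simplicity).
per_example = rng.normal(loc=1.0, scale=2.0, size=100_000)

def minibatch_variance(B, trials=5_000):
    """Empirical variance of the mean of B randomly drawn examples."""
    means = np.array([per_example[rng.integers(0, len(per_example), B)].mean()
                      for _ in range(trials)])
    return means.var()

v8, v64 = minibatch_variance(8), minibatch_variance(64)
print(v8 / v64)  # close to 64/8 = 8, i.e. variance scales like 1/B
```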
What does that mean operationally?
Small batches (e.g., 8–64)
- Higher gradient noise: updates “bounce” around the optimal direction.
- Faster iteration speed: each step is cheap, so you can take many steps quickly.
- Often better generalisation: the noise can act like a regulariser, helping models avoid sharp minima.
Large batches (e.g., 512–8192+)
- Lower gradient noise: updates align more closely with the true gradient.
- Higher compute and memory cost per step.
- May require learning-rate tuning: overly aggressive learning rates can still destabilise training, and overly conservative ones can slow it down.
This is why many practitioners do not chase the largest possible batch size even if hardware allows it.
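One reason shows up even on a toy least-squares problem (hypothetical data, illustrative sizes): at a fixed learning rate, a large batch takes far fewer updates per epoch, so per-epoch progress can be much slower even though each individual step is smoother:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy least-squares problem (hypothetical data for illustration).
N, d = 1024, 8
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def one_epoch(B, lr=0.05):
    """One pass over the data with mini-batches of size B, from w = 0."""
    w = np.zeros(d)
    order = rng.permutation(N)
    for i in range(0, N, B):
        idx = order[i:i + B]
        Xb, yb = X[idx], y[idx]
        w -= lr * Xb.T @ (Xb @ w - yb) / len(idx)
    return loss(w)

print(one_epoch(B=8))    # 128 noisy steps: loss drops substantially
print(one_epoch(B=512))  # 2 smooth steps: loss barely moves
```

In practice the large-batch run would be re-tuned (typically with a larger learning rate) rather than compared at identical settings, which is exactly the tuning burden the text describes.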
Convergence Speed vs Generalisation: The Practical Trade-off
Batch size does not just affect variance; it changes the training dynamics. Two common outcomes are worth separating:
1) Optimisation behaviour (reaching a low training loss)
Lower variance from larger batches can produce smoother, more predictable descent. For convex problems, this can be attractive. For deep learning, the landscape is non-convex, and “smooth descent” is not always the shortest path to a good solution.
Small-batch noise can help the optimiser escape plateaus or saddle points. It can also keep training moving when gradients become tiny.
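As a toy illustration (a synthetic saddle surface, not a real training loss), the sketch below runs descent on f(x, y) = x² − y². Starting exactly on the ridge, deterministic descent converges to the saddle, while a small amount of injected gradient noise, standing in for mini-batch sampling error, pushes the iterate off it:

```python
import numpy as np

rng = np.random.default_rng(2)

# Saddle surface f(x, y) = x**2 - y**2 with a saddle at the origin.
def grad(p):
    x, y = p
    return np.array([2 * x, -2 * y])

lr, steps = 0.05, 100
start = np.array([1.0, 0.0])  # exactly on the ridge leading to the saddle

# Deterministic (full-batch) descent: y stays 0, so we converge to the saddle.
p = start.copy()
for _ in range(steps):
    p -= lr * grad(p)
det_end = p

# "SGD-like" descent: small gradient noise mimics mini-batch sampling error.
p = start.copy()
for _ in range(steps):
    p -= lr * (grad(p) + 0.01 * rng.normal(size=2))
noisy_end = p

print(det_end)    # ~ (0, 0): stuck at the saddle
print(noisy_end)  # y has moved off 0: noise pushed us off the ridge
```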
2) Generalisation behaviour (performing well on unseen data)
Noise from smaller batches can prevent the optimiser from settling into solutions that fit training data too tightly. Empirically, many deep learning setups show that overly large batches can converge to solutions that generalise slightly worse unless additional regularisation or schedule tuning is applied.
These considerations are commonly explored in hands-on training labs in a data scientist course in Delhi, where learners compare loss curves, validation metrics, and training time across different batch sizes.
Choosing a Batch Size in Real Projects
There is no universal “best” batch size, but there are consistent decision patterns:
- Start with a stable baseline. For many problems, batch sizes like 32, 64, or 128 are practical starting points because they balance stability and compute.
- Tune learning rate along with batch size. Batch size and learning rate are linked: if you increase batch size, you often need to adjust the learning rate and sometimes the warm-up schedule. Without tuning, large batches can underperform simply due to mismatched optimisation settings.
- Watch gradient stability signals. If loss is wildly unstable or diverges, your effective noise may be too high for the learning rate. If progress is very slow and gradients look “too smooth,” you may be using a batch size that is too large for the learning schedule or model.
- Respect hardware and throughput. If a batch size does not fit GPU memory, forcing it can trigger expensive workarounds. Techniques like gradient accumulation can simulate larger batches while keeping memory usage manageable.
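The accumulation idea can be sketched in plain NumPy (hypothetical data; real frameworks expose the same pattern through their autograd APIs). Averaging the gradients of k micro-batches before a single update reproduces the one large-batch step, while only one micro-batch's intermediate results need to be live at a time:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy least-squares setup (hypothetical data for illustration).
N, d = 256, 4
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
w0 = rng.normal(size=d)
lr = 0.1

def grad(Xb, yb, w):
    """Mean-squared-error gradient over one (micro-)batch."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

# One step on the full "large" batch of 256.
w_big = w0 - lr * grad(X, y, w0)

# Gradient accumulation: 4 micro-batches of 64, gradients averaged
# before a single update. Only one micro-batch is processed at a time.
accum = np.zeros(d)
for i in range(4):
    Xb, yb = X[i * 64:(i + 1) * 64], y[i * 64:(i + 1) * 64]
    accum += grad(Xb, yb, w0) / 4
w_accum = w0 - lr * accum

print(np.allclose(w_big, w_accum))  # True: same update, smaller peak memory
```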
Conclusion
SGD variance is a direct consequence of estimating the true objective gradient from only a subset of data. Batch size controls the quality of this stochastic approximation: larger batches reduce variance and stabilise updates, while smaller batches increase noise and can sometimes improve exploration and generalisation. The right choice depends on your model, dataset, compute budget, and tolerance for tuning.
A practical approach is to begin with a moderate batch size, tune the learning rate and schedule, and validate performance using both training and validation metrics. With this mindset, batch size stops being a guess and becomes a deliberate lever—one that is highly relevant in applied workflows and frequently discussed in a data scientist course in Delhi.