The vanishing gradient problem occurs when gradients shrink exponentially as they propagate backward through deep networks, leaving early layers nearly unchanged during training. Sigmoid activations caused this because their derivative never exceeds 0.25. ReLU largely solved it by having a derivative of exactly 1 for all positive inputs, letting gradients flow backward through deep networks without shrinking.
Pithy Cyborg | AI FAQs – The Details
Question: What is the vanishing gradient problem, and how did ReLU solve what sigmoid activation couldn’t?
Asked by: Claude Sonnet 4.6
Answered by: Mike D (MrComputerScience)
From Pithy Cyborg | AI News Made Simple
And Pithy Security | Cybersecurity News
Why Sigmoid Activations Strangle Gradient Flow in Deep Networks
Backpropagation trains a neural network by computing the gradient of the loss function with respect to every weight, then nudging each weight in the direction that reduces loss. To reach the early layers of a deep network, gradients must be multiplied together repeatedly as they pass backward through each layer. This repeated multiplication is where sigmoid creates a fatal problem.
The sigmoid function squashes any input into the range (0, 1). Its derivative, which is what backpropagation actually uses, has a maximum value of 0.25 and approaches zero at both extremes. Every time a gradient passes backward through a sigmoid layer, it is multiplied by a number no larger than 0.25. Pass through 10 layers and the gradient has been multiplied by at most 0.25^10, which is roughly 0.000001. Pass through 20 layers and it effectively reaches zero.
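The arithmetic above can be checked directly. A minimal sketch in plain Python (no framework assumed), using the best case where every layer sits at the derivative's peak:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

# Best case: every layer's pre-activation is exactly 0, so each backward
# step multiplies the gradient by the maximum possible factor, 0.25.
grad = 1.0
for _ in range(10):
    grad *= sigmoid_derivative(0.0)

print(grad)  # 0.25**10, roughly 9.5e-07
```

Any realistic network is worse than this: pre-activations away from zero push the derivative below 0.25, so the gradient shrinks even faster.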
Layers near the input receive gradients so small they might as well be zero. Their weights barely update during training. The network learns in its final layers and stays essentially random in its early layers, no matter how long you train. This was the central reason deep networks were considered intractable through the 1990s and early 2000s. Researchers could not train networks with more than a handful of layers reliably, which severely limited what neural networks could represent.
How ReLU’s Simple Design Keeps Gradients Alive
ReLU (Rectified Linear Unit) is defined as f(x) = max(0, x). For any positive input, the output equals the input. For any negative input, the output is zero. That is the entire function.
Its derivative is equally simple: 1 for all positive inputs, 0 for all negative inputs. When a gradient passes backward through a ReLU neuron that was active during the forward pass, it is multiplied by exactly 1. It passes through unchanged. Stack 100 ReLU layers and a gradient that enters the backward pass at layer 100 arrives at layer 1 with the same magnitude it started with, assuming all neurons on its path were active.
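The contrast with the sigmoid case can be shown in the same style; a minimal sketch, assuming every neuron on the backward path was active:

```python
def relu_derivative(x):
    # 1 for active (positive) inputs, 0 otherwise.
    return 1.0 if x > 0 else 0.0

# A gradient crossing 100 active ReLU layers keeps its full magnitude,
# because each backward step multiplies it by exactly 1.
grad = 1.0
for _ in range(100):
    grad *= relu_derivative(2.0)  # 2.0 stands in for any positive pre-activation

print(grad)  # 1.0
```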
This is why the deep learning revolution happened when it did. The combination of ReLU activations, better weight initialization (Xavier and He initialization), and eventually batch normalization made training networks with dozens or hundreds of layers feasible for the first time. AlexNet's 2012 ImageNet victory used ReLU deliberately: the authors reported that ReLU networks trained several times faster than saturating alternatives like tanh at that depth.
Why AI hallucinations persist may be related to the same training dynamics: if the layers of a deep network that encode factual associations are among those that vanishing gradients starve of learning signal, that would help explain why factual grounding in large language models is unevenly distributed and hard to correct.
Where ReLU Falls Short and What Replaced It
ReLU introduced its own problem: dying ReLU. A neuron whose input is always negative outputs zero on every forward pass and receives a gradient of zero on every backward pass. Its weights never update. It is permanently dead. In large networks, a significant fraction of neurons can die early in training, particularly with high learning rates or poor initialization, reducing the network’s effective capacity.
Leaky ReLU addressed this by using a small slope (typically 0.01) for negative inputs instead of zero, keeping a trickle of gradient flowing. Parametric ReLU (PReLU) makes that slope a learned parameter. ELU (Exponential Linear Unit) uses an exponential curve for negative inputs, allowing negative outputs that push mean activations toward zero, which can accelerate convergence.
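The difference between a dead ReLU neuron and a leaky one comes down to the gradient at negative inputs; a minimal sketch comparing the two:

```python
def relu(x):
    return max(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha is the small negative-side slope, typically 0.01.
    return x if x > 0 else alpha * x

def relu_grad(x):
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

x = -3.0  # a neuron whose pre-activation is stuck negative
# ReLU: output 0, gradient 0 -> the weights never update, the neuron is dead.
# Leaky ReLU: small negative output, gradient 0.01 -> a trickle of learning
# signal still reaches the weights, so the neuron can recover.
print(relu(x), relu_grad(x))              # 0.0 0.0
print(leaky_relu(x), leaky_relu_grad(x))  # -0.03 0.01
```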
In 2026, the most widely used activation in transformer models is GELU (Gaussian Error Linear Unit), a smooth, probabilistically weighted relative of ReLU: it computes x · Φ(x), where Φ is the standard normal CDF. GELU is used in GPT-2, BERT, and many large language models because it empirically outperforms ReLU on language tasks, likely because its smooth non-linearity interacts better with attention mechanisms than ReLU's hard threshold.
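GELU has no closed form in elementary functions, so implementations commonly use a tanh-based approximation; a minimal sketch of that approximation:

```python
import math

def gelu(x):
    # Tanh approximation of x * Phi(x), widely used in GPT-2/BERT-style code.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

# GELU behaves like ReLU for large |x| but is smooth through zero:
# large positive inputs pass through almost unchanged, large negative
# inputs are squashed toward (but not exactly to) zero.
print(gelu(3.0))   # close to 3.0
print(gelu(-3.0))  # small negative value near 0
print(gelu(0.0))   # exactly 0.0
```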
SwiGLU, a variant used in LLaMA, Mistral, and most modern open-weight models, combines a gating mechanism with a smooth activation to achieve better parameter efficiency than GELU alone. The evolution from sigmoid to ReLU to GELU to SwiGLU is a direct line: each generation solved a training dynamics problem that the previous generation introduced.
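The gating idea behind SwiGLU can be sketched at scalar scale. In real transformers the gate and up projections are weight matrices; here they are single hypothetical scalars (`w_gate`, `w_up` are illustrative names, not from any library):

```python
import math

def silu(x):
    # Swish / SiLU: x * sigmoid(x), the smooth activation inside SwiGLU.
    return x / (1.0 + math.exp(-x))

def swiglu(x, w_gate, w_up):
    # Scalar sketch: the gate branch (passed through SiLU) multiplicatively
    # modulates the up branch. In practice both branches are linear layers.
    return silu(x * w_gate) * (x * w_up)

print(swiglu(1.0, 1.0, 2.0))  # silu(1.0) * 2.0
```

The gate lets the network learn, per feature, how much of the signal to pass, which is where the parameter efficiency over a plain GELU feed-forward block is usually attributed.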
What This Means For You
- Default to GELU or SwiGLU for transformer architectures. ReLU is still appropriate for CNNs and simpler feedforward networks but underperforms on attention-based models.
- Use He initialization with ReLU layers. Xavier initialization was designed for sigmoid and tanh; He initialization accounts for ReLU’s zero-output half and produces better-behaved gradients at the start of training.
- Monitor dead neuron rates during training. If more than 10 to 20% of your ReLU neurons are consistently outputting zero, lower your learning rate or switch to Leaky ReLU before continuing.
- Add batch normalization or layer normalization to deep networks regardless of activation function. Normalization addresses gradient scale problems that activation function choice alone cannot fully prevent.
- Understand that activation function choice interacts with architecture. The optimal activation for a CNN is not the optimal activation for a transformer, and benchmarking on your specific task is more reliable than following generic recommendations.
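The dead-neuron check in the list above can be sketched as a small monitoring helper. This is a hypothetical illustration, not any framework's API; `activations` is assumed to be a batch of post-ReLU values, one list per sample:

```python
def dead_neuron_rate(activations):
    # Fraction of neurons whose ReLU output is zero for EVERY sample in the
    # batch. activations: list of per-sample lists (batch x neurons).
    n_neurons = len(activations[0])
    dead = sum(
        1 for j in range(n_neurons)
        if all(sample[j] == 0.0 for sample in activations)
    )
    return dead / n_neurons

# Toy batch of 2 samples x 3 neurons: neuron 0 is zero in every sample.
batch = [[0.0, 1.2, 0.0],
         [0.0, 0.0, 0.3]]
print(dead_neuron_rate(batch))  # 1/3
```

A single batch only gives a noisy estimate; in practice you would average this over several batches before deciding to lower the learning rate or switch to Leaky ReLU.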
Pithy Cyborg | AI News Made Simple
Subscribe (Free): https://pithycyborg.substack.com/subscribe
Read archives (Free): https://pithycyborg.substack.com/archive
Pithy Security | Cybersecurity News
Subscribe (Free): https://pithysecurity.substack.com/subscribe
Read archives (Free): https://pithysecurity.substack.com/archive
