Deep Learning · Medium · Asked at Google, Meta, Amazon

How do LSTM gates solve the vanishing gradient problem?

For ML Engineer, AI / LLM Engineer, Data Scientist

The short answer

An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.

How to think about it

An LSTM cell carries two vectors between steps: the cell state C_t and the hidden state h_t. Three sigmoid gates control information flow:

Gate	Formula	Role
Forget	`f_t = σ(W_f [h_{t-1}, x_t] + b_f)`	What fraction of `C_{t-1}` to erase
Input	`i_t = σ(W_i [h_{t-1}, x_t] + b_i)`	How much of the new candidate content to write
Output	`o_t = σ(W_o [h_{t-1}, x_t] + b_o)`	What part of the cell to expose

The cell state update is additive:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(W_c [h_{t-1}, x_t] + b_c)

h_t = o_t ⊙ tanh(C_t)
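The gate and state equations above can be sketched as a single forward step in NumPy. This is a minimal illustration, not a production implementation: the names `lstm_step`, `init_params`, and the `params` dictionary layout are assumptions made for this example, and the small random initialisation is arbitrary.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(input_size, hidden_size, seed=0):
    """Hypothetical helper: one (W, b) pair per gate 'f', 'i', 'o' and candidate 'c'."""
    rng = np.random.default_rng(seed)
    shape = (hidden_size, hidden_size + input_size)
    return {k: (rng.normal(0.0, 0.1, shape), np.zeros(hidden_size)) for k in "fioc"}

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the equations above."""
    hx = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_o, b_o = params["o"]
    W_c, b_c = params["c"]

    f_t = sigmoid(W_f @ hx + b_f)            # forget gate
    i_t = sigmoid(W_i @ hx + b_i)            # input gate
    o_t = sigmoid(W_o @ hx + b_o)            # output gate
    c_tilde = np.tanh(W_c @ hx + b_c)        # candidate content

    c_t = f_t * c_prev + i_t * c_tilde       # additive cell-state update
    h_t = o_t * np.tanh(c_t)                 # exposed hidden state
    return h_t, c_t

# Run a short sequence of 5 inputs of size 3 through a cell with 4 hidden units.
params = init_params(input_size=3, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for x in np.ones((5, 3)):
    h, c = lstm_step(x, h, c, params)
```

Note that `c_prev` only ever meets an element-wise multiply and an add on its way to `c_t`; the `tanh` is applied on the branch that produces `h_t`, not on the cell-state path itself.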

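The ⊙ denotes element-wise multiplication. Because C_t is formed by adding to C_{t-1} (gated, but additive), the derivative of C_t with respect to C_{t-1} along the direct cell-state path is simply f_t, with no weight matrix or tanh derivative in between. When the forget gate stays close to 1, gradients flow back through the cell state across many steps without being repeatedly squashed. In a vanilla RNN, each step instead multiplies the gradient by the recurrent weight matrix and a tanh derivative, which tends to shrink (or blow up) it exponentially. The gradient can still decay when f_t is well below 1, so the LSTM mitigates the vanishing gradient problem rather than eliminating it.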
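To make the difference concrete, here is a small scalar sketch comparing the two backward paths over 100 steps. It looks only at the direct cell-state path of the LSTM (ignoring the indirect paths through h_{t-1}), and the specific numbers (a forget gate of 0.99, a recurrent weight of 0.9, a pre-activation of 1.0) are illustrative assumptions, not values from any trained model.

```python
import numpy as np

def lstm_direct_path_gradient(forget_gates):
    """dC_T/dC_0 along the direct cell-state path: the product of the f_t values."""
    return float(np.prod(forget_gates))

def vanilla_rnn_gradient(w, pre_activations):
    """Scalar vanilla RNN h_t = tanh(a_t), a_t = w * h_{t-1} + ...:
    dh_T/dh_0 is the product of w * (1 - tanh(a_t)^2) over all steps."""
    return float(np.prod(w * (1.0 - np.tanh(pre_activations) ** 2)))

steps = 100
lstm_grad = lstm_direct_path_gradient(np.full(steps, 0.99))   # roughly 0.37
rnn_grad = vanilla_rnn_gradient(0.9, np.full(steps, 1.0))     # below 1e-40

print(f"LSTM direct path: {lstm_grad:.3g}")
print(f"Vanilla RNN:      {rnn_grad:.3g}")
```

With these assumed values the LSTM path keeps a usable fraction of the gradient after 100 steps, while the vanilla RNN product is effectively zero. Lowering the forget gate to, say, 0.5 makes the LSTM product vanish too, which is why the gate has to learn to stay open for information worth keeping.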

GRU simplifies this to two gates (reset and update), merges cell and hidden state, and achieves comparable performance on many tasks with fewer parameters:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ tanh(W [r_t ⊙ h_{t-1}, x_t])
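A matching NumPy sketch of one GRU step follows. The update line mirrors the equation above; the definitions of the gates, `z_t = σ(W_z [h_{t-1}, x_t])` and `r_t = σ(W_r [h_{t-1}, x_t])`, are not given in the text and are assumed here from the standard GRU formulation, with biases omitted to match the equation. The function name `gru_step` is likewise just for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step; each weight matrix has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])                  # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ hx)                             # update gate (assumed standard form)
    r_t = sigmoid(W_r @ hx)                             # reset gate (assumed standard form)
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))  # candidate state
    return (1.0 - z_t) * h_prev + z_t * h_tilde         # interpolate old and new state

# Example: 3 input features, 4 hidden units, small random weights.
rng = np.random.default_rng(0)
W_z, W_r, W = (rng.normal(0.0, 0.1, (4, 7)) for _ in range(3))
h = gru_step(np.ones(3), np.zeros(4), W_z, W_r, W)
```

The step uses three weight matrices where the LSTM step uses four, and a single state vector `h` does the work of both the cell and hidden state.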
