datarekha
Deep Learning · Medium · Asked at Google, Meta, Amazon

How do LSTM gates solve the vanishing gradient problem?

The short answer

An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.

How to think about it

An LSTM cell carries two vectors between steps: the cell state C_t and the hidden state h_t. Three sigmoid gates control information flow:

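| Gate | Formula | Role |
|---|---|---|
| Forget | f_t = σ(W_f [h_{t-1}, x_t] + b_f) | What fraction of C_{t-1} to keep (f_t near 0 erases, near 1 retains) |
| Input | i_t = σ(W_i [h_{t-1}, x_t] + b_i) | How much of the new candidate content to write |
| Output | o_t = σ(W_o [h_{t-1}, x_t] + b_o) | How much of the cell state to expose in h_t |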

The cell state update is additive:

C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(W_c [h_{t-1}, x_t] + b_c)

h_t = o_t ⊙ tanh(C_t)
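The update equations above can be traced in a few lines of code. The following is a minimal sketch in plain Python, not a reference implementation: the `params` dictionary layout, the helper names `affine` and `init_params`, and the random initialization are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, b, v):
    """Return W @ v + b, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step following the equations above.

    params maps "f", "i", "o", "c" to (W, b) pairs. This layout is a
    hypothetical choice for the sketch, not a library convention.
    """
    hx = h_prev + x_t  # list concatenation plays the role of [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(*params["f"], hx)]
    i_t = [sigmoid(a) for a in affine(*params["i"], hx)]
    o_t = [sigmoid(a) for a in affine(*params["o"], hx)]
    c_tilde = [math.tanh(a) for a in affine(*params["c"], hx)]
    # Additive cell update: gated old state plus gated candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def init_params(hidden, inputs, seed=0):
    """Small random weights and zero biases (an arbitrary choice)."""
    rng = random.Random(seed)
    def layer():
        W = [[rng.uniform(-0.5, 0.5) for _ in range(hidden + inputs)]
             for _ in range(hidden)]
        return W, [0.0] * hidden
    return {name: layer() for name in ("f", "i", "o", "c")}

# Run three steps over a toy sequence of 2-dimensional inputs.
params = init_params(hidden=3, inputs=2)
h, c = [0.0] * 3, [0.0] * 3
for x in ([1.0, 0.0], [0.0, 1.0], [0.5, -0.5]):
    h, c = lstm_step(x, h, c, params)
```

Note that `c_t` is the only quantity carried forward without passing through a squashing function; `h_t` is always bounded by the output gate and the final tanh.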

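Here ⊙ denotes element-wise multiplication. Because C_t is formed by adding to a gated copy of C_{t-1}, the direct path from C_t back to C_{t-1} has local derivative f_t and does not pass through a saturating non-linearity at every step. When the forget gate stays near 1, the gradient along the cell state shrinks only slowly, so this path is much smoother than in a vanilla RNN, where every step multiplies the gradient by the recurrent weights and a tanh derivative that is at most 1. The gates mitigate vanishing gradients rather than eliminate them: if f_t stays well below 1, the gradient along the cell state still decays.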
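A small scalar experiment makes the difference concrete. This is a sketch under strong simplifying assumptions: the forget gate is held at a fixed 0.99 instead of being learned, only the direct cell-state path is counted (the indirect paths through h_{t-1} into the gates are ignored), and the vanilla RNN is the one-unit recurrence h_t = tanh(w·h_{t-1} + x_t) with an arbitrarily chosen weight and input.

```python
import math

def cell_path_gradient(forget_values):
    """Direct-path dC_T/dC_0 for a scalar cell: the product of forget gates."""
    grad = 1.0
    for f in forget_values:
        grad *= f
    return grad

def vanilla_rnn_gradient(w, inputs, h0=0.0):
    """dh_T/dh_0 for the scalar recurrence h_t = tanh(w * h_{t-1} + x_t)."""
    h, grad = h0, 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: each step contributes w * tanh'(pre-activation).
        grad *= w * (1.0 - h * h)
    return grad

steps = 100
lstm_grad = cell_path_gradient([0.99] * steps)                # illustrative gate value
rnn_grad = vanilla_rnn_gradient(w=0.9, inputs=[1.0] * steps)  # illustrative w and x
print(f"cell-state path: {lstm_grad:.3e}, vanilla RNN: {rnn_grad:.3e}")
```

With these settings the cell-state gradient remains of order 0.1 to 1 after 100 steps, while the vanilla RNN gradient is many orders of magnitude smaller because its unit saturates. Lowering the fixed forget value makes the cell-state path decay too.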

A GRU simplifies this to two gates, a reset gate r_t and an update gate z_t, merges the cell and hidden state into a single vector h_t, and achieves comparable performance on many tasks with fewer parameters:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ tanh(W [r_t ⊙ h_{t-1}, x_t])
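The GRU update can be sketched the same way. The formula above does not spell out the gates, so this example assumes the common form z_t = σ(W_z [h_{t-1}, x_t]) and r_t = σ(W_r [h_{t-1}, x_t]), with biases omitted to match the formula. The function and argument names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Return W @ v, with W stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step; gate definitions are assumed, biases omitted."""
    hx = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    z_t = [sigmoid(a) for a in matvec(W_z, hx)]  # update gate
    r_t = [sigmoid(a) for a in matvec(W_r, hx)]  # reset gate
    # Candidate state sees a reset-gated copy of the previous state.
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t
    h_tilde = [math.tanh(a) for a in matvec(W_h, gated)]
    # Interpolate between the old state and the candidate.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_tilde)]

# Two hidden units, one input feature, small hand-picked weights.
W_z = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W_r = [[0.2, 0.1, -0.3], [-0.1, 0.0, 0.5]]
W_h = [[0.3, -0.4, 0.2], [0.1, 0.2, -0.2]]
h = [0.0, 0.0]
for x in ([1.0], [-1.0], [0.5]):
    h = gru_step(x, h, W_z, W_r, W_h)
```

The interpolation in the last line of `gru_step` plays the role of the LSTM's additive cell update: when z_t is near 0, h_{t-1} is copied forward almost unchanged.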
