How do LSTM gates solve the vanishing gradient problem?
An LSTM maintains a cell state that flows through time via additive updates controlled by learned gates, giving gradients a near-linear path across many steps. The forget, input, and output gates let the network selectively retain, write, and expose information rather than crushing every signal through a squashing non-linearity at every step.
How to think about it
An LSTM cell carries two vectors between steps: the cell state C_t and the hidden state h_t. Three sigmoid gates control information flow:
| Gate | Formula | Role |
|---|---|---|
| Forget | f_t = σ(W_f [h_{t-1}, x_t] + b_f) | What fraction of C_{t-1} to keep (values near 0 erase, values near 1 retain) |
| Input | i_t = σ(W_i [h_{t-1}, x_t] + b_i) | How much of the new candidate content to write |
| Output | o_t = σ(W_o [h_{t-1}, x_t] + b_o) | What part of the cell to expose |
The cell state update is additive:
C_t = f_t ⊙ C_{t-1} + i_t ⊙ tanh(W_c [h_{t-1}, x_t] + b_c)
h_t = o_t ⊙ tanh(C_t)
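The ⊙ denotes element-wise multiplication. Because C_t is formed by adding new content to a gated copy of C_{t-1}, the direct path from C_t back to C_{t-1} only scales the gradient element-wise by f_t; it does not pass through a weight matrix or a saturating non-linearity at every step. When the forget gate stays near 1, the gradient survives across many steps, whereas in a vanilla RNN it is multiplied by the recurrent weights and a tanh derivative at each step.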
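The gate formulas and the additive update above can be sketched in NumPy. This is a minimal illustration, not a reference implementation: the `params` layout, the sizes, and the random initialization are assumptions made for the example. The finite-difference check at the end looks only at the direct cell-state path, holding h_{t-1} and x_t fixed; the full gradient also flows through h_{t-1} into the gates, which this sketch ignores.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step. `params` (hypothetical layout) maps "f", "i", "o", "c"
    to (W, b) pairs, with W of shape (hidden, hidden + input).
    Returns h_t, C_t and the forget gate f_t."""
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(params["f"][0] @ concat + params["f"][1])
    i_t = sigmoid(params["i"][0] @ concat + params["i"][1])
    o_t = sigmoid(params["o"][0] @ concat + params["o"][1])
    c_tilde = np.tanh(params["c"][0] @ concat + params["c"][1])
    c_t = f_t * c_prev + i_t * c_tilde                # additive cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t, f_t

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
params = {
    k: (rng.normal(scale=0.5, size=(n_hid, n_hid + n_in)), np.zeros(n_hid))
    for k in "fioc"
}
x_t = rng.normal(size=n_in)
h_prev = rng.uniform(-1.0, 1.0, size=n_hid)
c_prev = rng.normal(size=n_hid)

h_t, c_t, f_t = lstm_step(x_t, h_prev, c_prev, params)

# Finite-difference Jacobian dC_t/dC_{t-1}, with h_{t-1} and x_t held fixed.
eps = 1e-6
jac = np.zeros((n_hid, n_hid))
for j in range(n_hid):
    bumped = c_prev.copy()
    bumped[j] += eps
    jac[:, j] = (lstm_step(x_t, h_prev, bumped, params)[1] - c_t) / eps

# Along the direct path the Jacobian is diag(f_t): no weight matrix, no tanh.
print(np.allclose(jac, np.diag(f_t), atol=1e-6))
```

Over many steps the direct-path factor is the element-wise product of the forget gates, so it still decays when f_t is well below 1. The gates make preserving the gradient learnable rather than guaranteed.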
The GRU simplifies this to two gates (reset r_t and update z_t), merges the cell and hidden states into a single vector h_t, and achieves comparable performance on many tasks with fewer parameters:
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ tanh(W [r_t ⊙ h_{t-1}, x_t])
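A GRU step following the update above might look like the sketch below. The text does not define z_t and r_t, so the code assumes the common formulation z_t = σ(W_z [h_{t-1}, x_t]) and r_t = σ(W_r [h_{t-1}, x_t]). Biases are omitted to match the formula, and the names `W_z`, `W_r`, and `gru_step` are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, W_z, W_r, W):
    """One GRU step. All weight matrices have shape (hidden, hidden + input).
    The gate definitions are an assumed standard form, not taken from the text."""
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ concat)                       # update gate
    r_t = sigmoid(W_r @ concat)                       # reset gate
    # Candidate state uses the reset-gated previous hidden state.
    h_tilde = np.tanh(W @ np.concatenate([r_t * h_prev, x_t]))
    # Interpolate between the old state and the candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Illustrative sizes and random weights (assumptions for the example).
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W_z, W_r, W = (rng.normal(scale=0.5, size=(n_hid, n_hid + n_in)) for _ in range(3))

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):                # run five time steps
    h = gru_step(x_t, h, W_z, W_r, W)
print(h.shape)
```

Because h_t is an element-wise interpolation between h_{t-1} and the candidate, an update gate near 0 copies the previous state forward almost unchanged. That plays the same role as a forget gate near 1 in the LSTM.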