Recurrent neural networks are neural architectures designed for data with an intrinsic ordering, where the current prediction should depend not only on the current input but also on what has been seen before. This makes them a natural fit for sequence problems such as language modeling, part-of-speech tagging, speech recognition, and time-series forecasting.
Unlike a standard feedforward network, an RNN reuses the same parameters at each step in the sequence. The network therefore maintains a hidden state that acts as a compressed summary of the past.
RNNs
An RNN processes a sequence one element at a time. At step $t$, it combines the current input $x_t$ with the previous hidden state $h_{t-1}$ to produce a new hidden state $h_t$.
Definition (Recurrent Neural Network Layer)
A simple recurrent neural network layer is defined by
$$h_t = \phi(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$$
and, if an output is required at each step,
$$y_t = W_{hy} h_t + b_y.$$
Here $x_t$ is the input at time $t$, $h_t$ is the hidden state, $y_t$ is the output, and $\phi$ is typically a tanh or ReLU activation. The parameters $W_{xh}$, $W_{hh}$, $b_h$, $W_{hy}$, and $b_y$ are shared across all time steps.
The parameters have the following roles:
$W_{xh}$ maps the current input into the hidden state.
$W_{hh}$ maps the previous hidden state into the next hidden state.
$b_h$ is the hidden-state bias.
$W_{hy}$ maps the hidden state to the output.
$b_y$ is the output bias.
Because the same parameters are reused at every step, the model can process sequences of varying length without increasing the number of learned weights. Conceptually, the network can be unrolled through time into a deep computation graph with one copy of the recurrent update per time step.
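The recurrent update and the parameter sharing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the names `W_xh`, `W_hh`, `b_h`, `W_hy`, `b_y`, the tanh activation, the zero initial hidden state, and the toy weight values are all assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_forward(xs, W_xh, W_hh, b_h, W_hy, b_y):
    """Run a tanh RNN over the sequence xs; return (hidden states, outputs)."""
    h = [0.0] * len(b_h)  # assumed initial hidden state: all zeros
    hs, ys = [], []
    for x in xs:
        # New hidden state from the current input and the previous hidden state.
        pre = [a + b + c for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
        h = [math.tanh(p) for p in pre]
        # Per-step output read from the hidden state.
        y = [a + b for a, b in zip(matvec(W_hy, h), b_y)]
        hs.append(h)
        ys.append(y)
    return hs, ys

# Toy parameters (hypothetical values): input size 2, hidden size 3, output size 1.
W_xh = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
W_hh = [[0.1, 0.0, -0.1], [0.2, 0.1, 0.0], [0.0, -0.3, 0.1]]
b_h = [0.0, 0.0, 0.0]
W_hy = [[1.0, -1.0, 0.5]]
b_y = [0.1]

# The same weights handle a length-2 and a length-5 sequence.
short = [[1.0, 0.0], [0.0, 1.0]]
longer = short + [[1.0, 1.0], [0.5, -0.5], [0.0, 0.0]]
hs_short, ys_short = rnn_forward(short, W_xh, W_hh, b_h, W_hy, b_y)
hs_longer, ys_longer = rnn_forward(longer, W_xh, W_hh, b_h, W_hy, b_y)
```

Because each step only sees the current input and the previous hidden state, the first two outputs of the longer sequence coincide with the outputs of the shorter one.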
Training Through Time
Training is typically performed by applying the backpropagation algorithm to the unrolled network. This is known as backpropagation through time or BPTT.
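If the loss depends on all outputs, for example
$$L = \sum_{t=1}^{T} L_t(y_t),$$
then the gradients at early time steps depend on repeated multiplication by the recurrent Jacobian $\partial h_k / \partial h_{k-1}$. This is the source of the main optimization difficulties.
From the computational graph point of view, the recurrent weight matrix $W_{hh}$ influences the loss through every hidden state that depends on it. Hence the total gradient is a sum over time:
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L}{\partial h_t} \frac{\partial h_t}{\partial W_{hh}},$$
where $\partial h_t / \partial W_{hh}$ denotes the local derivative of the update at time $t$, taken with $h_{t-1}$ held fixed.
Since each hidden state also affects all later hidden states, the term $\partial L / \partial h_t$ itself accumulates contributions from future time steps. Writing this recursively,
$$\frac{\partial L}{\partial h_t} = \frac{\partial L_t}{\partial y_t} \frac{\partial y_t}{\partial h_t} + \frac{\partial L}{\partial h_{t+1}} \frac{\partial h_{t+1}}{\partial h_t},$$
with the second term absent at $t = T$.
Equivalently, if we expand the contribution from all future losses, then
$$\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \sum_{s=t}^{T} \frac{\partial L_s}{\partial y_s} \frac{\partial y_s}{\partial h_s} \left( \prod_{k=t+1}^{s} \frac{\partial h_k}{\partial h_{k-1}} \right) \frac{\partial h_t}{\partial W_{hh}},$$
where the factors of the product are ordered from $k = s$ on the left to $k = t+1$ on the right, and the product of Jacobians is understood to be the identity when $s = t$. This makes the chain-rule structure explicit: loss at a later time $s$ flows backward through the output node, then through the chain of hidden-to-hidden transitions from time $s$ back to time $t$, and finally to the recurrent weights used at time $t$.
In computation-graph terms, $\partial L_s / \partial y_s$ is the gradient entering the output node at time $s$, and $\partial y_s / \partial h_s$ moves that gradient from the output node to the hidden state node at the same time step. The product then carries the gradient backward along the recurrent edges of the unrolled graph until it reaches hidden state $h_t$, and $\partial h_t / \partial W_{hh}$ tells us how the recurrent weight used at time $t$ changes that hidden state.
The inner sum over $s$ collects all losses downstream of the use of $W_{hh}$ at time $t$, while the outer sum over $t$ adds together the contribution from every time step where the shared recurrent weight appears in the unrolled network.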
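The sum-over-time structure of the gradient can be checked numerically on a scalar RNN, where every Jacobian is a single number. The sketch below is written under several simplifying assumptions that are not in the text: a scalar hidden state with recurrent weight `w`, input weight `u`, and bias `b`; an identity output (the output equals the hidden state); a squared-error loss at each step; and a zero initial state. It accumulates the gradient with the backward recursion and compares it against a central finite difference.

```python
import math

def forward(w, u, b, xs):
    """Scalar tanh RNN; returns [h_0, h_1, ..., h_T] with h_0 = 0."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x + b))
    return hs

def loss(w, u, b, xs, targets):
    """Sum of per-step squared errors, with the output taken to be h_t."""
    hs = forward(w, u, b, xs)
    return sum(0.5 * (h - y) ** 2 for h, y in zip(hs[1:], targets))

def bptt_grad_w(w, u, b, xs, targets):
    """Gradient of the loss with respect to the recurrent weight w."""
    hs = forward(w, u, b, xs)
    grad_w = 0.0
    carry = 0.0  # gradient arriving from step t+1; zero past the last step
    for t in range(len(xs), 0, -1):
        dh = (hs[t] - targets[t - 1]) + carry  # dL/dh_t: local loss + future losses
        da = dh * (1.0 - hs[t] ** 2)           # back through the tanh
        grad_w += da * hs[t - 1]               # local use of w at step t
        carry = da * w                         # pass back along the recurrent edge
    return grad_w

xs = [0.5, -1.0, 0.3, 0.8]
targets = [0.1, -0.2, 0.0, 0.4]
w, u, b = 0.7, 0.9, 0.05

analytic = bptt_grad_w(w, u, b, xs, targets)
eps = 1e-6
numeric = (loss(w + eps, u, b, xs, targets) - loss(w - eps, u, b, xs, targets)) / (2 * eps)
print(abs(analytic - numeric))  # should be tiny if the recursion is right
```

Each pass through the loop adds one term of the sum over time steps, and `carry` plays the role of the gradient flowing backward through the hidden-to-hidden transitions.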
Caution
Simple RNNs have several important limitations.
Vanishing gradients: when the recurrent dynamics repeatedly contract, the gradient reaching earlier time steps shrinks with every step it travels back, so earlier inputs have very little influence on the loss and long-range dependencies are not learned well.
Exploding gradients: when the recurrent dynamics repeatedly expand, gradients can become numerically unstable. Gradient clipping is a common practical fix.
Short effective memory: even though the hidden state is in principle a summary of the whole past, in practice a simple RNN usually only retains useful information over relatively short ranges.
Sequential computation: hidden state $h_t$ depends on $h_{t-1}$, so the time steps cannot be fully parallelized during training or inference.
State bottleneck: the entire past must be compressed into a fixed-size hidden vector, which can be restrictive for complex sequence tasks.
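Gradient clipping, mentioned above as the usual remedy for exploding gradients, is commonly done by rescaling all gradients together whenever their joint norm exceeds a threshold. The following is a minimal sketch of that idea in plain Python. The function name `clip_by_global_norm`, the list-of-lists gradient layout, and the threshold value are assumptions chosen for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total  # small gradients pass through unchanged
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total

# Hypothetical gradients for two parameter groups; their joint norm is 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
```

Rescaling every component by the same factor bounds the step size while keeping the direction of the update, which is why it is usually preferred over clipping each component separately.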
Common Uses
RNNs are most natural when inputs arrive in order and earlier observations should influence later predictions. Typical settings include:
sequence labeling, where an output is produced for each input element,
sequence classification, where the final hidden state is used to classify the whole sequence,
autoregressive generation, where the model predicts the next token or value from the past.
Long Short-Term Memory (LSTM)
Long short-term memory networks modify the recurrent update so that the model has an explicit memory cell and learned gates that control reading, writing, and forgetting. The main goal is to make long-range dependencies easier to learn by preserving a path through time along which information and gradients can flow more stably.
Definition (LSTM Layer)
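An LSTM maintains a hidden state $h_t$ and a cell state $c_t$. At each time step,
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)$$
$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)$$
$$\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$h_t = o_t \odot \tanh(c_t)$$
Here $\sigma$ is the sigmoid function and $\odot$ denotes elementwise multiplication.
Each gate has a distinct role:
Forget gate $f_t$ decides how much of the previous cell state to retain.
Input gate $i_t$ decides how much new information to write.
Candidate state $\tilde{c}_t$ proposes the new content to store.
Output gate $o_t$ decides how much of the cell state to expose as the hidden state.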
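One LSTM step, with the four roles listed above, might be sketched as follows in plain Python. The layout is an assumption made for the example: `params` maps the hypothetical keys `"f"`, `"i"`, `"c"`, and `"o"` to an (input matrix, hidden matrix, bias) triple for the forget gate, input gate, candidate state, and output gate, and the weight values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W_x, W_h, b, x, h):
    """Compute W_x x + W_h h + b for matrices given as lists of rows."""
    return [
        sum(w * v for w, v in zip(rx, x)) + sum(w * v for w, v in zip(rh, h)) + bi
        for rx, rh, bi in zip(W_x, W_h, b)
    ]

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; returns the new hidden state and the new cell state."""
    f = [sigmoid(z) for z in affine(*params["f"], x, h_prev)]          # forget gate
    i = [sigmoid(z) for z in affine(*params["i"], x, h_prev)]          # input gate
    c_tilde = [math.tanh(z) for z in affine(*params["c"], x, h_prev)]  # candidate
    o = [sigmoid(z) for z in affine(*params["o"], x, h_prev)]          # output gate
    # Additive cell update: keep part of the old cell, write part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]
    # Expose a gated, squashed view of the cell as the hidden state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def make_gate(bias):
    """Hypothetical toy weights: input size 2, hidden size 2."""
    W_x = [[0.1, -0.2], [0.3, 0.0]]
    W_h = [[0.0, 0.1], [-0.1, 0.2]]
    return W_x, W_h, [bias, bias]

params = {"f": make_gate(1.0), "i": make_gate(0.0),
          "c": make_gate(0.0), "o": make_gate(0.0)}
h, c = [0.0, 0.0], [0.0, 0.0]
for x in [[1.0, 0.5], [-0.5, 1.0], [0.0, 0.0]]:
    h, c = lstm_step(x, h, c, params)
```

The forget-gate bias of 1.0 in the toy parameters is a common initialization heuristic that biases the cell toward retaining its contents early in training; it is an assumption here, not something the definition requires.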
If the input dimension is $d$ and the hidden dimension is $n$, then each input-to-gate matrix has size $n \times d$, each hidden-to-gate matrix has size $n \times n$, and each bias has size $n$. Since there are four separate gate computations, an LSTM layer has $4(nd + n^2 + n)$ parameters, four times as many as the recurrent update of a simple RNN with the same input and hidden sizes.
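As a quick worked example of the parameter count, write `d` for the input dimension and `n` for the hidden dimension. The sketch below counts only the recurrent update (input matrix, hidden matrix, and bias) and leaves out any output layer; the sizes 128 and 256 are arbitrary illustrative values.

```python
def rnn_param_count(d, n):
    """Simple RNN update: one n x d matrix, one n x n matrix, one bias of size n."""
    return n * d + n * n + n

def lstm_param_count(d, n):
    """LSTM: four gate computations, each with its own two matrices and bias."""
    return 4 * (n * d + n * n + n)

d, n = 128, 256
print(rnn_param_count(d, n))   # 98560
print(lstm_param_count(d, n))  # 394240
```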
Why LSTMs Help
The key change is the additive update of the cell state,
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t.$$
This gives a more direct path through time than the repeated nonlinear transformation in a simple RNN. When the forget gate is near one and the input gate is near zero, the cell can preserve information for many steps with relatively little distortion. This makes long-term dependencies easier to represent and optimize.
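A small scalar experiment illustrates the difference in retention. It is a sketch under strong simplifications: no inputs, a simple RNN state updated by a tanh with a hypothetical recurrent weight of 0.9, and an LSTM cell whose forget and input gates are held constant at hypothetical values near one and near zero. In a trained LSTM the gates vary with the input and the hidden state.

```python
import math

steps = 100

# Simple RNN, scalar state, no input: the state is squashed and rescaled each step.
w = 0.9  # hypothetical recurrent weight with magnitude below one
h = 1.0
for _ in range(steps):
    h = math.tanh(w * h)

# LSTM cell state with the forget gate near one and the input gate near zero.
f, i = 0.999, 0.001  # hypothetical gate values, held fixed for illustration
c_tilde = 0.0        # candidate content; it barely matters while i is near zero
c = 1.0
for _ in range(steps):
    c = f * c + i * c_tilde

print(h, c)  # h has decayed close to zero, c is still close to its initial value
```

Since the magnitude of `tanh(w * h)` is at most `0.9 * abs(h)`, the simple RNN state shrinks at least geometrically, while the cell state loses only a factor of 0.999 per step.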
Caution
LSTMs improve on simple RNNs, but they do not remove every difficulty.
Higher computational cost: each step computes four gate-like transformations rather than one recurrent update.
More parameters: this increases memory use and the risk of overfitting when data are limited.
Sequential processing remains: the recurrence still prevents full parallelization across time.
Long-context limits still exist: LSTMs are better at retaining information, but very long sequences can still be difficult.
Training can still be delicate: careful initialization, optimization, and sometimes gradient clipping are still important.
Sequence Modeling Patterns
The same recurrent building block can be arranged in several common ways depending on the task.
Encoder Models
An encoder reads an input sequence and compresses it into a hidden representation, often the final hidden state or some function of all hidden states. These models are useful when the goal is to produce one output for the whole sequence.
Typical examples include:
sentiment classification from a sentence,
intent classification from an utterance,
forecasting from a fixed observation window.
Decoder Models
A decoder generates an output sequence one step at a time. At each step it conditions on its current recurrent state and, in autoregressive settings, on the previously generated output. These models are suitable for tasks where the output itself is sequential.
Typical examples include:
language modeling, where the next word is predicted from previous words,
sequence generation, where outputs are emitted until an end token is produced.
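The autoregressive loop described above, in which each generated token is fed back as the next input until an end token appears, can be sketched as follows. The function `generate`, the greedy choice of the highest-scoring token, the `max_len` cap, and the vocabulary are all assumptions for illustration. `toy_step` is a hypothetical stand-in for a trained decoder step, which in a real model would be an RNN or LSTM update followed by an output layer.

```python
def generate(step_fn, state, start_token, end_token, max_len=20):
    """Greedy autoregressive decoding; returns the generated token ids."""
    tokens = []
    token = start_token
    for _ in range(max_len):
        # One decoder step: condition on the current state and the previous token.
        state, scores = step_fn(state, token)
        token = max(range(len(scores)), key=scores.__getitem__)  # greedy choice
        if token == end_token:
            break
        tokens.append(token)
    return tokens

VOCAB = ["<s>", "a", "b", "c", "</s>"]

def toy_step(state, token):
    """Hypothetical decoder step: always favors the token after the previous one."""
    state = state + 1  # the "hidden state" here is just a step counter
    nxt = min(token + 1, len(VOCAB) - 1)
    scores = [1.0 if k == nxt else 0.0 for k in range(len(VOCAB))]
    return state, scores

out = generate(toy_step, state=0, start_token=0, end_token=4)
print([VOCAB[t] for t in out])  # ['a', 'b', 'c']
```

The `max_len` cap guards against a model that never emits the end token. Sampling from the scores instead of taking the maximum is a common alternative to the greedy choice used here.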
Encoder-Decoder Models
An encoder-decoder architecture first processes the input sequence with an encoder and then uses a decoder to produce a possibly different output sequence.
This is useful when:
the input and output lengths may differ,
the output should depend on the whole input sequence,
the task is naturally sequence-to-sequence.
Canonical applications include machine translation, summarization, and speech recognition. In the most basic form, the encoder’s final hidden state initializes the decoder. This creates a bottleneck because the entire input sequence must be summarized in a fixed-size vector, which becomes problematic for long or information-rich sequences.
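The basic encoder-decoder hand-off can be sketched with scalar states. This is only a toy illustration under stated assumptions: the weights are hypothetical fixed values, the decoder output is taken to be its hidden state, the decoder starts from an input of zero, and it runs for a fixed number of steps rather than until an end token.

```python
import math

def encode(xs, w_x=0.8, w_h=0.5):
    """Read the whole input and return the final hidden state as its summary."""
    h = 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
    return h  # fixed-size summary, whatever the input length

def decode(h, steps, w_y=0.8, w_h=0.5):
    """Generate `steps` outputs, starting from the encoder's summary."""
    ys, y = [], 0.0
    for _ in range(steps):
        h = math.tanh(w_y * y + w_h * h)  # condition on state and previous output
        y = h                             # assumed identity readout
        ys.append(y)
    return ys

summary = encode([1.0, -2.0, 1.0, 1.5, 0.3])  # five inputs
outputs = decode(summary, steps=3)            # three outputs
```

The input and output lengths are independent, and everything the decoder knows about the input passes through the single value `summary`, which is the bottleneck described above.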
Many-to-One, One-to-Many, and Many-to-Many
Another common way to categorize recurrent models is by the input/output layout:
Many-to-one: a full sequence is mapped to a single output, as in sequence classification.
One-to-many: a single latent representation or conditioning input produces a sequence, as in sequence generation.
Many-to-many: a sequence is mapped to a sequence, either with aligned outputs at each step or with separate encoder and decoder stages.
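The many-to-one layout and the aligned many-to-many layout differ only in which hidden states are read out. A minimal sketch, assuming a scalar tanh RNN with hypothetical fixed weights and a sign-based readout standing in for a learned classifier:

```python
import math

def run_rnn(xs, w_x=0.8, w_h=0.5):
    """Scalar tanh RNN with hypothetical fixed weights; returns every hidden state."""
    hs, h = [], 0.0
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        hs.append(h)
    return hs

def many_to_one(xs):
    """Sequence classification: read out only from the final hidden state."""
    return 1 if run_rnn(xs)[-1] > 0 else 0

def many_to_many(xs):
    """Sequence labeling: read out one label per input element."""
    return [1 if h > 0 else 0 for h in run_rnn(xs)]

xs = [1.0, -2.0, 1.0, 1.5]
print(many_to_one(xs))   # 1
print(many_to_many(xs))  # [1, 0, 1, 1]
```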
Sources
Elman, J. L. (1990). Finding structure in time.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks.