
ML Notes from MIT Lectures (Intro - RNN)

Introduction to Deep Learning:

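  • A perceptron is a single neuron: it takes a set of inputs and produces one output

  • Each input is multiplied by its weight, the products are summed, and a bias term $w_0$ is added (the bias can be viewed as the weight on a constant input of 1)

  • The non-linear activation function lets the network model non-linear data. With only linear operations the model could draw nothing but a linear decision boundary. A non-linearity such as the sigmoid also squashes the output into a range that can be read as the probability of one of two classes.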

Classification & Regression Models:

  • Classification is a specific task within supervised learning, where the goal is to assign data points to predefined classes or categories.

  • Regression is another task within supervised learning, focusing on predicting continuous numerical values. The learning rate is an optimization hyperparameter that controls how large each weight update is; overfitting is addressed separately, through regularization.

  • Supervised learning encompasses both classification and regression, as they involve learning from labeled data.

3 steps to compute the output of a perceptron:

  • Multiply each input by its corresponding weight

  • Add the products together (summation), along with the bias

  • Pass the sum through a non-linear activation function

  • The first two steps together form a dot product: the sum of the products of the inputs and their weights

The function for getting the output is:

$$ y=g(w_0+X^TW) $$


Perceptron usually refers to a single neuron (or a single layer of such neurons) in a simple neural network, whereas neuron is a more general term used in the context of deep learning, which encompasses multi-layered networks such as feedforward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). The linear transformation involves multiplying the inputs by their corresponding weights, summing them up, and adding a bias term. The activation function then determines the "firing" or activation level of the neuron based on the result of that linear transformation.
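As a minimal sketch of $y=g(w_0+X^TW)$, the three steps can be written in plain Python. The sigmoid is assumed here as the activation $g$, and the function names and input values are illustrative only; they are not from the lecture.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, one common choice for the non-linearity g."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x, w, w0, g=sigmoid):
    """Compute y = g(w0 + x . w) for a single perceptron."""
    # Steps 1 and 2: multiply each input by its weight, sum, and add the bias.
    z = w0 + sum(xi * wi for xi, wi in zip(x, w))
    # Step 3: apply the non-linear activation function.
    return g(z)

# Illustrative values (assumed for this example).
x = [1.0, 2.0]
w = [3.0, -2.0]
w0 = 1.0
y = perceptron(x, w, w0)  # z = 1 + 3*1 + (-2)*2 = 0, so y = sigmoid(0) = 0.5
```

Swapping `g` for another function (for example `math.tanh`) changes only the last step; the weighted sum stays the same.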

The neural network must be trained so that it doesn't repeat its mistakes, which are formally called its losses. To tell the network how wrong its predictions are, a loss function is defined. It measures the distance between the predicted output and the true output; that distance is the loss.

  • The loss function is also known as the objective function, cost function, or empirical risk

The softmax cross-entropy loss penalizes the model when it assigns a low probability to the true class or a high probability to incorrect classes. The goal is to minimize this loss function during the training process, which leads to better alignment between the predicted probabilities and the true class labels. Optimization algorithms, such as stochastic gradient descent (SGD) or its variants, are typically used to adjust the model's parameters in the direction that minimizes the softmax cross-entropy loss.
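To make the penalty concrete, here is a small sketch of softmax followed by cross-entropy for a single example. The logits and the helper names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, true_class):
    """Negative log of the probability assigned to the true class."""
    probs = softmax(logits)
    return -math.log(probs[true_class])

# The true class is 0 in both cases (illustrative logits).
confident_right = cross_entropy([4.0, 0.0, 0.0], true_class=0)
confident_wrong = cross_entropy([0.0, 4.0, 0.0], true_class=0)
```

The loss is small when most of the probability sits on the true class and grows as probability moves to an incorrect class.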

Minimizing the difference between the predicted outputs and the true outputs means minimizing a loss function such as the cross-entropy loss or, averaged over the dataset, the empirical loss. Training a deep learning model comes down to adjusting its weights. The process described in the illustrations above is forward propagation: passing the inputs through the weights to compute the output.

$J(W)$ is the loss function averaged over the whole dataset. $W = (W^0, W^1,...)$ is the list of all the weights in our neural network. The gradient of $J(W)$ tells us the direction in which the loss increases most steeply at a given point. Since we're trying to reach the lowest point in the loss landscape, we step in the opposite direction.

  • Gradients serve as a guide: The gradient guides the updates to the weights and biases of the neural network. By calculating the gradient, the network learns how the loss function changes as the weights are adjusted. The gradient points in the direction of steepest ascent, so its negative gives the direction of steepest descent, indicating how the weights should be updated to minimize the loss.

  • Minimizing the loss: The ultimate goal of training a neural network is to minimize the loss function. The loss function quantifies the discrepancy between the network's predictions and the true labels. By adjusting the weights based on the gradient information, the network iteratively updates its parameters in a way that reduces the loss and improves its predictive capabilities.

  • Learning from data: The relationship between the loss and the weights is what lets the network learn from data. By updating the weights based on the gradient, the network gradually adapts its parameters to capture patterns and relationships in the training data, with the aim of making accurate predictions on unseen examples.

  • Optimization algorithms: Optimization algorithms, such as stochastic gradient descent (SGD) and its variants, use the gradient to iteratively update the weights. The gradient determines the direction of each update, and the learning rate (fixed or adapted by the algorithm) determines the step size. By following the negative gradient, the network can navigate the loss landscape and converge to a set of weights that minimizes the loss function.

This process is repeated over and over: choose a point, find its gradient, and step in the opposite direction until convergence is reached (at a minimum, where the gradient is close to zero). The gradients indicate the direction and magnitude of the parameter adjustments needed to minimize the loss function on the training data. The gradient of the empirical loss function with respect to the weights quantifies how the loss changes as each weight is adjusted. By iteratively updating the weights in the opposite direction of the gradients, the network learns to improve its predictions on the training data.

This whole process is called gradient descent:

  1. Initialize the weights randomly (a starting point in the loss landscape)

  2. Loop until convergence (the lowest point):

  3. Compute the gradient, which is the partial derivative of the empirical loss with respect to the weights.

  4. Update the weights by taking a step in the opposite direction of the gradient. Steps 3 - 4 are repeated until the weights that minimize the loss are found.

  5. Return the weights, which are the learned parameters of the network.
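The steps above can be sketched on a toy problem. The one-weight loss $J(w)=(w-3)^2$, the learning rate, and the stopping tolerance are all assumptions chosen so the example is easy to follow; a real network has many weights and a far more complex loss.

```python
import random

def loss(w):
    """Toy loss with its minimum at w = 3 (assumed for illustration)."""
    return (w - 3.0) ** 2

def gradient(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(learning_rate=0.1, tolerance=1e-8, max_steps=10_000, seed=0):
    rng = random.Random(seed)
    w = rng.uniform(-10.0, 10.0)       # 1. initialize the weight randomly
    for _ in range(max_steps):         # 2. loop until convergence
        grad = gradient(w)             # 3. compute the gradient
        if abs(grad) < tolerance:      #    (treat a near-zero gradient as converged)
            break
        w = w - learning_rate * grad   # 4. step opposite to the gradient
    return w                           # 5. return the weight

w_star = gradient_descent()
```

After the loop, `w_star` sits very close to 3, the point where the toy loss is lowest.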

The purple expression is the derivative of $J(W)$ with respect to the output. The orange expression is the derivative of the output with respect to $z_1$. The blue expression is the derivative of $z_1$ with respect to $w_1$. Multiplying the three together (the chain rule) gives the derivative of the loss with respect to $w_1$. This process is known as backpropagation because it works backward from the output to the input. Backpropagation is a key algorithm used in training neural networks: it computes the gradients of the loss function with respect to the weights of the network, which makes it possible to adjust the weights between the input layer and the neurons in the hidden layers through gradient descent. This answers, “How much does a small change in a weight affect the loss function, does the loss increase or decrease, and what can we do to decrease it?” The learning rate sets how far we step in the direction opposite to the gradient in every iteration. The optimizer is the algorithm that uses the gradients and the learning rate to update the weights.

Once the gradients are calculated during the backward pass, they are used to update the weights during the optimization step, such as gradient descent. The weights are adjusted in the opposite direction of the gradients to minimize the loss and improve the network's performance. Backpropagation itself only computes the gradient of the empirical loss with respect to each weight; the optimizer then applies the update. The weights are not changed during the forward pass because that would disrupt the flow of information and lead to inconsistent predictions. The forward pass is solely responsible for computing the output based on the current weights, while the backward pass computes the gradients used to update the weights. The next forward pass then uses the updated weights, and the cycle repeats iteratively until the empirical loss is sufficiently low.
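As a rough sketch of the chain rule described here, consider a tiny network with one input, one hidden unit, and one output. The network shape ($z_1 = w_1 x$, output $= w_2 \cdot \text{sigmoid}(z_1)$), the squared-error loss, and all numeric values are assumptions for illustration. The analytic gradient is checked against a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2):
    """Forward pass: returns the hidden pre-activation z1 and the output."""
    z1 = w1 * x
    y_hat = w2 * sigmoid(z1)
    return z1, y_hat

def loss(y_hat, y):
    """Squared error between prediction and target (assumed loss)."""
    return (y_hat - y) ** 2

def grad_w1(x, y, w1, w2):
    """dJ/dw1 as a product of three local derivatives (chain rule)."""
    z1, y_hat = forward(x, w1, w2)
    dJ_dyhat = 2.0 * (y_hat - y)      # loss with respect to the output
    s = sigmoid(z1)
    dyhat_dz1 = w2 * s * (1.0 - s)    # output with respect to z1
    dz1_dw1 = x                       # z1 with respect to w1
    return dJ_dyhat * dyhat_dz1 * dz1_dw1

x, y, w1, w2 = 1.5, 1.0, 0.4, -0.7
analytic = grad_w1(x, y, w1, w2)

# Finite-difference check: nudge w1 slightly and see how the loss changes.
eps = 1e-6
numeric = (loss(forward(x, w1 + eps, w2)[1], y)
           - loss(forward(x, w1 - eps, w2)[1], y)) / (2 * eps)
```

The two values agree to several decimal places, which is the usual sanity check that a hand-written backward pass is correct.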

A small learning rate converges slowly and can get stuck in false local minima. A large learning rate overshoots, becomes unstable, and diverges.

The illustration on the left lists the adaptive learning rate algorithms (optimizers), their corresponding TensorFlow implementations, and references. The illustration on the right represents the optimization of the empirical loss.
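A quick way to see both failure modes is to run gradient descent on the toy loss $J(w)=w^2$ with different learning rates. The loss, the starting point, and the three rates are assumptions picked for illustration. Because this toy loss is convex, it shows slow convergence and divergence but not the local-minimum problem.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimize J(w) = w**2 (gradient 2*w) for a fixed number of steps."""
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
    return w

small = run_gradient_descent(0.01)  # barely moves toward 0 in 20 steps
good = run_gradient_descent(0.4)    # ends up very close to the minimum at 0
large = run_gradient_descent(1.1)   # overshoots more on every step and diverges
```

With a rate of 1.1 each update flips the sign of `w` and increases its magnitude, which is the overshooting behaviour described above.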

Because computing the gradient over the entire dataset is computationally intensive, the gradient is instead computed on a small batch of data points (a mini-batch). This is much faster than using the full dataset and gives a more accurate estimate of the true gradient than using a single data point.

Ideally, the patterns learned from the training data should generalize to the testing data, i.e. the output of the neural network should match the actual output on the testing data. Regularization is a technique introduced into the training pipeline to discourage overfitting. Overfitting means the model fits the training data too closely and performs worse on the testing data.

One of these methods is Regularization I: Dropout, which randomly drops (sets to 0) a certain percentage of the activations in a layer during training, so the network can't rely too heavily on any single node and becomes more adaptable. Fewer active nodes can also mean less computation per training step.

The second method is Regularization II: Early Stopping, which stops the training before the model has a chance to overfit. A portion of the data is held out from training and used as a validation set: when the loss on this held-out set stops improving and starts to rise, training is stopped.
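Dropout can be sketched as a random mask applied to a layer's activations. This version is the "inverted dropout" variant, which rescales the surviving activations so their expected value is unchanged. That rescaling is an assumption added here, not something stated in these notes, and the function name and values are illustrative.

```python
import random

def dropout(activations, drop_rate, rng, training=True):
    """Randomly zero a fraction of activations during training."""
    if not training or drop_rate == 0.0:
        return list(activations)  # at test time, all nodes are kept
    keep_prob = 1.0 - drop_rate
    # Keep each activation with probability keep_prob, scaled by 1/keep_prob.
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, drop_rate=0.5, rng=rng)
num_zeroed = sum(1 for v in dropped if v == 0.0)  # roughly half of the nodes
```

A different random mask is drawn on every call, so the network sees a different thinned version of itself at each training step.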
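One common way to implement early stopping is with a "patience" counter on the held-out loss. The patience rule, the function name, and the loss values below are assumptions for illustration; the notes do not specify a stopping criterion.

```python
def early_stopping(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a list of per-epoch held-out losses.

    Training stops once the held-out loss has failed to improve
    for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = None
    stop_epoch = None
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        stop_epoch = epoch
        if val_loss < best_loss:
            best_loss, best_epoch = val_loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best_epoch, stop_epoch

# Held-out loss falls, then rises as the model starts to overfit.
val_losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.8]
best_epoch, stop_epoch = early_stopping(val_losses)  # best at epoch 2, stop at epoch 4
```

In practice the weights saved at `best_epoch` are the ones kept, since later epochs only fit the training data more tightly.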

RNNs, Transformers, and Attention:

Sequential Data Modeling:

This includes more than one input or output…

  • 1st Figure: Many Inputs, One Output

  • 2nd Figure: One Input, Many Outputs

  • 3rd Figure: Many Inputs, Many Outputs

The illustration above is an example of sequential data modeling because it uses previous data from the sequence to generate an output. An example of the first diagram is sentiment classification; examples of the second are text generation and image captioning; and examples of the third are translation, forecasting, and music generation.

The following illustration considers each input at a single time step; the objective is to use the neural network to generate a single output corresponding to that input. All these models are copies of each other with different inputs at different time steps. $y_t=f(x_t)$ refers to the output, where $x_t$ is the input at time step $t$, transformed through the non-linear activation function to get the output. The problem with this model is that each copy is isolated: the output at one time step may depend on inputs from earlier time steps, but nothing connects the copies to each other.

To solve this problem, a recurrence relation is created through hidden state variables $h_0, h_1,...$ that carry information over from one time step to the next. The output is therefore a function of both the current input and the hidden state, which carries the memory and information from the previous time steps. This state is fed back into the neuron and iteratively updated over time, creating recurrent neural networks (RNNs). The state of an RNN is updated at each step as a sequence is processed.

RNNs can be mathematically described by the equation: $h_t=f_W(x_t, h_{t-1})$. The cell state at time $t$ is therefore a function, parameterized by the weights $W$, of the current input $x_t$ and the previous state $h_{t-1}$. Note: the same function and set of parameters (weights) are used at every time step.

In the illustration above, the RNN is defined as my_rnn, together with the hidden state that implements the recurrence relation. The objective is to predict the next word in the sentence, and each word is sent to the RNN along with the hidden state. This function produces a prediction for the output and an updated hidden state to be used for the next prediction.

The previous hidden state is multiplied by one weight matrix and the input is multiplied by another. Their sum is passed through the non-linearity, which produces the updated hidden state; the output vector is then computed from the hidden state using a third weight matrix. The output at a given time step can therefore be changed by modifying the hidden state. In total there are weights from the input to the hidden state, from the hidden state to the next time step, and from the hidden state to the output. The same weight matrices are re-used at every time step. The prediction at each time step yields a loss for that time step. To find the total empirical loss of the RNN, the losses of all time steps are computed and summed together.

The weight matrices of the layer are initialized, and the hidden state is initialized to 0. At each call, the hidden state is updated and the output is computed; then the current output and hidden state are returned. TensorFlow abstracts the RNN operation as tf.keras.layers.SimpleRNN(rnn_units).
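The following from-scratch sketch mirrors that description in plain Python rather than TensorFlow. The class name, the weight names (`W_xh`, `W_hh`, `W_hy`), the use of tanh as the non-linearity, and the fixed weight values are all assumptions for illustration; a real layer would initialize its weights randomly and learn them.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

class SimpleRNNCell:
    def __init__(self, W_xh, W_hh, W_hy):
        self.W_xh = W_xh  # input -> hidden weights
        self.W_hh = W_hh  # hidden -> hidden weights (the recurrence)
        self.W_hy = W_hy  # hidden -> output weights
        self.h = [0.0] * len(W_hh)  # hidden state starts at zero

    def step(self, x):
        # Update the hidden state from the previous state and the current input.
        pre = [a + b for a, b in zip(matvec(self.W_hh, self.h),
                                     matvec(self.W_xh, x))]
        self.h = [math.tanh(p) for p in pre]
        # Compute the output from the updated hidden state.
        y = matvec(self.W_hy, self.h)
        return y, self.h

cell = SimpleRNNCell(
    W_xh=[[0.5], [-0.3]],
    W_hh=[[0.1, 0.2], [0.0, 0.4]],
    W_hy=[[1.0, -1.0]],
)
# The same cell (same weights) is applied at every time step.
outputs = [cell.step([x_t])[0] for x_t in [1.0, 0.5, -1.0]]
```

Because `cell.h` persists between calls, each output depends on every earlier input in the sequence, not just the current one.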

Sequence Modeling: Design Criteria:

  1. Handle variable-length sequences: the RNN must be able to handle sequences of varying lengths.

  2. Track long-term dependencies: the RNN must be able to track patterns or dependencies that span many time steps.

  3. Maintain information about the order: the outputs are based on prior inputs, so the specific order of the observations must be maintained.

  4. Share parameters across the sequence: the same weight matrices are used at every time step in the RNN.

Design Criteria:

  1. Handle Variable-Length Sequences

The words of a given sentence must be converted into numerical inputs before they can be fed into the RNN. Embedding transforms indexes into input vectors, and there are two methods: one-hot embedding and learned embedding. The process for encoding language for a neural network is: 1. vocabulary (the raw words) 2. word to index 3. index to fixed-sized vector.

Embedding Process:

  1. Vocabulary - the preprocessed, raw strings

  2. Indexing - each vocabulary word is assigned an index for its position in a list (word 1 -> 1, word 2 -> 2, etc.). The assignment can be arbitrary and doesn't need to carry any meaning.

One-hot embedding method:

The indexes are then transformed into fixed-sized vectors by the one-hot embedding method. Each word is represented by an embedding vector whose length equals the size of the vocabulary: every entry is 0 except the entry at the word's index, which is 1.

Learned embedding method:

A neural network is trained to learn an embedding that captures some inherent meaning or semantics in the input data, so that related inputs are mapped closer together in the embedding space.
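A minimal sketch of the vocabulary, index, and one-hot steps follows. The three-word vocabulary and the helper names are made up for illustration, and indexes start at 0 here, as is conventional in Python.

```python
def build_index(vocabulary):
    """Assign each word an arbitrary but fixed index."""
    return {word: i for i, word in enumerate(vocabulary)}

def one_hot(word, word_to_index):
    """Vector of vocabulary size: 1 at the word's index, 0 elsewhere."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1
    return vector

vocabulary = ["cat", "dog", "walk"]      # illustrative vocabulary
word_to_index = build_index(vocabulary)  # {"cat": 0, "dog": 1, "walk": 2}
dog_vector = one_hot("dog", word_to_index)  # [0, 1, 0]
```

Every vector has the same length regardless of the word, which is what makes the encoding fixed-size.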

  2. Handle Long-Term Dependencies

  • The RNN should be able to track long-term dependencies and handle differences in order and length.

  • Long-term dependencies refer to information that appears early in a sequence (for instance, in the first few words) but only becomes relevant much later in the sequence. Order matters too: the same words in a different order can have a different meaning.

  3. Maintain information about the order

When applying backpropagation to RNNs, the weights are updated using the total loss from all the time steps, from $x_t$ back to $x_0$, where the subscript indexes the time step. This means the errors are backpropagated through every time step all the way back to the first input, which is why it's called backpropagation through time (BPTT). Unfortunately, this requires many repeated multiplications by the weight matrices. One resulting problem is exploding gradients, where the gradients become extremely large; the technique used to mitigate it is gradient clipping, which scales big gradients down. The other problem is vanishing gradients, where the gradients shrink extremely close to zero; the three solutions involve the choice of activation function, weight initialization, and changes to the network architecture.

To track long-term dependencies, one of three things can be used to mitigate the vanishing gradient problem. The first is changing the activation function to one that safeguards against shrinking gradients. With the ReLU function, the derivative is 1 for all $x>0$, so the gradient is not shrunk as it passes through active units. The second technique is weight initialization: the weights are initialized to the identity matrix and the biases to zero, which helps prevent the weights from shrinking to zero rapidly. The third technique is gated cells, which use gates to selectively add or remove information within each recurrent unit. Long short-term memory networks (LSTMs) rely on a gated cell to track information throughout many time steps. The gated cells selectively control the flow of information into the recurrent unit, maintaining the information that's relevant and removing the information that isn't.

Long Short-Term Memory networks (LSTMs) control information through four steps: 1.) forget 2.) store 3.) update and 4.) output. TensorFlow code: tf.keras.layers.LSTM(num_units). LSTMs capture long-term dependencies by maintaining a cell state that is separate from what's directly outputted. The cell state is updated by removing information that's irrelevant, storing information that is relevant, and updating the cell state; the result is then filtered into the output. The LSTM can be trained through BPTT, and its cell state allows for largely uninterrupted gradient flow, which mitigates the vanishing gradient problem. This lets the network propagate gradients back in time without significant loss, enabling effective training of LSTM architectures and the capture of long-term dependencies. Main concepts of LSTMs: a cell state kept separate from the output, gates that control the flow of information (forget, store, update, output), and training through BPTT with largely uninterrupted gradient flow.
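Gradient clipping can be sketched as rescaling a gradient vector whenever its norm exceeds a threshold. Clipping by the L2 norm is one common variant and is assumed here; the function name and the threshold are illustrative.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Scale the gradient down if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)  # small gradients pass through unchanged
    scale = max_norm / norm
    return [g * scale for g in gradient]

# A gradient with norm 50 is scaled down to norm 5 (approximately [3.0, 4.0]).
clipped = clip_by_norm([30.0, 40.0], max_norm=5.0)
```

The direction of the gradient is preserved; only its length is capped, so one very large gradient can't throw the weights far off course.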
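To make the forget, store, update, and output steps concrete, here is a sketch of one LSTM step with scalar states. It follows the standard LSTM equations. The scalar simplification, the parameter layout (`params` maps a gate name to an input weight, a hidden weight, and a bias), and the numbers are assumptions for illustration and say nothing about how tf.keras.layers.LSTM is implemented internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM; returns the new hidden and cell state."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))                # 1. forget: how much old cell state to keep
    i = sigmoid(pre("store"))                 # 2. store: how much new information to admit
    candidate = math.tanh(pre("candidate"))   #    the new information itself
    c = f * c_prev + i * candidate            # 3. update the cell state
    o = sigmoid(pre("output"))                # 4. output: filter the cell state
    h = o * math.tanh(c)
    return h, c

# Gates set to "keep everything, store nothing": the cell state passes through.
params = {
    "forget": (0.0, 0.0, 100.0),
    "store": (0.0, 0.0, -100.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=0.75, params=params)
```

With the forget gate near 1 and the store gate near 0, the cell state is carried forward almost unchanged, which is the path that lets gradients flow back through many time steps.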

Limitation of RNNs:

  1. Encoding bottleneck - all of the sequential data fed into the RNN time step by time step must be compressed into a fixed-size state, so some information can be lost before it is learned from.

  2. Slow, no parallelization - each time step depends on the previous one, so the computation can't be spread across parallel hardware to accelerate the training or inference process.

  3. Not long memory - the fixed-size state of the RNN or LSTM limits how much information it can retain, so very long sequences are hard to handle.

Goal of Sequence Modeling (RNNs):

Given a sequence of inputs over time steps, each represented as a feature vector, the model predicts an output. RNNs use the notion of recurrence to maintain the order of information through the time steps.

Desired Capabilities:

  1. Continuous Stream

  2. Parallelization

  3. Long-memory

Attention is a powerful mechanism that makes it possible to model sequences without recurrence. It identifies the most important parts of the input to attend to and extracts the features with high attention.

Example: Search

  • The query is the search input and the keys describe the items that can be returned (for example, their titles). The similarity between the query and each key is computed and the most relevant key is selected. The item attached to that key is called the value.

The goal of self-attention in neural networks is to identify and attend to the most important features in the input:

  1. Encode position information - this captures the order of the inputs in a sequence and is related to the embedding method

  2. Extract query, key, value for search - the positional embedding is passed through three separate neural network layers: one generates the query, another generates the key, and another generates the value. The purpose is to compute the pairwise similarity between each query and key. The queries and keys can be computed as vectors or as matrices.

  3. Compute attention weighting - this defines the relationship between the components of the sequential data: $\text{softmax}\left(\frac{Q \cdot K^T}{\text{scaling}}\right)$. The softmax function compresses the scores to values between 0 and 1.

  4. Extract features with high attention - the attention weighting matrix is multiplied by the values, so the output is dominated by the values whose keys are most similar to the query. This is a matrix multiplication (a set of dot products).
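Steps 3 and 4 can be sketched in plain Python. Here `Q`, `K`, and `V` are given directly as small lists of rows rather than produced by learned layers, and the scaling factor is taken to be the square root of the key dimension. Both choices are assumptions for illustration, as are the example values.

```python
import math

def softmax(row):
    """Compress a row of scores into weights between 0 and 1 that sum to 1."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (output, weights) for queries Q, keys K, and values V."""
    scale = math.sqrt(len(K[0]))  # assumed scaling: sqrt of the key dimension
    # Pairwise similarity between every query and every key.
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
              for q in Q]
    weights = [softmax(row) for row in scores]
    # Weighted sum of the values for each query.
    output = [[sum(w * v[j] for w, v in zip(row, V))
               for j in range(len(V[0]))]
              for row in weights]
    return output, weights

Q = [[10.0, 0.0]]             # one query, strongly aligned with the first key
K = [[1.0, 0.0], [0.0, 1.0]]  # two keys
V = [[1.0, 2.0], [3.0, 4.0]]  # the value attached to each key
output, weights = attention(Q, K, V)
```

Because the query matches the first key far better than the second, nearly all of the attention weight goes to the first value and the output lands close to `[1.0, 2.0]`.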

The goal of self-attention is to eliminate recurrence and attend to the most important features in the input data. First, the input data is given a positional encoding so that its order is preserved. Second, three neural network layers transform the positional encoding into the three matrices (key, query, and value), from which the self-attention weighting is computed. Third, the attention weighting is applied to the values to extract the important features.

This architecture defines a single self-attention head. Multiple heads can be combined to create a richer network, with each self-attention head specializing in different features of the input.


  1. RNNs are well suited for sequence modeling tasks

  2. Model sequences via a recurrence relation

  3. Training RNNs with backpropagation through time

  4. Models for music generation, classification, machine translation, and more

  5. Self-attention to model sequences without recurrence

