In this notebook, we will have a basic introduction to PyTorch and work on a toy NLP task. The following resources have been used in the preparation of this notebook:
Many thanks to Angelica Sun and John Hewitt for their feedback.
PyTorch is a machine learning framework that is used in both academia and industry for various applications. PyTorch started off as a more flexible alternative to TensorFlow, another popular machine learning framework. At the time of its release, PyTorch appealed to users because of its user-friendly design: as opposed to defining static graphs before performing an operation, as in TensorFlow, PyTorch allows users to define their operations as they go, an approach that TensorFlow also adopted in its later releases. Although TensorFlow is more widely used in industry, PyTorch is often the preferred machine learning framework for researchers. If you would like to learn more about the differences between the two, you can check out this blog post.
Now that we have learned enough about the background of PyTorch, let's start by importing it into our notebook. To install PyTorch, you can follow the instructions here. Alternatively, you can open this notebook using Google Colab, which already has PyTorch installed in its base kernel. Once you are done with the installation process, run the following cell:
import torch
import torch.nn as nn
# Import pprint, module we use for making our print statements prettier
import pprint
pp = pprint.PrettyPrinter()
We are all set to start our tutorial. Let's dive in!
Tensors are the most basic building blocks in PyTorch. Tensors are similar to matrices, but they have extra properties and can represent higher dimensions. For example, a square image with 256 pixels on each side can be represented by a 3x256x256 tensor, where the first dimension of size 3 represents the red, green, and blue color channels.
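For instance, here is a quick sketch (not one of the original cells) of an image-shaped tensor filled with random values, using torch.rand, which we will see shortly:
# A hypothetical RGB image: 3 color channels, each 256x256 pixels
image = torch.rand(3, 256, 256)
image.shape
torch.Size([3, 256, 256])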
There are several ways to instantiate tensors in PyTorch
, which we will go through next.
We can initialize a tensor from a Python list, which can include sublists. The dimensions and the data types will be automatically inferred by PyTorch when we use torch.tensor().
# Initialize a tensor from a Python List
data = [
[0, 1],
[2, 3],
[4, 5]
]
x_python = torch.tensor(data)
# Print the tensor
x_python
tensor([[0, 1], [2, 3], [4, 5]])
We can also call torch.tensor()
with the optional dtype
parameter, which will set the data type. Some useful datatypes to be familiar with are: torch.bool
, torch.float
, and torch.long
.
# We are using the dtype to create a tensor of particular type
x_float = torch.tensor(data, dtype=torch.float)
x_float
tensor([[0., 1.], [2., 3.], [4., 5.]])
# We are using the dtype to create a tensor of particular type
x_bool = torch.tensor(data, dtype=torch.bool)
x_bool
tensor([[False, True], [ True, True], [ True, True]])
We can also get the same tensor in our specified data type using methods such as float()
, long()
etc.
x_python.float()
tensor([[0., 1.], [2., 3.], [4., 5.]])
We can also use the torch.FloatTensor, torch.LongTensor, and torch.Tensor classes to instantiate a tensor of a particular type. LongTensors are particularly important in NLP as many methods that deal with indices require the indices to be passed as a LongTensor, which holds 64-bit integers.
# `torch.Tensor` defaults to float
# Same as torch.FloatTensor(data)
x = torch.Tensor(data)
x
tensor([[0., 1.], [2., 3.], [4., 5.]])
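For example, here is a small sketch (not part of the original cells) of creating an integer index tensor of the kind NLP lookup operations expect:
# Indices must typically be 64-bit integers (torch.long / torch.int64)
indices = torch.LongTensor([0, 2, 4])  # same as torch.tensor([0, 2, 4], dtype=torch.long)
indices.dtype
torch.int64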
We can also initialize a tensor from a NumPy
array.
import numpy as np
# Initialize a tensor from a NumPy array
ndarray = np.array(data)
x_numpy = torch.from_numpy(ndarray)
# Print the tensor
x_numpy
tensor([[0, 1], [2, 3], [4, 5]])
We can also initialize a tensor from another tensor, using the following methods:
torch.ones_like(old_tensor): Initializes a tensor of 1s.
torch.zeros_like(old_tensor): Initializes a tensor of 0s.
torch.rand_like(old_tensor): Initializes a tensor where all the elements are sampled from a uniform distribution between 0 and 1.
torch.randn_like(old_tensor): Initializes a tensor where all the elements are sampled from a normal distribution.
All of these methods preserve the tensor properties of the original tensor passed in, such as the shape and device, which we will cover in a bit.
# Initialize a base tensor
x = torch.tensor([[1., 2], [3, 4]])
x
tensor([[1., 2.], [3., 4.]])
# Initialize a tensor of 0s
x_zeros = torch.zeros_like(x)
x_zeros
tensor([[0., 0.], [0., 0.]])
# Initialize a tensor of 1s
x_ones = torch.ones_like(x)
x_ones
tensor([[1., 1.], [1., 1.]])
# Initialize a tensor where each element is sampled from a uniform distribution
# between 0 and 1
x_rand = torch.rand_like(x)
x_rand
tensor([[0.8979, 0.7173], [0.3067, 0.1246]])
# Initialize a tensor where each element is sampled from a normal distribution
x_randn = torch.randn_like(x)
x_randn
tensor([[-0.6749, -0.8590], [ 0.6666, 1.1185]])
We can also instantiate tensors by specifying their shapes (which we will cover in more detail in a bit). The methods we could use follow the ones in the previous section:
torch.zeros()
torch.ones()
torch.rand()
torch.randn()
# Initialize a 4x2x2 tensor of 0s
shape = (4, 2, 2)
x_zeros = torch.zeros(shape) # x_zeros = torch.zeros(4, 2, 2) is an alternative
x_zeros
tensor([[[0., 0.], [0., 0.]], [[0., 0.], [0., 0.]], [[0., 0.], [0., 0.]], [[0., 0.], [0., 0.]]])
torch.arange()
We can also create a tensor with torch.arange(end), which returns a 1-D tensor with elements ranging from 0 to end-1. We can use the optional start and step parameters to create tensors with different ranges.
# Create a tensor with values 0-9
x = torch.arange(10)
x
tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
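As a brief sketch of the optional parameters (not in the original notebook):
# Create a tensor with values 2, 4, 6, 8 using start=2, end=10, step=2
torch.arange(2, 10, 2)
tensor([2, 4, 6, 8])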
Tensors have a few properties that are important for us to cover. These are namely the dtype, shape, and device properties.
The dtype
property lets us see the data type of a tensor.
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.ones(3, 2)
x.dtype
torch.float32
The shape property tells us the shape of our tensor. This can help us identify how many dimensions our tensor has as well as how many elements exist in each dimension.
# Initialize a 3x2 tensor, with 3 rows and 2 columns
x = torch.Tensor([[1, 2], [3, 4], [5, 6]])
x
tensor([[1., 2.], [3., 4.], [5., 6.]])
# Print out its shape
# Same as x.size()
x.shape
torch.Size([3, 2])
# Print out the number of elements in a particular dimension
# 0th dimension corresponds to the rows
x.shape[0]
3
We can also get the size of a particular dimension with the size()
method.
# Get the size of the 0th dimension
x.size(0)
3
We can change the shape of a tensor with the view()
method.
# Example use of view()
# x_view shares the same memory as x, so changing one changes the other
x_view = x.view(2, 3)
x_view
tensor([[1., 2., 3.], [4., 5., 6.]])
# We can ask PyTorch to infer the size of a dimension with -1
x_view = x.view(3, -1)
x_view
tensor([[1., 2.], [3., 4.], [5., 6.]])
We can also use the torch.reshape() method for a similar purpose. There is a subtle difference between reshape() and view(): view() requires the data to be stored contiguously in memory. You can refer to this StackOverflow answer for more information. In simple terms, contiguous means that the way our data is laid out in memory is the same as the way we would read elements from it. This happens because some methods, such as transpose() and view(), do not actually change how our data is stored in memory. They just change the meta information about our tensor, so that when we use it we will see the elements in the order we expect.
reshape() calls view() internally if the data is stored contiguously; if not, it returns a copy. The difference here isn't too important for basic tensors, but if you perform operations that make the underlying storage of the data non-contiguous (such as taking a transpose), you will have issues using view(). If you would like to match the way your tensor is stored in memory to how it is used, you can use the contiguous() method.
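As a small illustrative sketch (not one of the original cells), here is how view() behaves on a non-contiguous tensor:
# Taking a transpose makes the underlying storage non-contiguous
a = torch.arange(6).view(2, 3)
a_t = a.T
a_t.is_contiguous()   # Returns False
# a_t.view(6) would raise a RuntimeError because the storage is non-contiguous
# reshape() still works because it silently makes a copy
a_t.reshape(6)
# Calling contiguous() first lets us use view() again
a_t.contiguous().view(6)
tensor([0, 3, 1, 4, 2, 5])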
# Change the shape of x to be 2x3
# x_reshaped could be a reference to or a copy of x
x_reshaped = torch.reshape(x, (2, 3))
x_reshaped
tensor([[1., 2., 3.], [4., 5., 6.]])
We can use the torch.unsqueeze(x, dim) function to add a dimension of size 1 at the provided dim, where x is the tensor. We can also use the corresponding torch.squeeze(x), which removes the dimensions of size 1.
# Initialize a 5x2 tensor, with 5 rows and 2 columns
x = torch.arange(10).reshape(5, 2)
x
tensor([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])
# Add a new dimension of size 1 at the 1st dimension
x = x.unsqueeze(1)
x.shape
torch.Size([5, 1, 2])
# Squeeze the dimensions of x by getting rid of all the dimensions with 1 element
x = x.squeeze()
x.shape
torch.Size([5, 2])
If we want to get the total number of elements in a tensor, we can use the numel()
method.
x
tensor([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]])
# Get the number of elements in tensor.
x.numel()
10
The device property tells us where a tensor is stored. Where a tensor is stored determines which device, GPU or CPU, will handle the computations involving it. We can find the device of a tensor with the device property.
# Initialize an example tensor
x = torch.Tensor([[1, 2], [3, 4]])
x
tensor([[1., 2.], [3., 4.]])
# Get the device of the tensor
x.device
device(type='cpu')
We can move a tensor from one device to another with the method to(device)
.
# Check if a GPU is available; if so, move the tensor to the GPU
# Note that to() returns a new tensor, so we reassign x
if torch.cuda.is_available():
    x = x.to('cuda')
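A common pattern (a sketch, not from the original notebook) is to pick the device once and reuse it everywhere:
# Choose the device once, then move tensors (and later, models) onto it
device = 'cuda' if torch.cuda.is_available() else 'cpu'
x = x.to(device)
x.device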
In PyTorch
we can index tensors, similar to NumPy
.
# Initialize an example tensor
x = torch.Tensor([
[[1, 2], [3, 4]],
[[5, 6], [7, 8]],
[[9, 10], [11, 12]]
])
x
tensor([[[ 1., 2.], [ 3., 4.]], [[ 5., 6.], [ 7., 8.]], [[ 9., 10.], [11., 12.]]])
x.shape
torch.Size([3, 2, 2])
# Access the 0th element, which is the first row
x[0] # Equivalent to x[0, :]
tensor([[1., 2.], [3., 4.]])
We can also index into multiple dimensions with :
.
# Get the top left element of each element in our tensor
x[:, 0, 0]
tensor([1., 5., 9.])
We can also access arbitrary elements in each dimension.
# Print x again to see our tensor
x
tensor([[[ 1., 2.], [ 3., 4.]], [[ 5., 6.], [ 7., 8.]], [[ 9., 10.], [11., 12.]]])
# Let's access the 0th and 1st elements, each twice
i = torch.tensor([0, 0, 1, 1])
x[i]
tensor([[[1., 2.], [3., 4.]], [[1., 2.], [3., 4.]], [[5., 6.], [7., 8.]], [[5., 6.], [7., 8.]]])
# Let's access the 0th elements of the 1st and 2nd elements
i = torch.tensor([1, 2])
j = torch.tensor([0])
x[i, j]
tensor([[ 5., 6.], [ 9., 10.]])
We can get a Python
scalar value from a tensor with item()
.
x[0, 0, 0]
tensor(1.)
x[0, 0, 0].item()
1.0
PyTorch operations are very similar to those of NumPy
. We can work with both scalars and other tensors.
# Create an example tensor
x = torch.ones((3,2,2))
x
tensor([[[1., 1.], [1., 1.]], [[1., 1.], [1., 1.]], [[1., 1.], [1., 1.]]])
# Perform elementwise addition
# Use - for subtraction
x + 2
tensor([[[3., 3.], [3., 3.]], [[3., 3.], [3., 3.]], [[3., 3.], [3., 3.]]])
# Perform elementwise multiplication
# Use / for division
x * 2
tensor([[[2., 2.], [2., 2.]], [[2., 2.], [2., 2.]], [[2., 2.], [2., 2.]]])
We can apply the same operations between different tensors of compatible sizes.
# Create a 4x3 tensor of 6s
a = torch.ones((4,3)) * 6
a
tensor([[6., 6., 6.], [6., 6., 6.], [6., 6., 6.], [6., 6., 6.]])
# Create a 1D tensor of 2s
b = torch.ones(3) * 2
b
tensor([2., 2., 2.])
# Divide a by b
a / b
tensor([[3., 3., 3.], [3., 3., 3.], [3., 3., 3.], [3., 3., 3.]])
We can use tensor.matmul(other_tensor)
for matrix multiplication and tensor.T
for transpose. Matrix multiplication can also be performed with @
.
# Alternative to a.matmul(b)
# a @ b.T returns the same result, since transposing a 1D tensor has no effect
a @ b
tensor([36., 36., 36., 36.])
pp.pprint(a.shape)
pp.pprint(a.T.shape)
torch.Size([4, 3]) torch.Size([3, 4])
We can take the mean and standard deviation along a certain dimension with the methods mean(dim)
and std(dim)
. That is, if we want to get the mean 3x2
matrix in a 4x3x2
matrix, we would set the dim
to be 0. We can call these methods with no parameters to get the mean and standard deviation for the whole tensor. To use mean and std, our tensor should be of a floating point type.
# Create an example tensor
m = torch.tensor(
[
[1., 1.],
[2., 2.],
[3., 3.],
[4., 4.]
]
)
pp.pprint("Mean: {}".format(m.mean()))
pp.pprint("Mean in the 0th dimension: {}".format(m.mean(0)))
pp.pprint("Mean in the 1st dimension: {}".format(m.mean(1)))
'Mean: 2.5' 'Mean in the 0th dimension: tensor([2.5000, 2.5000])' 'Mean in the 1st dimension: tensor([1., 2., 3., 4.])'
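As a quick sketch (not in the original notebook), std() can be called in exactly the same way:
pp.pprint("Std: {}".format(m.std()))
pp.pprint("Std in the 0th dimension: {}".format(m.std(0)))
# Each row of m is constant, so the standard deviation in the 1st dimension is all zeros
pp.pprint("Std in the 1st dimension: {}".format(m.std(1)))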
We can concatenate tensors using torch.cat
.
# Concatenate in dimension 0 and 1
a_cat0 = torch.cat([a, a, a], dim=0)
a_cat1 = torch.cat([a, a, a], dim=1)
print("Initial shape: {}".format(a.shape))
print("Shape after concatenation in dimension 0: {}".format(a_cat0.shape))
print("Shape after concatenation in dimension 1: {}".format(a_cat1.shape))
Initial shape: torch.Size([4, 3]) Shape after concatenation in dimension 0: torch.Size([12, 3]) Shape after concatenation in dimension 1: torch.Size([4, 9])
Most of the operations in PyTorch are not in place. However, PyTorch also offers in-place versions of these operations, which are named by adding an underscore (_) at the end of the method name.
# Print our tensor
a
tensor([[6., 6., 6.], [6., 6., 6.], [6., 6., 6.], [6., 6., 6.]])
# add() is not in place
a.add(a)
a
tensor([[6., 6., 6.], [6., 6., 6.], [6., 6., 6.], [6., 6., 6.]])
# add_() is in place
a.add_(a)
a
tensor([[48., 48., 48.], [48., 48., 48.], [48., 48., 48.], [48., 48., 48.]])
PyTorch and other machine learning libraries are known for their automatic differentiation feature. That is, given that we have defined the set of operations that need to be performed, the framework itself can figure out how to compute the gradients. We can call the backward() method to ask PyTorch to calculate the gradients, which are then stored in the grad attribute.
# Create an example tensor
# requires_grad parameter tells PyTorch to store gradients
x = torch.tensor([2.], requires_grad=True)
# Print the gradient if it is calculated
# Currently None since we haven't called backward() yet
pp.pprint(x.grad)
None
# Calculating the gradient of y with respect to x
y = x * x * 3 # 3x^2
y.backward()
pp.pprint(x.grad) # d(y)/d(x) = d(3x^2)/d(x) = 6x = 12
tensor([12.])
Let's run backprop from a different tensor again to see what happens.
z = x * x * 3 # 3x^2
z.backward()
pp.pprint(x.grad)
tensor([48.])
We can see that the x.grad
is updated to be the sum of the gradients calculated so far. When we run backprop in a neural network, we sum up all the gradients for a particular neuron before making an update. This is exactly what is happening here! This is also the reason why we need to run zero_grad()
in every training iteration (more on this later). Otherwise our gradients would keep building up from one training iteration to the other, which would cause our updates to be wrong.
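As a small sketch of this behavior (not one of the original cells), we can clear the accumulated gradient on our tensor directly with grad.zero_() and run backprop once more:
# Reset the accumulated gradient before computing a fresh one
x.grad.zero_()
w = x * x * 3  # 3x^2
w.backward()
pp.pprint(x.grad)  # Back to d(3x^2)/dx = 6x = 12
tensor([12.])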
So far we have looked into tensors, their properties, and basic operations on tensors. These are especially useful to get familiar with if we are building the layers of our network from scratch. We will utilize these in Assignment 3, but moving forward, we will use the predefined blocks in the torch.nn module of PyTorch. We will then put together these blocks to create complex networks. Let's start by importing this module with an alias so that we don't have to type torch.nn every time we use it.
import torch.nn as nn
We can use nn.Linear(H_in, H_out) to create a linear layer. This will take a matrix of (N, *, H_in) dimensions and output a matrix of (N, *, H_out) dimensions. The * denotes that there could be an arbitrary number of dimensions in between. The linear layer performs the operation Ax+b, where A and b are initialized randomly. If we don't want the linear layer to learn the bias parameters, we can initialize our layer with bias=False.
# Create the inputs
input = torch.ones(2,3,4)
# Make a linear layer transforming (N, *, H_in) dimensional inputs to (N, *, H_out)
# dimensional outputs
linear = nn.Linear(4, 2)
linear_output = linear(input)
linear_output
tensor([[[-0.0935, 0.6382], [-0.0935, 0.6382], [-0.0935, 0.6382]], [[-0.0935, 0.6382], [-0.0935, 0.6382], [-0.0935, 0.6382]]], grad_fn=<AddBackward0>)
list(linear.parameters()) # Ax + b
[Parameter containing: tensor([[-0.2491, 0.2283, 0.2765, -0.4489], [ 0.3642, 0.0685, -0.3154, 0.2699]], requires_grad=True), Parameter containing: tensor([0.0997, 0.2510], requires_grad=True)]
There are several other preconfigured layers in the nn
module. Some commonly used examples are nn.Conv2d
, nn.ConvTranspose2d
, nn.BatchNorm1d
, nn.BatchNorm2d
, nn.Upsample
and nn.MaxPool2d
among many others. We will learn more about these as we progress in the course. For now, the only important thing to remember is that we can treat each of these layers as plug and play components: we will be providing the required dimensions and PyTorch
will take care of setting them up.
We can also use the nn module to apply activation functions to our tensors. Activation functions are used to add non-linearity to our network. Some examples of activation functions are nn.ReLU(), nn.Sigmoid(), and nn.LeakyReLU(). Activation functions operate on each element separately, so the shape of the tensors we get as an output is the same as the one we pass in.
linear_output
tensor([[[-0.0935, 0.6382], [-0.0935, 0.6382], [-0.0935, 0.6382]], [[-0.0935, 0.6382], [-0.0935, 0.6382], [-0.0935, 0.6382]]], grad_fn=<AddBackward0>)
sigmoid = nn.Sigmoid()
output = sigmoid(linear_output)
output
tensor([[[0.4766, 0.6543], [0.4766, 0.6543], [0.4766, 0.6543]], [[0.4766, 0.6543], [0.4766, 0.6543], [0.4766, 0.6543]]], grad_fn=<SigmoidBackward>)
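As another quick sketch (not part of the original cells), nn.ReLU() zeroes out the negative entries of linear_output while leaving its shape unchanged:
relu = nn.ReLU()
relu_output = relu(linear_output)
# The shape is preserved; only the negative values are replaced with 0
relu_output.shape
torch.Size([2, 3, 2])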
So far we have seen that we can create layers and pass the output of one as the input of the next. Instead of creating intermediate tensors and passing them around, we can use nn.Sequential, which does exactly that.
block = nn.Sequential(
nn.Linear(4, 2),
nn.Sigmoid()
)
input = torch.ones(2,3,4)
output = block(input)
output
tensor([[[0.3116, 0.8282], [0.3116, 0.8282], [0.3116, 0.8282]], [[0.3116, 0.8282], [0.3116, 0.8282], [0.3116, 0.8282]]], grad_fn=<SigmoidBackward>)
Instead of using the predefined modules, we can also build our own by extending the nn.Module class. For example, we can build the nn.Linear (which also extends nn.Module) on our own using the tensors introduced earlier! We can also build new, more complex modules, such as a custom neural network. You will be practicing these in the later assignments.
To create a custom module, the first thing we have to do is to extend nn.Module. We can then initialize our parameters in the __init__ function, starting with a call to the __init__ function of the super class. All the class attributes we define that are nn module objects are treated as submodules, whose parameters can be learned during training. Tensors are not parameters, but they can be turned into parameters if they are wrapped in the nn.Parameter class.
All classes extending nn.Module are also expected to implement a forward(x) function, where x is a tensor. This is the function that is called when an input is passed to our module, as in model(x).
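As an illustration, here is a minimal sketch (not from the original notebook) of a hand-rolled linear layer: the weights are registered with nn.Parameter and the computation happens in forward().
class CustomLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super(CustomLinear, self).__init__()
        # Wrapping tensors in nn.Parameter registers them as learnable parameters
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # The same Ax + b style operation that nn.Linear performs
        return x @ self.weight.T + self.bias

custom = CustomLinear(4, 2)
custom(torch.ones(3, 4)).shape
torch.Size([3, 2])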
class MultilayerPerceptron(nn.Module):
def __init__(self, input_size, hidden_size):
# Call to the __init__ function of the super class
super(MultilayerPerceptron, self).__init__()
# Bookkeeping: Saving the initialization parameters
self.input_size = input_size
self.hidden_size = hidden_size
# Define our model
# There isn't anything specific about the naming of `self.model`; it could
# be something arbitrary.
self.model = nn.Sequential(
nn.Linear(self.input_size, self.hidden_size),
nn.ReLU(),
nn.Linear(self.hidden_size, self.input_size),
nn.Sigmoid()
)
def forward(self, x):
output = self.model(x)
return output
Here is an alternative way to define the same class. You can see that we can replace nn.Sequential by defining the individual layers in the __init__ method and connecting them in the forward method.
class MultilayerPerceptron(nn.Module):
def __init__(self, input_size, hidden_size):
# Call to the __init__ function of the super class
super(MultilayerPerceptron, self).__init__()
# Bookkeeping: Saving the initialization parameters
self.input_size = input_size
self.hidden_size = hidden_size
# Define our layers
self.linear = nn.Linear(self.input_size, self.hidden_size)
self.relu = nn.ReLU()
self.linear2 = nn.Linear(self.hidden_size, self.input_size)
self.sigmoid = nn.Sigmoid()
def forward(self, x):
linear = self.linear(x)
relu = self.relu(linear)
linear2 = self.linear2(relu)
output = self.sigmoid(linear2)
return output
Now that we have defined our class, we can instantiate it and see what it does.
# Make a sample input
input = torch.randn(2, 5)
# Create our model
model = MultilayerPerceptron(5, 3)
# Pass our input through our model
model(input)
tensor([[0.6960, 0.5888, 0.6302, 0.5337, 0.6120], [0.6787, 0.5964, 0.6672, 0.4974, 0.6041]], grad_fn=<SigmoidBackward>)
We can inspect the parameters of our model with named_parameters()
and parameters()
methods.
list(model.named_parameters())
[('linear.weight', Parameter containing: tensor([[-0.0094, -0.3072, 0.2230, 0.0499, -0.0917], [ 0.0116, -0.2261, -0.4170, -0.1688, 0.2925], [ 0.4049, 0.2189, 0.1391, 0.2115, -0.3926]], requires_grad=True)), ('linear.bias', Parameter containing: tensor([0.1696, 0.2785, 0.3635], requires_grad=True)), ('linear2.weight', Parameter containing: tensor([[ 0.4921, 0.5605, 0.5188], [ 0.4088, 0.4430, 0.0042], [-0.2919, 0.2893, -0.4794], [ 0.4321, -0.1348, 0.4558], [-0.4387, 0.2400, 0.3511]], requires_grad=True)), ('linear2.bias', Parameter containing: tensor([ 0.2369, -0.0131, 0.4319, 0.1126, 0.2039], requires_grad=True))]
We have shown how gradients are calculated with the backward() function. Having the gradients isn't enough for our models to learn, however; we also need to know how to update the parameters of our models. This is where optimizers come in. The torch.optim module contains several optimizers that we can use. Some popular examples are optim.SGD and optim.Adam. When initializing an optimizer, we pass in our model parameters, which can be accessed with model.parameters(), telling the optimizer which values it will be optimizing. Optimizers also have a learning rate (lr) parameter, which determines how big of an update will be made in every step. Different optimizers have different hyperparameters as well.
import torch.optim as optim
After we have our optimizer, we can define a loss that we want to optimize for. We can either define the loss ourselves or use one of the predefined loss functions in PyTorch, such as nn.BCELoss(). Let's put everything together now! We will start by creating some dummy data.
# Create the y data
y = torch.ones(10, 5)
# Add some noise to our goal y to generate our x
# We want our model to predict our original data despite the noise
x = y + torch.randn_like(y)
x
tensor([[ 1.6387, 0.7748, 1.7080, 0.2209, 1.9849], [-0.4608, 0.8775, 1.4027, 2.0996, 1.6603], [ 1.5067, 2.1198, 1.8461, 2.6047, 1.5850], [ 0.4918, 0.9058, 1.6317, 0.9045, 0.4642], [-0.0420, 2.1190, -0.1469, 0.5251, 1.7798], [ 1.8992, 1.2926, 0.5929, 1.4380, 0.6741], [ 0.2726, 1.5211, 0.5603, 3.1195, 1.8431], [ 1.1206, 0.8492, 1.2665, 2.8705, 0.7252], [-0.6805, 1.4661, 1.5455, 2.5870, -0.1636], [ 2.4966, 2.1783, 1.8534, 1.7078, -0.0653]])
Now, we can define our model, optimizer and the loss function.
# Instantiate the model
model = MultilayerPerceptron(5, 3)
# Define the optimizer
adam = optim.Adam(model.parameters(), lr=1e-1)
# Define loss using a predefined loss function
loss_function = nn.BCELoss()
# Calculate how our model is doing now
y_pred = model(x)
loss_function(y_pred, y).item()
0.7652291059494019
Let's see if we can have our model achieve a smaller loss. Now that we have everything we need, we can set up our training loop.
# Set the number of epochs, which determines the number of training iterations
n_epoch = 10
for epoch in range(n_epoch):
# Set the gradients to 0
adam.zero_grad()
# Get the model predictions
y_pred = model(x)
# Get the loss
loss = loss_function(y_pred, y)
# Print stats
print(f"Epoch {epoch}: traing loss: {loss}")
# Compute the gradients
loss.backward()
# Take a step to optimize the weights
adam.step()
Epoch 0: training loss: 0.7652291059494019 Epoch 1: training loss: 0.5666258335113525 Epoch 2: training loss: 0.3737648129463196 Epoch 3: training loss: 0.21681034564971924 Epoch 4: training loss: 0.1113014817237854 Epoch 5: training loss: 0.05232418701052666 Epoch 6: training loss: 0.023548150435090065 Epoch 7: training loss: 0.01053957361727953 Epoch 8: training loss: 0.004811116959899664 Epoch 9: training loss: 0.002271441277116537
list(model.parameters())
[Parameter containing: tensor([[ 0.0044, -0.7197, -0.7594, -0.1397, -0.0485], [-0.0994, -0.1889, -0.1952, -0.5861, -0.8391], [ 0.5859, 0.7440, 0.6009, 1.1375, 0.9526]], requires_grad=True), Parameter containing: tensor([-0.6889, -0.4001, 0.5036], requires_grad=True), Parameter containing: tensor([[ 0.4578, 0.1112, 1.1189], [-0.0121, 0.2187, 1.3968], [ 0.3651, 0.3457, 1.0704], [-0.0348, 0.2022, 1.3452], [ 0.3004, 0.6618, 1.3298]], requires_grad=True), Parameter containing: tensor([0.8258, 1.2799, 0.7956, 0.2157, 0.4042], requires_grad=True)]
You can see that our loss is decreasing. Let's check the predictions of our model now and see if they are close to our original y
, which was all 1s
.
# See how our model performs on the training data
y_pred = model(x)
y_pred
tensor([[0.9987, 0.9998, 0.9983, 0.9993, 0.9993], [0.9993, 0.9999, 0.9990, 0.9996, 0.9997], [1.0000, 1.0000, 1.0000, 1.0000, 1.0000], [0.9946, 0.9988, 0.9932, 0.9959, 0.9964], [0.9963, 0.9993, 0.9953, 0.9974, 0.9977], [0.9987, 0.9998, 0.9983, 0.9993, 0.9993], [0.9999, 1.0000, 0.9998, 1.0000, 1.0000], [0.9997, 1.0000, 0.9996, 0.9999, 0.9999], [0.9982, 0.9997, 0.9977, 0.9989, 0.9990], [0.9997, 1.0000, 0.9996, 0.9999, 0.9999]], grad_fn=<SigmoidBackward>)
# Create test data and check how our model performs on it
x2 = y + torch.randn_like(y)
y_pred = model(x2)
y_pred
tensor([[0.9998, 1.0000, 0.9998, 0.9999, 0.9999], [0.8011, 0.8795, 0.7922, 0.7105, 0.7462], [0.9993, 0.9999, 0.9990, 0.9996, 0.9997], [0.9999, 1.0000, 0.9998, 1.0000, 1.0000], [0.9998, 1.0000, 0.9997, 0.9999, 0.9999], [0.9987, 0.9998, 0.9983, 0.9993, 0.9994], [0.6955, 0.7824, 0.6890, 0.5537, 0.5997], [0.9951, 0.9990, 0.9939, 0.9964, 0.9968], [0.9978, 0.9996, 0.9971, 0.9986, 0.9987], [0.9994, 0.9999, 0.9991, 0.9997, 0.9997]], grad_fn=<SigmoidBackward>)
Great! Looks like our model almost perfectly learned to filter out the noise from the x
that we passed in!
Until this part of the notebook, we have learned the fundamentals of PyTorch and built a basic network solving a toy task. Now we will attempt to solve an example NLP task. Here are the things we will learn:
In this section, our goal will be to train a model that will find the words in a sentence corresponding to a LOCATION, which will always be of span 1 (meaning that San Francisco won't be recognized as a LOCATION). Our task is called Word Window Classification for a reason. Instead of letting our model look at only one word in each forward pass, we would like it to be able to consider the context of the word in question. That is, for each word, we want our model to be aware of the surrounding words. Let's dive in!
The very first task of any machine learning project is to set up our training set. Usually, there will be a training corpus we will be utilizing. In NLP tasks, the corpus would generally be a .txt
or .csv
file where each row corresponds to a sentence or a tabular datapoint. In our toy task, we will assume that we have already read our data and the corresponding labels into a Python
list.
# Our raw data, which consists of sentences
corpus = [
"We always come to Paris",
"The professor is from Australia",
"I live in Stanford",
"He comes from Taiwan",
"The capital of Turkey is Ankara"
]
To make it easier for our models to learn, we usually apply a few preprocessing steps to our data. This is especially important when dealing with text data. Here are some examples of text preprocessing:
Which preprocessing steps are necessary is determined by the task at hand. For example, although it is useful to remove special characters in some tasks, for others they may be important (for example, if we are dealing with multiple languages). For our task, we will lowercase our words and tokenize.
# The preprocessing function we will use to generate our training examples
# Our function is a simple one, we lowercase the letters
# and then tokenize the words.
def preprocess_sentence(sentence):
return sentence.lower().split()
# Create our training set
train_sentences = [preprocess_sentence(sent) for sent in corpus]
train_sentences
[['we', 'always', 'come', 'to', 'paris'], ['the', 'professor', 'is', 'from', 'australia'], ['i', 'live', 'in', 'stanford'], ['he', 'comes', 'from', 'taiwan'], ['the', 'capital', 'of', 'turkey', 'is', 'ankara']]
For each training example we have, we should also have a corresponding label. Recall that the goal of our model was to determine which words correspond to a LOCATION
. That is, we want our model to output 0
for all the words that are not LOCATION
s and 1
for the ones that are LOCATION
s.
# Set of locations that appear in our corpus
locations = set(["australia", "ankara", "paris", "stanford", "taiwan", "turkey"])
# Our train labels
train_labels = [[1 if word in locations else 0 for word in sent] for sent in train_sentences]
train_labels
[[0, 0, 0, 0, 1], [0, 0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1, 0, 1]]
Let's look at our training data a little more closely. Each datapoint we have is a sequence of words. On the other hand, we know that machine learning models work with numbers in vectors. How are we going to turn words into numbers? You may be thinking embeddings and you are right!
Imagine that we have an embedding lookup table E
, where each row corresponds to an embedding. That is, each word in our vocabulary would have a corresponding embedding row i
in this table. Whenever we want to find an embedding for a word, we will follow these steps:
1. Find the index i of the word in the embedding table: word -> index.
2. Look up row i of the embedding table to get the embedding: index -> embedding.
Let's look at the first step. We should assign all the words in our vocabulary to a corresponding index. We can do it as follows:
# Find all the unique words in our corpus
vocabulary = set(w for s in train_sentences for w in s)
vocabulary
{'always', 'ankara', 'australia', 'capital', 'come', 'comes', 'from', 'he', 'i', 'in', 'is', 'live', 'of', 'paris', 'professor', 'stanford', 'taiwan', 'the', 'to', 'turkey', 'we'}
vocabulary now contains all the words in our corpus. On the other hand, at test time we may see words that are not contained in our vocabulary. If we can figure out a way to represent the unknown words, our model can still reason about whether they are a LOCATION or not, since we are also looking at the neighboring words for each prediction.
We introduce a special token, <unk>
, to tackle the words that are out of vocabulary. We could pick another string for our unknown token if we wanted. The only requirement here is that our token should be unique: we should only be using this token for unknown words. We will also add this special token to our vocabulary.
# Add the unknown token to our vocabulary
vocabulary.add("<unk>")
Earlier we mentioned that our task is called Word Window Classification because our model looks at the surrounding words in addition to the given word when it needs to make a prediction.
For example, let's take the sentence "We always come to Paris". The corresponding training label for this sentence is 0, 0, 0, 0, 1
since only Paris, the last word, is a LOCATION
. In one pass (meaning a call to forward()
), our model will try to generate the correct label for one word. Let's say our model is trying to generate the correct label 1
for Paris
. If we only allow our model to see Paris
, but nothing else, we will miss out on the important information that the word to
often times appears with LOCATION
s.
Word windows allow our model to consider the surrounding +N
or -N
words of each word when making a prediction. In our earlier example for Paris
, if we have a window size of 1, that means our model will look at the words that come immediately before and after Paris
, which are to
, and, well, nothing. Now, this raises another issue. Paris
is at the end of our sentence, so there isn't another word following it. Remember that we define the input dimensions of our PyTorch
models when we are initializing them. If we set the window size to be 1
, it means that our model will be accepting 3
words in every pass. We cannot have our model expect 2
words from time to time.
The solution is to introduce a special token, such as <pad>
, that will be added to our sentences to make sure that every word has a valid window around them. Similar to <unk>
token, we could pick another string for our pad token if we wanted, as long as we make sure it is used for a unique purpose.
# Add the <pad> token to our vocabulary
vocabulary.add("<pad>")
# Function that pads the given sentence
# We are introducing this function here as an example
# We will be utilizing it later in the tutorial
def pad_window(sentence, window_size, pad_token="<pad>"):
window = [pad_token] * window_size
return window + sentence + window
# Show padding example
window_size = 2
pad_window(train_sentences[0], window_size=window_size)
['<pad>', '<pad>', 'we', 'always', 'come', 'to', 'paris', '<pad>', '<pad>']
Now that our vocabulary is ready, let's assign an index to each of our words.
# We are just converting our vocabulary to a list to be able to index into it
# Sorting is not necessary; we sort to show an ordered word_to_ix dictionary
# That being said, we will see that having the index for the padding token
# be 0 is convenient as some PyTorch functions use it as a default value
# such as nn.utils.rnn.pad_sequence, which we will cover in a bit
ix_to_word = sorted(list(vocabulary))
# Creating a dictionary to find the index of a given word
word_to_ix = {word: ind for ind, word in enumerate(ix_to_word)}
word_to_ix
{'<pad>': 0, '<unk>': 1, 'always': 2, 'ankara': 3, 'australia': 4, 'capital': 5, 'come': 6, 'comes': 7, 'from': 8, 'he': 9, 'i': 10, 'in': 11, 'is': 12, 'live': 13, 'of': 14, 'paris': 15, 'professor': 16, 'stanford': 17, 'taiwan': 18, 'the': 19, 'to': 20, 'turkey': 21, 'we': 22}
ix_to_word[1]
'<unk>'
Great! We are ready to convert our training sentences into a sequence of indices corresponding to each token.
# Given a sentence of tokens, return the corresponding indices
def convert_token_to_indices(sentence, word_to_ix):
indices = []
for token in sentence:
# Check if the token is in our vocabulary. If it is, get its index.
# If not, get the index for the unknown token.
if token in word_to_ix:
index = word_to_ix[token]
else:
index = word_to_ix["<unk>"]
indices.append(index)
return indices
# More compact version of the same function
def _convert_token_to_indices(sentence, word_to_ix):
return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]
# Show an example
example_sentence = ["we", "always", "come", "to", "kuwait"]
example_indices = convert_token_to_indices(example_sentence, word_to_ix)
restored_example = [ix_to_word[ind] for ind in example_indices]
print(f"Original sentence is: {example_sentence}")
print(f"Going from words to indices: {example_indices}")
print(f"Going from indices to words: {restored_example}")
Original sentence is: ['we', 'always', 'come', 'to', 'kuwait'] Going from words to indices: [22, 2, 6, 20, 1] Going from indices to words: ['we', 'always', 'come', 'to', '<unk>']
In the example above, kuwait
shows up as <unk>
, because it is not included in our vocabulary. Let's convert our train_sentences
to example_padded_indices
.
# Converting our sentences to indices
example_padded_indices = [convert_token_to_indices(s, word_to_ix) for s in train_sentences]
example_padded_indices
[[22, 2, 6, 20, 15], [19, 16, 12, 8, 4], [10, 13, 11, 17], [9, 7, 8, 18], [19, 5, 14, 21, 12, 3]]
Now that we have an index for each word in our vocabulary, we can create an embedding table with the nn.Embedding class in PyTorch. It is called as nn.Embedding(num_words, embedding_dimension), where num_words is the number of words in our vocabulary and embedding_dimension is the dimension of the embeddings we want to have. There is nothing fancy about nn.Embedding: it is just a wrapper class around a trainable NxE dimensional tensor, where N is the number of words in our vocabulary and E is the number of embedding dimensions. This table is initially random, but it will change over time. As we train our network, the gradients will be backpropagated all the way to the embedding layer, and hence our word embeddings will be updated. We will initialize the embedding layer we use for our model inside the model itself, but we are showing an example here.
# Creating an embedding table for our words
embedding_dim = 5
embeds = nn.Embedding(len(vocabulary), embedding_dim)
# Printing the parameters in our embedding table
list(embeds.parameters())
[Parameter containing: tensor([[-0.5421, 0.6919, 0.8236, -1.3510, 1.4048], [ 1.2983, 1.4740, 0.1002, -0.5475, 1.0871], [ 1.4604, -1.4934, -0.4363, -0.3231, -1.9746], [ 0.8021, 1.5121, 0.8239, 0.9865, -1.3801], [ 0.3502, -0.5920, 0.9295, 0.6062, -0.6258], [ 0.5038, -1.0187, 0.2860, 0.3231, -1.2828], [ 1.5232, -0.5983, -0.4971, -0.5137, 1.4319], [ 0.3826, 0.6501, -0.3948, 1.3998, -0.5133], [-0.1728, -0.7658, 0.2873, -2.1812, 0.9506], [-0.5617, 0.4552, 0.0618, -1.7503, 0.2192], [-0.5405, 0.7887, -0.9843, -0.6110, 0.6391], [ 0.6581, -0.7067, 1.3208, 1.3860, -1.5113], [ 1.1594, 0.4977, -1.9175, 0.0916, 0.0085], [ 0.3317, 1.8169, 0.0802, -0.1456, -0.7304], [ 0.4997, -1.4895, 0.1237, -0.4121, 0.8909], [ 0.6732, 0.4117, -0.5378, 0.6632, -2.7096], [-0.4580, -0.9436, -1.6345, 0.1284, -1.6147], [-0.3537, 1.9635, 1.0702, -0.1894, -0.8822], [-0.4057, -1.2033, -0.7083, 0.4087, -1.1708], [-0.6373, 0.5272, 1.8711, -0.5865, -0.7643], [ 0.4714, -2.5822, 0.4338, 0.1537, -0.7650], [-2.1828, 1.3178, 1.3833, 0.5018, -1.7209], [-0.5354, 0.2153, -0.1482, 0.3903, 0.0900]], requires_grad=True)]
To get the word embedding for a word in our vocabulary, all we need to do is to create a lookup tensor. The lookup tensor is just a tensor containing the index we want to look up. The nn.Embedding class expects an index tensor of type LongTensor, so we should create our tensor accordingly.
# Get the embedding for the word Paris
index = word_to_ix["paris"]
index_tensor = torch.tensor(index, dtype=torch.long)
paris_embed = embeds(index_tensor)
paris_embed
tensor([ 0.6732, 0.4117, -0.5378, 0.6632, -2.7096], grad_fn=<EmbeddingBackward>)
# We can also get multiple embeddings at once
index_paris = word_to_ix["paris"]
index_ankara = word_to_ix["ankara"]
indices = [index_paris, index_ankara]
indices_tensor = torch.tensor(indices, dtype=torch.long)
embeddings = embeds(indices_tensor)
embeddings
tensor([[ 0.6732, 0.4117, -0.5378, 0.6632, -2.7096], [ 0.8021, 1.5121, 0.8239, 0.9865, -1.3801]], grad_fn=<EmbeddingBackward>)
Usually, we define the embedding layer as part of our model, which you will see in the later sections of our notebook.
We have learned about batches in class. Waiting for our whole training corpus to be processed before making an update is costly. On the other hand, updating the parameters after every training example causes the loss to be less stable between updates. To combat these issues, we instead update our parameters after training on a batch of data. This allows us to get a better estimate of the gradient of the global loss. In this section, we will learn how to structure our data into batches using the torch.utils.data.DataLoader class.
We will be calling the DataLoader
class as follows: DataLoader(data, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)
. The batch_size
parameter determines the number of examples per batch. In every epoch, we will be iterating over all the batches using the DataLoader
. The order of batches is deterministic by default, but we can ask DataLoader
to shuffle the batches by setting the shuffle
parameter to True
. This way we ensure that our model doesn't see the examples grouped into the exact same batches in every epoch.
If provided, DataLoader
passes the batches it prepares to the collate_fn
. We can write a custom function to pass to the collate_fn
parameter in order to print stats about our batch or perform extra processing. In our case, we will use the collate_fn
to:
1. Window pad our training examples using the <pad> token.
2. Convert the words in our examples to indices using the word_to_ix dictionary.
3. Pad the examples and labels in each batch so that they all have the same length, recording the original lengths so we know how many words each example contains.
Because our version of the collate_fn function will need access to our word_to_ix dictionary (so that it can turn words into indices), we will make use of the partial function in Python, which lets us fix some of a function's arguments ahead of time.
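If partial is new to you, here is a tiny sketch (not from the original notebook) of what it does:
from functools import partial

def power(base, exponent):
    return base ** exponent

# partial fixes the exponent argument ahead of time and returns a new function
square = partial(power, exponent=2)
square(5)
25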
from torch.utils.data import DataLoader
from functools import partial
def custom_collate_fn(batch, window_size, word_to_ix):
# Break our batch into the training examples (x) and labels (y)
# We are turning our x and y into tensors because nn.utils.rnn.pad_sequence
# method expects tensors. This is also useful since our model will be
# expecting tensor inputs.
x, y = zip(*batch)
# Now we need to window pad our training examples. We have already defined a
# function to handle window padding. We are including it here again so that
# everything is in one place.
def pad_window(sentence, window_size, pad_token="<pad>"):
window = [pad_token] * window_size
return window + sentence + window
# Pad the train examples.
x = [pad_window(s, window_size=window_size) for s in x]
# Now we need to turn words in our training examples to indices. We are
# copying the function defined earlier for the same reason as above.
def convert_tokens_to_indices(sentence, word_to_ix):
return [word_to_ix.get(token, word_to_ix["<unk>"]) for token in sentence]
# Convert the train examples into indices.
x = [convert_tokens_to_indices(s, word_to_ix) for s in x]
# We will now pad the examples so that the lengths of all the example in
# one batch are the same, making it possible to do matrix operations.
# We set the batch_first parameter to True so that the returned matrix has
# the batch as the first dimension.
pad_token_ix = word_to_ix["<pad>"]
# pad_sequence function expects the input to be a tensor, so we turn x into one
x = [torch.LongTensor(x_i) for x_i in x]
x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)
# We will also pad the labels. Before doing so, we will record the number
# of labels so that we know how many words existed in each example.
lengths = [len(label) for label in y]
lengths = torch.LongTensor(lengths)
y = [torch.LongTensor(y_i) for y_i in y]
y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)
# We are now ready to return our variables. The order we return our variables
# here will match the order we read them in our training loop.
return x_padded, y_padded, lengths
This function seems long, but it really doesn't have to be. Check out the alternative version below where we remove the extra function declarations and comments.
def _custom_collate_fn(batch, window_size, word_to_ix):
# Prepare the datapoints
x, y = zip(*batch)
x = [pad_window(s, window_size=window_size) for s in x]
x = [convert_tokens_to_indices(s, word_to_ix) for s in x]
# Pad x so that all the examples in the batch have the same size
pad_token_ix = word_to_ix["<pad>"]
x = [torch.LongTensor(x_i) for x_i in x]
x_padded = nn.utils.rnn.pad_sequence(x, batch_first=True, padding_value=pad_token_ix)
# Pad y and record the length
lengths = [len(label) for label in y]
lengths = torch.LongTensor(lengths)
y = [torch.LongTensor(y_i) for y_i in y]
y_padded = nn.utils.rnn.pad_sequence(y, batch_first=True, padding_value=0)
return x_padded, y_padded, lengths
Now, we can see the DataLoader
in action.
# Parameters to be passed to the DataLoader
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)
# Instantiate the DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)
# Go through one loop
counter = 0
for batched_x, batched_y, batched_lengths in loader:
print(f"Iteration {counter}")
print("Batched Input:")
print(batched_x)
print("Batched Labels:")
print(batched_y)
print("Batched Lengths:")
print(batched_lengths)
print("")
counter += 1
Iteration 0 Batched Input: tensor([[ 0, 0, 22, 2, 6, 20, 15, 0, 0], [ 0, 0, 19, 16, 12, 8, 4, 0, 0]]) Batched Labels: tensor([[0, 0, 0, 0, 1], [0, 0, 0, 0, 1]]) Batched Lengths: tensor([5, 5]) Iteration 1 Batched Input: tensor([[ 0, 0, 19, 5, 14, 21, 12, 3, 0, 0], [ 0, 0, 10, 13, 11, 17, 0, 0, 0, 0]]) Batched Labels: tensor([[0, 0, 0, 1, 0, 1], [0, 0, 0, 1, 0, 0]]) Batched Lengths: tensor([6, 4]) Iteration 2 Batched Input: tensor([[ 0, 0, 9, 7, 8, 18, 0, 0]]) Batched Labels: tensor([[0, 0, 0, 1]]) Batched Lengths: tensor([4])
The batched input tensors you see above will be passed into our model. On the other hand, we started off saying that our model will be a window classifier. The way our input tensors are currently formatted, we have all the words in a sentence in one datapoint. When we pass this input to our model, it needs to create the windows for each word, make a prediction as to whether the center word is a LOCATION
or not for each window, put the predictions together and return.
We could avoid this problem if we formatted our data by breaking it into windows beforehand. In this example, we will instead have our model take care of the formatting.
Given that our window_size is N, we want our model to make a prediction for every window of 2N+1 tokens. That is, if we have an input with 9
tokens, and a window_size
of 2
, we want our model to return 5
predictions. This makes sense because before we padded it with 2
tokens on each side, our input also had 5
tokens in it!
We can create these windows by using for loops, but there is a faster PyTorch
alternative, which is the unfold(dimension, size, step)
method. We can create the windows we need using this method as follows:
# Print the original tensor
print(f"Original Tensor: ")
print(batched_x)
print("")
# Create the 2 * 2 + 1 chunks
chunk = batched_x.unfold(1, window_size*2 + 1, 1)
print(f"Windows: ")
print(chunk)
Original Tensor: tensor([[ 0, 0, 9, 7, 8, 18, 0, 0]]) Windows: tensor([[[ 0, 0, 9, 7, 8], [ 0, 9, 7, 8, 18], [ 9, 7, 8, 18, 0], [ 7, 8, 18, 0, 0]]])
Now that we have prepared our data, we are ready to build our model. We have learned how to write custom nn.Module
classes. We will do the same here and put everything we have learned so far together.
class WordWindowClassifier(nn.Module):
def __init__(self, hyperparameters, vocab_size, pad_ix=0):
super(WordWindowClassifier, self).__init__()
""" Instance variables """
self.window_size = hyperparameters["window_size"]
self.embed_dim = hyperparameters["embed_dim"]
self.hidden_dim = hyperparameters["hidden_dim"]
self.freeze_embeddings = hyperparameters["freeze_embeddings"]
""" Embedding Layer
Takes in a tensor containing embedding indices, and returns the
corresponding embeddings. The output is of dim
(number_of_indices * embedding_dim).
If freeze_embeddings is True, set the embedding layer parameters to be
non-trainable. This is useful if we only want the parameters other than the
embeddings parameters to change.
"""
self.embeds = nn.Embedding(vocab_size, self.embed_dim, padding_idx=pad_ix)
if self.freeze_embeddings:
self.embeds.weight.requires_grad = False
""" Hidden Layer
"""
full_window_size = 2 * self.window_size + 1
self.hidden_layer = nn.Sequential(
nn.Linear(full_window_size * self.embed_dim, self.hidden_dim),
nn.Tanh()
)
""" Output Layer
"""
self.output_layer = nn.Linear(self.hidden_dim, 1)
""" Probabilities
"""
self.probabilities = nn.Sigmoid()
def forward(self, inputs):
"""
Let B:= batch_size
L:= window-padded sentence length
D:= self.embed_dim
S:= self.window_size
H:= self.hidden_dim
inputs: a (B, L) tensor of token indices
"""
B, L = inputs.size()
"""
Reshaping.
Takes in a (B, L) LongTensor
Outputs a (B, L~, S) LongTensor
"""
# First, get our word windows for each word in our input.
token_windows = inputs.unfold(1, 2 * self.window_size + 1, 1)
_, adjusted_length, _ = token_windows.size()
# Good idea to do internal tensor-size sanity checks, at the least in comments!
assert token_windows.size() == (B, adjusted_length, 2 * self.window_size + 1)
"""
Embedding.
Takes in a torch.LongTensor of size (B, L~, S)
Outputs a (B, L~, S, D) FloatTensor.
"""
embedded_windows = self.embeds(token_windows)
"""
Reshaping.
Takes in a (B, L~, S, D) FloatTensor.
Resizes it into a (B, L~, S*D) FloatTensor.
-1 argument "infers" what the last dimension should be based on leftover axes.
"""
embedded_windows = embedded_windows.view(B, adjusted_length, -1)
"""
Layer 1.
Takes in a (B, L~, S*D) FloatTensor.
Resizes it into a (B, L~, H) FloatTensor
"""
layer_1 = self.hidden_layer(embedded_windows)
"""
Layer 2
Takes in a (B, L~, H) FloatTensor.
Resizes it into a (B, L~, 1) FloatTensor.
"""
output = self.output_layer(layer_1)
"""
Sigmoid.
Takes in a (B, L~, 1) FloatTensor of unnormalized class scores.
Outputs a (B, L~, 1) FloatTensor of probabilities.
"""
output = self.probabilities(output)
output = output.view(B, -1)
return output
We are now ready to put everything together. Let's start with preparing our data and initializing our model. We can then initialize our optimizer and define our loss function. This time, instead of using one of the predefined loss functions as we did before, we will define our own loss function.
# Prepare the data
data = list(zip(train_sentences, train_labels))
batch_size = 2
shuffle = True
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=window_size, word_to_ix=word_to_ix)
# Instantiate a DataLoader
loader = DataLoader(data, batch_size=batch_size, shuffle=shuffle, collate_fn=collate_fn)
# Initialize a model
# It is useful to put all the model hyperparameters in a dictionary
model_hyperparameters = {
"batch_size": 4,
"window_size": 2,
"embed_dim": 25,
"hidden_dim": 25,
"freeze_embeddings": False,
}
vocab_size = len(word_to_ix)
model = WordWindowClassifier(model_hyperparameters, vocab_size)
# Define an optimizer
learning_rate = 0.01
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
# Define a loss function, which computes the binary cross entropy loss
def loss_function(batch_outputs, batch_labels, batch_lengths):
# Calculate the loss for the whole batch
bceloss = nn.BCELoss()
loss = bceloss(batch_outputs, batch_labels.float())
# Rescale the loss. Remember that we have used lengths to store the
# number of words in each training example
loss = loss / batch_lengths.sum().float()
return loss
Unlike our earlier example, this time instead of passing all of our training data to the model at once in each epoch, we will be utilizing batches. Hence, in each training epoch iteration, we also iterate over the batches.
# Function that will be called in every epoch
def train_epoch(loss_function, optimizer, model, loader):
# Keep track of the total loss for the batch
total_loss = 0
for batch_inputs, batch_labels, batch_lengths in loader:
# Clear the gradients
optimizer.zero_grad()
# Run a forward pass
outputs = model.forward(batch_inputs)
# Compute the batch loss
loss = loss_function(outputs, batch_labels, batch_lengths)
# Calculate the gradients
loss.backward()
# Update the parameters
optimizer.step()
total_loss += loss.item()
return total_loss
# Function containing our main training loop
def train(loss_function, optimizer, model, loader, num_epochs=10000):
# Iterate through each epoch and call our train_epoch function
for epoch in range(num_epochs):
epoch_loss = train_epoch(loss_function, optimizer, model, loader)
if epoch % 100 == 0: print(epoch_loss)
Let's start training!
num_epochs = 1000
train(loss_function, optimizer, model, loader, num_epochs=num_epochs)
0.3274914249777794 0.24941639229655266 0.1968013420701027 0.1381114460527897 0.11672545038163662 0.09148690290749073 0.07141915801912546 0.05857925023883581 0.04900792893022299 0.04107789508998394
Let's see how well our model is at making predictions. We can start by creating our test data.
# Create test sentences
test_corpus = ["She comes from Paris"]
test_sentences = [s.lower().split() for s in test_corpus]
test_labels = [[0, 0, 0, 1]]
# Create a test loader
test_data = list(zip(test_sentences, test_labels))
batch_size = 1
shuffle = False
window_size = 2
collate_fn = partial(custom_collate_fn, window_size=2, word_to_ix=word_to_ix)
test_loader = torch.utils.data.DataLoader(test_data,
batch_size=1,
shuffle=False,
collate_fn=collate_fn)
Let's loop over our test examples to see how well we are doing.
for test_instance, labels, _ in test_loader:
outputs = model.forward(test_instance)
print(labels)
print(outputs)
tensor([[0, 0, 0, 1]]) tensor([[0.0339, 0.1031, 0.0500, 0.9770]], grad_fn=<ViewBackward>)