Week 2 - Ungraded Lab: A journey through Data

Welcome to the ungraded lab for week 2 of Machine Learning Engineering for Production. The paradigm behind Deep Learning is shifting from model-centric to data-centric. In this lab you will see how data intricacies affect the outcome of your models. To show you how far you can get by changing only the data, without touching the model, you will use a single model throughout: a simple Convolutional Neural Network (CNN). While training this model you will run into two common problems, class imbalance and overfitting, and the lab will walk you through useful diagnostic tools and methods to mitigate them.



IMPORTANT NOTES BEFORE STARTING THE LAB

Once opened in Colab, click on the "Connect" button in the upper right corner of the screen to connect to a runtime and run this lab.

NOTE 1:

For this lab you have the option to either train the models yourself (this takes around 20 minutes per model with a GPU enabled) or to use pretrained versions which are already provided. There are a total of 3 CNNs that require training and, although some parameters have been tuned for faster training (steps_per_epoch and validation_steps have been heavily lowered), training them all yourself may mean spending more time running this lab than thinking about what you observe.

To speed things up we have provided saved pretrained versions of each model along with their respective training histories. We recommend you use these pretrained versions to save time. However, we also consider training a model an important learning experience, especially if you haven't done this before. If you want to perform the training yourself, the code for replicating it is provided as well. In that case a GPU is absolutely necessary, so be sure that it is enabled.

To make sure your runtime uses a GPU, go to Runtime -> Change runtime type -> Select GPU from the menu and then press SAVE.

If you decide to use the pretrained versions, make sure you are not using a GPU, as it is not required and may prevent other users from getting access to one. To check this, go to Runtime -> Change runtime type -> Select None from the menu and then press SAVE.

NOTE 2:

Colab does not guarantee access to a GPU; this depends on the availability of these resources. However, it is not very common to be denied GPU access. If this happens to you, you can still run this lab without training the models yourself. If you really want to do the training but are denied a GPU, try switching the runtime to a GPU again after a couple of hours.

To know more about Colab's policies check out this FAQ.



Let's get started!

Before you move on, download the two datasets used in the lab, as well as the pretrained models and histories:

A story of data

To guide you through this lab we have prepared a narrative that simulates a real life scenario:

Suppose you have been tasked with creating a model that classifies images of cats, dogs and birds. For this you settle on a simple CNN architecture, since CNNs are known to perform well for image classification. You are probably familiar with two widely used datasets: Cats vs Dogs and Caltech Birds. As a side note, both datasets are available through TensorFlow Datasets (TFDS). However, you decide NOT to use TFDS since the lab requires you to modify the data and combine the two datasets into one.

Combining the datasets

The raw images in these datasets can be found within the following paths:

The next step is extracting the data into a directory of choice, base_dir in this case.

Note that the cats vs dogs images come in a zip file while the Caltech Birds images come in a tar file.
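In case you want a feel for what that extraction step could look like, here is a minimal sketch. The archive file names and the base_dir value below are illustrative, not the lab's actual paths:

```python
import zipfile
import tarfile

base_dir = '/tmp/data'  # illustrative destination directory

# Cats vs Dogs ships as a zip file (archive name is an assumption)
with zipfile.ZipFile('cats_and_dogs.zip', 'r') as archive:
    archive.extractall(base_dir)

# Caltech Birds ships as a tar file (archive name is an assumption)
with tarfile.open('caltech_birds.tar', 'r') as archive:
    archive.extractall(base_dir)
```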

For the cats and dogs images no further preprocessing is needed, as all exemplars of a single class are located in one directory: PetImages/Cat and PetImages/Dog respectively. Let's check how many images are available for each category:

The Bird images dataset organization is quite different. This dataset is commonly used to classify species of birds so there is a directory for each species. Let's treat all species of birds as a single class. This requires moving all bird images to a single directory (PetImages/Bird will be used for consistency). This can be done by running the next cell:
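The cell's logic is roughly the following sketch. The raw birds directory layout below is an assumption; check the actual folder names after extraction:

```python
import os
import shutil

# Hypothetical layout: one subdirectory per bird species (base_dir from earlier)
raw_birds_dir = os.path.join(base_dir, 'CUB_200_2011/images')
birds_dir = os.path.join(base_dir, 'PetImages/Bird')
os.makedirs(birds_dir, exist_ok=True)

# Flatten all species folders into the single Bird class directory
for species in os.listdir(raw_birds_dir):
    species_dir = os.path.join(raw_birds_dir, species)
    for image_name in os.listdir(species_dir):
        shutil.move(os.path.join(species_dir, image_name), birds_dir)
```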

It turns out that there is a similar number of images for each class you are trying to predict! Nice!

Let's take a quick look at an image of each class you are trying to predict.

Train / Evaluate Split

Before training the model you need to split the data into training and evaluation sets. For training, we have chosen the Keras application programming interface (API), which includes functionality to read images from various directories. The easiest way to split the data is to create a separate directory for each split of each class.

Run the next cell to create the directories for the training and evaluation sets.

Now, let's define a function that will move a percentage of images from an origin folder to a destination folder as desired to generate the training and evaluation splits:
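A minimal sketch of such a function follows; the name move_to_destination and the sampling strategy are assumptions rather than the lab's exact code:

```python
import os
import random
import shutil

def move_to_destination(origin, destination, percentage_split):
    # Hypothetical helper: moves a fraction (e.g. 0.8 for 80%) of the
    # files in `origin` into `destination`, chosen at random
    num_images = int(len(os.listdir(origin)) * percentage_split)
    for image_name in random.sample(os.listdir(origin), num_images):
        shutil.move(os.path.join(origin, image_name), destination)

# Example usage (paths are illustrative):
# move_to_destination('PetImages/Cat', 'train/cats', 0.8)
```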

And now you are ready to call the previous function and split the data:

Something important to mention is that, as it currently stands, your dataset has some issues that will prevent model training and evaluation. Mainly:

  1. Some images are corrupted and have zero bytes.
  2. The cats vs dogs zip file includes a .db file for each class that needs to be deleted.

If you don't fix these issues before training, you will get errors and training will fail. Zero-byte files are not valid images, and Keras will let you know as soon as one of them is reached. In the same way, .db files are not valid images. It is good practice to always make sure you are feeding files with the correct specifications to your training algorithm before running it; these issues might not surface right away, and you would have to solve them and restart training.

Running the following bash commands in the base directory will resolve these issues:

The first command removes all zero-byte files from the filesystem. The second one removes any file that does not have a .jpg extension.

This also serves as a reminder of the power of bash. Although you could achieve the same result with Python code, bash lets you do it much more quickly. If you are not familiar with bash or another shell language, we encourage you to learn some of it, as it is a very useful tool for data manipulation.
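For reference, here is a sketch of what an equivalent pure-Python cleanup could look like, assuming the base_dir used earlier:

```python
import os

# Walk the whole base directory and delete zero-byte files as well as
# anything that does not have a .jpg extension (e.g. the .db files)
for root, _, files in os.walk(base_dir):
    for fname in files:
        fpath = os.path.join(root, fname)
        if os.path.getsize(fpath) == 0 or not fname.lower().endswith('.jpg'):
            os.remove(fpath)
```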

Let's check how many images you have available for each split and class after you remove the corrupted images:

It turns out that very few files presented the issues mentioned above. That's good news, but it is also a reminder that small problems with the dataset can unexpectedly affect the training process. In this case, 4 invalid image files would have prevented you from training the model.

In most cases training Deep Learning models is a time-intensive task, so be sure to have everything in place before starting this process.

An unexpected issue!

Let's face the first real-life issue in this narrative! There was a power outage in your office and some hard drives were damaged. As a result, many of the images of dogs and birds were erased. In fact, only 20% of the dog images and 10% of the bird images survived.

To simulate this scenario, let's quickly create a new directory called imbalanced and copy only the proportions mentioned above for each class.
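The copying logic could look roughly like the sketch below; the helper name copy_with_limit and the directory paths are assumptions:

```python
import os
import random
import shutil

def copy_with_limit(origin, destination, percentage_split):
    # Hypothetical helper: copies only a random fraction of the images
    # in `origin` into `destination`
    os.makedirs(destination, exist_ok=True)
    num_images = int(len(os.listdir(origin)) * percentage_split)
    for image_name in random.sample(os.listdir(origin), num_images):
        shutil.copy(os.path.join(origin, image_name), destination)

# Keep all cats but only 20% of dogs and 10% of birds (paths are illustrative)
copy_with_limit('train/cats', 'imbalanced/train/cats', 1.0)
copy_with_limit('train/dogs', 'imbalanced/train/dogs', 0.2)
copy_with_limit('train/birds', 'imbalanced/train/birds', 0.1)
```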

For now there is no quick or clear solution to the accidental file loss, so you decide to keep going and train the model with the remaining images.

Selecting the model

Let's go ahead and create a model architecture and define the loss function, optimizer and performance metrics, leveraging the Keras API:
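A sketch of what such a model could look like is shown below; the layer sizes, input shape and optimizer are plausible choices, not necessarily the lab's exact architecture:

```python
import tensorflow as tf

def create_model():
    # A small CNN for 3-way image classification (sketch)
    model = tf.keras.models.Sequential([
        tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(150, 150, 3)),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(3, activation='softmax')  # cats, dogs, birds
    ])
    model.compile(
        loss='categorical_crossentropy',
        optimizer='adam',
        metrics=['accuracy']
    )
    return model
```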

And let's print out a model summary as a quick check.

For training the model you will be using Keras' ImageDataGenerator, which has built-in functionalities to easily feed your model with raw, rescaled or even augmented image data.

Another cool functionality within ImageDataGenerator is the flow_from_directory method, which allows you to read images as needed from a root directory. This method needs the following arguments:

If you want to learn more about using Keras' ImageDataGenerator, check this tutorial.
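As a concrete illustration, a rescaling-only training generator could be set up like this; the directory name, image size and batch size are illustrative values:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1]; no augmentation yet
train_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    'train',                  # root directory with one subfolder per class
    target_size=(150, 150),   # images are resized on the fly
    batch_size=32,
    class_mode='categorical'  # one-hot labels for the 3 classes
)
```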

Let's do a quick sanity check to verify that both generators (training and validation) use the same labels for each class:
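One way to do this check, assuming a validation generator was created the same way as the training one (generator names are illustrative):

```python
# class_indices maps class names to integer labels; both generators
# should report the identical mapping
print(train_generator.class_indices)
print(validation_generator.class_indices)
```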

Training a CNN with class imbalanced data

To analyze the model's performance properly, it is important to track different metrics, such as accuracy and the loss function, along the training process. Let's define a helper function to handle the metrics through the training history, depending on the method you previously selected:
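A sketch of such a helper, assuming it receives either a Keras History object (if you trained the model yourself) or a saved history dictionary (if you loaded the pretrained version):

```python
import matplotlib.pyplot as plt

def plot_train_eval(history):
    # Accept either a History object or a plain dict of metric lists
    metrics = history.history if hasattr(history, 'history') else history

    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    ax1.plot(metrics['accuracy'], label='training')
    ax1.plot(metrics['val_accuracy'], label='evaluation')
    ax1.set_title('Accuracy')
    ax1.legend()
    ax2.plot(metrics['loss'], label='training')
    ax2.plot(metrics['val_loss'], label='evaluation')
    ax2.set_title('Loss')
    ax2.legend()
    plt.show()
```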

Now, let's plot the metrics and losses for each epoch of the training process.

From these two plots it is quite evident that the model is overfitting the training data. However, the evaluation accuracy is still pretty high. Maybe class imbalance is not such a big issue after all. Perhaps this is too good to be true.

Let's dive a little deeper and compute some additional metrics to explore whether the class imbalance is hampering the model's performance. In particular, let's compare the accuracy score, the balanced accuracy score, and the confusion matrix. Information on how the accuracy scores are calculated is provided in the sklearn documentation. To refresh your memory on what a confusion matrix is, check Wikipedia.
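With sklearn, all three quantities take only a few lines; here y_true and y_pred are assumed to hold the integer labels of the evaluation set and the model's argmax predictions, computed earlier:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

print('accuracy:         ', accuracy_score(y_true, y_pred))
print('balanced accuracy:', balanced_accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```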

Comparing the accuracy and balanced accuracy metrics, the class imbalance starts to become apparent. Now let's compute the confusion matrix of the predictions. Notice that the class imbalance is also present in the evaluation set so the confusion matrix will show an overwhelming majority for cats.

Class imbalance is a real problem that, if not detected early on, gives the false impression that your model is performing better than it actually is. For this reason, it is important to rely on several metrics that do a better job of capturing these kinds of issues. In this case, the standard accuracy metric is misleading.

To prove this point further, consider a model that only predicts cats:
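Assuming "cat" maps to label 0 (as the class_indices check above would confirm), such a degenerate predictor can be simulated directly:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# A predictor that always answers "cat" (label 0 is an assumption)
all_cats = np.zeros_like(y_true)
print('accuracy:         ', accuracy_score(y_true, all_cats))           # inflated by the majority class
print('balanced accuracy:', balanced_accuracy_score(y_true, all_cats))  # roughly 1/3
```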

If you only look at the accuracy metric, the model seems to be working fairly well, since the majority class is the same class that the model always predicts.

There are several techniques to deal with class imbalance. A very popular one is SMOTE, which oversamples the minority classes by creating synthetic data. However, these techniques are outside the scope of this lab.

The previous metrics were computed with class imbalance both on the training and evaluation sets. If you are wondering how the model performed with class imbalance only on the training set run the following cell to see the confusion matrix with balanced classes in the evaluation set:

Training with the complete dataset

For the time being, and following the narrative, assume that a colleague of yours was careful enough to save a backup of the complete dataset in her cloud storage. Now you can try training without the class imbalance issue. What a relief!

Now that you have the complete dataset it is time to try again without suffering from class imbalance. In general, collecting more data is beneficial for models!

Let's check how the accuracy vs balanced accuracy comparison looks now:

Both accuracy-based metrics are very similar now. The confusion matrix also looks way better than before. This suggests that class imbalance has been successfully mitigated by adding more data to the previously undersampled classes.

Now that you know you can trust the accuracy metric, let's plot the training history:

This looks much better than for the imbalanced case! However, overfitting is still present.

Can you think of ways to address this issue? If you are familiar with CNNs, you might think of adding dropout layers. This intuition is correct, but for the time being you decide to stick with the same model and only change the data, to see whether overfitting can be mitigated this way.

Another possible solution is to apply data augmentation techniques. Your whole team agrees this is the way to go so you decide to try this next!

Training with Data Augmentation

Image augmentation is a technique in which you create new versions of the images you have at hand by applying geometric transformations: zooming in and out, rotating, or even flipping the images. By doing this you get a training dataset that exposes the model to a wider variety of images, which helps further explore the feature space and hence reduces the chances of overfitting.

It is also a very natural idea since doing slight (or sometimes not so slight) changes to an image will result in an equally valid image. A cat sitting in an awkward position is still a cat, right?

Notice that the only difference from the previous training is that the ImageDataGenerator object now has some extra parameters. We encourage you to read more about this topic here if you haven't already. Also note that the extra parameters were added only to the training generator, since this technique should only be applied to the training images.
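For reference, an augmented training generator could look roughly like this; the specific parameter values below are illustrative:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=50,        # random rotations of up to 50 degrees
    width_shift_range=0.15,   # random horizontal shifts
    height_shift_range=0.15,  # random vertical shifts
    zoom_range=0.2,           # random zooming in and out
    horizontal_flip=True,     # random mirroring
    fill_mode='nearest'       # fill pixels exposed by the transforms
)
```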

But what exactly are these extra parameters doing?

Let's see these transformations in action. The following cell applies and displays different transformations for a single image:
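A minimal sketch of such a visualization, assuming img_array holds a single image as a numpy array and train_datagen is the augmented generator from above:

```python
import numpy as np
import matplotlib.pyplot as plt

# flow() expects a batch dimension, so wrap the single image
sample = np.expand_dims(img_array, axis=0)

# Draw four random augmented versions of the same image
fig, axes = plt.subplots(1, 4, figsize=(16, 4))
for ax, batch in zip(axes, train_datagen.flow(sample, batch_size=1)):
    ax.imshow(batch[0])
    ax.axis('off')
plt.show()
```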

Let's look at another more extreme example:

Feel free to try your own custom ImageDataGenerators! The results can be very fun to watch. If you check the docs there are some other parameters you may want to toy with.

Now that you know what data augmentation does to the training images, let's move on to training:

Since you know that class imbalance is no longer an issue there is no need to check for more in-depth metrics.

Let's plot the training history right away:

Now the evaluation accuracy follows the training accuracy much more closely. This indicates that the model is no longer overfitting. Quite a remarkable result, achieved just by augmenting the dataset. Another option for handling overfitting is to include dropout layers in your model, as mentioned earlier.

Another point worth mentioning is that this model achieves a slightly lower evaluation accuracy than the model without data augmentation. The reason is that this model needs more epochs to train. To spot this, notice that for the model without data augmentation the training accuracy reached almost 100%, whereas the augmented one can still improve.

Wrapping it up

Congratulations on finishing this ungraded lab!

It is quite amazing to see how data alone can impact Deep Learning models. Hopefully this lab helped you gain a better understanding of the importance of data.

In particular, you figured out ways to diagnose the effects of class imbalance and looked at specific metrics to spot this problem. Adding more data is a simple way to overcome class imbalance. However, this is not always feasible in a real life scenario.

In the final section, you applied multiple geometric transformations to the images in the training dataset, to generate an augmented version. The goal was to use data augmentation to reduce overfitting. Changing the network architecture is an alternative method to reduce overfitting. In practice, it is a good idea to implement both techniques for better results.

Keep it up!