Ungraded Lab: Quantization and Pruning

In this lab, you will get some hands-on practice with the mobile optimization techniques discussed in the lectures. These techniques reduce model size and latency, which makes the resulting models ideal for edge and IoT devices. You will start by training a Keras model, then compare its model size and accuracy after going through these techniques:

- converting to TensorFlow Lite (TF Lite) format
- post-training quantization
- quantization aware training
- pruning

Let's begin!

Imports

Let's first import a few common libraries that you'll be using throughout the notebook.
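Here is a minimal sketch of the imports this notebook typically relies on (the lab's actual import cell may differ slightly):

```python
# Common imports used throughout this notebook (illustrative sketch).
import os
import tempfile
import zipfile

import numpy as np
import tensorflow as tf
from tensorflow import keras
```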

Utilities and constants

Next, let's define a few string constants and utility functions to make our code easier to maintain.

Download and Prepare the Dataset

You will be using the MNIST dataset which is hosted in Keras Datasets. Some of the helper functions in this notebook are made to work with this dataset, so if you decide to switch to a different dataset, make sure to check whether those helpers need to be modified (e.g. the input shape expected by the Flatten layer in your model).
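For reference, loading and normalizing MNIST typically looks like this (variable names here are placeholders, not necessarily the ones used in the lab):

```python
# Load the MNIST dataset from Keras Datasets.
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

# Normalize the pixel values from [0, 255] to [0, 1].
train_images = train_images / 255.0
test_images = test_images / 255.0
```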

Baseline Model

You will first build and train a Keras model. This will be the baseline against which you will compare the mobile-optimized versions later on. It is just a shallow CNN with a softmax output to classify a given MNIST digit. You can review the model_builder() function in the utilities at the top of this notebook, but the model summary is also printed below to show the architecture.
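The model_builder() utility is defined in the Utilities section; a sketch of a comparable shallow CNN (the exact layer sizes here are assumptions) looks like this:

```python
def model_builder():
    '''Builds a shallow CNN for MNIST digit classification (illustrative sketch).'''
    model = keras.Sequential([
        keras.layers.InputLayer(input_shape=(28, 28)),
        keras.layers.Reshape(target_shape=(28, 28, 1)),
        keras.layers.Conv2D(filters=12, kernel_size=(3, 3), activation='relu'),
        keras.layers.MaxPooling2D(pool_size=(2, 2)),
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation='softmax')
    ])
    return model

baseline_model = model_builder()
baseline_model.summary()
```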

You will also save the weights so you can reinitialize the other models later in the same way. This is not needed in real projects, but for this demo notebook it is useful to start from the same initial state so you can compare the effects of the optimizations.

You can then compile and train the model. In practice, it's best to shuffle the training set, but for this demo shuffling is set to False so the results are reproducible. One epoch below will reach around 91% accuracy.
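A sketch of the compile-and-train step, assuming standard hyperparameters (optimizer, loss, and epoch count are assumptions):

```python
# Compile and train the baseline model. shuffle=False keeps results reproducible.
baseline_model.compile(optimizer='adam',
                       loss='sparse_categorical_crossentropy',
                       metrics=['accuracy'])

baseline_model.fit(train_images, train_labels,
                   epochs=1, shuffle=False)
```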

Let's save the accuracy of the model against the test set so you can compare later.

Next, you will save the Keras model as a file and record its size as well.
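Saving the model and measuring its file size might look like this (the filename is a hypothetical placeholder):

```python
# Save the Keras model to an H5 file and record its size on disk.
BASELINE_MODEL_FILE = 'baseline_model.h5'  # hypothetical filename
baseline_model.save(BASELINE_MODEL_FILE, include_optimizer=False)

baseline_size_bytes = os.path.getsize(BASELINE_MODEL_FILE)
print(f'Baseline model size: {baseline_size_bytes} bytes')
```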

Convert the model to TF Lite format

Next, you will convert the model to TensorFlow Lite (TF Lite) format. This format is designed to make TensorFlow models more efficient and lightweight when running on mobile, embedded, and IoT devices.

You can convert a Keras model with TF Lite's TFLiteConverter class, and we've incorporated it in the short helper function below. Notice that there is a quantize flag which you can use to quantize the model.
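A sketch of such a helper, assuming it takes the Keras model, an output filename, and an optional quantize flag:

```python
def convert_tflite(model, filename, quantize=False):
    '''Converts a Keras model to TF Lite format and writes it to disk (sketch).'''
    converter = tf.lite.TFLiteConverter.from_keras_model(model)

    # Post-training quantization is enabled by setting the converter optimizations.
    if quantize:
        converter.optimizations = [tf.lite.Optimize.DEFAULT]

    tflite_model = converter.convert()

    with open(filename, 'wb') as f:
        f.write(tflite_model)
```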

You will use the helper function to convert the Keras model then get its size and accuracy. Take note that this is not yet quantized.

You will notice that there is already a slight decrease in model size when converting to .tflite format.

The accuracy will also be nearly identical when converting between formats. You can set up a TF Lite model for inference using its Interpreter class. This is shown in the evaluate_tflite_model() helper function provided in the Utilities section earlier.
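The evaluate_tflite_model() helper uses the Interpreter roughly like this (a sketch; the lab's version may differ in details such as the argument names):

```python
def evaluate_tflite_model(filename, images, labels):
    '''Runs a TF Lite model over a test set and returns its accuracy (sketch).'''
    interpreter = tf.lite.Interpreter(model_path=filename)
    interpreter.allocate_tensors()

    input_index = interpreter.get_input_details()[0]['index']
    output_index = interpreter.get_output_details()[0]['index']

    correct = 0
    for image, label in zip(images, labels):
        # Add a batch dimension and make sure the dtype matches the model input.
        input_tensor = np.expand_dims(image, axis=0).astype(np.float32)
        interpreter.set_tensor(input_index, input_tensor)
        interpreter.invoke()
        prediction = np.argmax(interpreter.get_tensor(output_index)[0])
        correct += int(prediction == label)

    return correct / len(images)
```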

Note: If you see a RuntimeError: There is at least 1 reference to internal data in the interpreter in the form of a numpy array or slice, please try re-running the cell.

Post-Training Quantization

Now that you have the baseline metrics, you can observe the effects of quantization. As mentioned in the lectures, this process involves converting floating point representations into integers to reduce model size and achieve faster computation.

As shown in the convert_tflite() helper function earlier, you can easily do post-training quantization with the TF Lite API. You just need to set the converter's optimizations and assign an Optimize enum value.

You will set the quantize flag to do that and get the metrics again.
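Using the (assumed) convert_tflite() and evaluate_tflite_model() helpers sketched above, that is just a matter of passing quantize=True:

```python
# Convert with post-training quantization, then measure size and accuracy.
QUANTIZED_TFLITE_FILE = 'quantized_model.tflite'  # hypothetical filename
convert_tflite(baseline_model, QUANTIZED_TFLITE_FILE, quantize=True)

quantized_accuracy = evaluate_tflite_model(QUANTIZED_TFLITE_FILE, test_images, test_labels)
print(f'Quantized TF Lite accuracy: {quantized_accuracy:.4f}')
```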

You should see around a 4X reduction in model size in the quantized version. This comes from converting the 32-bit float representations into 8-bit integers.

As mentioned in the lecture, you can expect the accuracy not to be exactly the same after quantizing the model. Most of the time it will decrease, but in some cases it can even increase. Either way, the change can be attributed to the loss of precision when you remove the extra bits from the float values.

Quantization Aware Training

When post-training quantization results in a loss of accuracy that is unacceptable for your application, you can consider doing quantization aware training before quantizing the model. This simulates the loss of precision by inserting fake quant nodes in the model during training. That way, your model will learn to adapt to the loss of precision and produce more accurate predictions.

The TensorFlow Model Optimization Toolkit provides a quantize_model() method to do this quickly, and you will see that below. But first, let's install the toolkit into the notebook environment.
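Installing the toolkit in a notebook environment is typically a one-liner:

```python
# Install the TensorFlow Model Optimization Toolkit (run inside the notebook).
!pip install -q tensorflow-model-optimization
```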

You will build the baseline model again but this time, you will pass it into the quantize_model() method to indicate quantization aware training.
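A sketch of wrapping the model for quantization aware training, assuming the same model_builder() utility as before:

```python
import tensorflow_model_optimization as tfmot

# Build a fresh model and wrap it for quantization aware training.
qat_model = tfmot.quantization.keras.quantize_model(model_builder())

# The wrapped model must be (re)compiled before training.
qat_model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
qat_model.summary()
```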

Take note that if you decide to pass in a model that is already trained, you should recompile it before you continue training.

You may have noticed a slight difference in the model summary above compared to the baseline model summary in the earlier sections. The total params count increased as expected because of the nodes added by the quantize_model() method.

With that, you can now train the model. You will notice that the accuracy is a bit lower because the model is simulating the loss of precision. You would need to train longer to achieve the same training accuracy as the earlier run. For this exercise though, we will keep it to 1 epoch.

You can then get the accuracy of the Keras model before and after quantizing the model. The accuracy is expected to be nearly identical because the model is trained to counter the effects of quantization.

Pruning

Let's now move on to another technique for reducing model size: pruning. This process involves zeroing out insignificant (i.e. low magnitude) weights. The intuition is that these weights do not contribute much to the predictions, so you can remove them and get nearly the same results. Making the weights sparse helps compress the model more efficiently, and you will see that in this section.

The TensorFlow Model Optimization Toolkit again has a convenience method for this. The prune_low_magnitude() method puts wrappers in a Keras model so it can be pruned during training. You will pass in the baseline model that you already trained earlier. You will notice that the model summary shows increased params because of the wrapper layers added by the pruning method.

You can set how the pruning is done during training. Below, you will use PolynomialDecay to indicate how the sparsity ramps up with each step. Another option available in the library is ConstantSparsity.
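A sketch of setting up pruning with a PolynomialDecay schedule (the sparsity targets and step counts below are assumptions, not the lab's exact values):

```python
import tensorflow_model_optimization as tfmot

# Hypothetical schedule: ramp sparsity from 50% to 80% over the training steps.
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.50,
        final_sparsity=0.80,
        begin_step=0,
        end_step=1000)
}

# Wrap the trained baseline model with pruning wrappers.
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(baseline_model, **pruning_params)

pruned_model.compile(optimizer='adam',
                     loss='sparse_categorical_crossentropy',
                     metrics=['accuracy'])
pruned_model.summary()
```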

You can also peek at the weights of one of the layers in your model. After pruning, you will notice that many of these will be zeroed out.

With that, you can now start re-training the model. Take note that the UpdatePruningStep() callback is required.
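Retraining with the required callback might look like this (the epoch count is an assumption):

```python
# The UpdatePruningStep callback keeps the pruning schedule in sync with training.
callbacks = [tfmot.sparsity.keras.UpdatePruningStep()]

pruned_model.fit(train_images, train_labels,
                 epochs=1, shuffle=False,
                 callbacks=callbacks)
```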

Now see what the weights in the same layer look like after pruning.

After pruning, you can remove the wrapper layers to have the same layers and params as the baseline model. You can do that with the strip_pruning() method as shown below. You will do this so you can save the model and also export to TF Lite format just like in the previous sections.
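Stripping the pruning wrappers is a single call (the variable name here is a placeholder):

```python
# Remove the pruning wrappers so the model has the same layers as the baseline.
stripped_pruned_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
stripped_pruned_model.summary()
```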

You will see the same model weights, but the indices are different because the wrappers were removed.

You will notice below that the pruned model will have the same file size as the baseline model when saved as H5. This is to be expected. The improvement will be noticeable when you compress the model, as shown in the cells after this.

You will use the get_gzipped_model_size() helper function in the Utilities to compress the models and get their resulting file sizes. You will notice that the pruned model is about 3 times smaller. This is because of the sparse weights generated by the pruning process. The zeros can be compressed much more efficiently than the low magnitude weights before pruning.
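The helper is defined in the Utilities section of the notebook; a sketch of one common way to write it:

```python
def get_gzipped_model_size(filename):
    '''Zips a saved model file and returns the compressed size in bytes (sketch).'''
    _, zipped_file = tempfile.mkstemp('.zip')
    with zipfile.ZipFile(zipped_file, 'w', compression=zipfile.ZIP_DEFLATED) as f:
        f.write(filename)
    return os.path.getsize(zipped_file)
```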

You can make the model even more lightweight by quantizing the pruned model. This achieves around 10X reduction in compressed model size as compared to the baseline.
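Combining the two techniques, you can quantize the stripped, pruned model with the same (assumed) helpers sketched earlier:

```python
# Quantize the pruned model to stack the size reductions from both techniques.
PRUNED_QUANTIZED_TFLITE_FILE = 'pruned_quantized_model.tflite'  # hypothetical filename
convert_tflite(stripped_pruned_model, PRUNED_QUANTIZED_TFLITE_FILE, quantize=True)

print('Compressed size:', get_gzipped_model_size(PRUNED_QUANTIZED_TFLITE_FILE), 'bytes')
```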

As expected, the TF Lite model's accuracy will also be close to the Keras model.

Wrap Up

In this notebook, you practiced several techniques for optimizing your models for mobile and embedded applications. You used quantization to reduce floating point representations to integers, then used pruning to make the weights sparse for efficient model compression. These make your models lightweight for efficient transport and storage without sacrificing model accuracy. Try these techniques on your own models and see what performance you get. For more information, here are a few other resources:

Congratulations and enjoy the rest of the course!