Ungraded Lab: TensorFlow Model Analysis

In production systems, the decision to deploy a model usually goes beyond the global metrics (e.g. accuracy) measured during training. It is also important to evaluate how your model performs in different scenarios. For instance, does your weather forecasting model perform equally well in summer and in winter? Or does your camera-based defect detector work only in certain lighting conditions? This type of investigation helps ensure that your model can handle different cases. More than that, it can help uncover any learned biases that can result in a negative experience for your users. For example, if your application is supposed to be gender-neutral, you don't want your model to perform well for one gender and poorly for another.

In this lab, you will be working with TensorFlow Model Analysis (TFMA) -- a library built specifically for analyzing a model's performance across different configurations. It allows you to specify slices of your data, then it will compute and visualize how your model performs on each slice. You can also set thresholds that your model must meet before it is marked ready for deployment. These help you make better decisions regarding any improvements you may want to make to boost your model's performance and ensure fairness.

For this exercise, you will use TFMA to analyze models trained on the Census Income dataset. Specifically, you will:

Credits: Some of the code and discussions are based on the TensorFlow team's official tutorial.

Setup

In this section, you will first set up your workspace so it has all the modules and files needed to work with TFMA. You will:

Install Jupyter Extensions

If running in a local Jupyter notebook, then these Jupyter extensions must be installed in the environment before running Jupyter. These are already available in Colab so we'll just leave the commands here for reference.

jupyter nbextension enable --py widgetsnbextension --sys-prefix 
jupyter nbextension install --py --symlink tensorflow_model_analysis --sys-prefix 
jupyter nbextension enable --py tensorflow_model_analysis --sys-prefix

Install libraries

This will pull in all the dependencies and will take 6 to 8 minutes to complete.

Note: In Google Colab, you need to restart the runtime at this point to finalize updating the packages you just installed. Please do not proceed to the next section without restarting. You can also ignore the errors about version incompatibility of some of the bundled packages because we won't be using those in this notebook.

Check the installation

Running the code below should show the versions of the packages. Please re-run the install if you are seeing errors and don't forget to restart the runtime after re-installation.
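For reference, a minimal version check might look like the sketch below (the exact print statements in the notebook may differ):

```python
import tensorflow as tf
import tensorflow_model_analysis as tfma

# Print the installed versions to confirm the setup succeeded.
print('TF version: {}'.format(tf.__version__))
print('TFMA version: {}'.format(tfma.VERSION))
```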

Load the files

Next, you will download the files you will need for this exercise:

We've also defined some global variables below so you can access these files throughout the notebook more easily.

You can see the top-level files and directories by running the cell below (or just by using the file explorer on the left side of this Colab). We'll discuss what each contains in the next sections.

Preview the dataset

The data/csv directory contains the test split of the Census Income dataset. We've divided it into several files for this demo notebook:

You can see the description of each column here (please open the link in a new window if Colab prevents the download). For simplicity, we've already preprocessed the label column to be binary (i.e. 0 or 1) to match the model's output. In your own projects, your labels might come in a different data type (e.g. string), and you will want to transform them first so you can evaluate your model properly. You can preview the first few rows below:
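As a sketch of what that preview could look like, you might load one of the CSVs with pandas; the file name below is a placeholder for whichever split you want to inspect.

```python
import pandas as pd

# Placeholder path; use one of the actual files under data/csv.
sample_csv = 'data/csv/data.csv'

# Show the first few rows to verify the columns and the binary label.
df = pd.read_csv(sample_csv)
print(df.head())
```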

Parse the Schema

You also downloaded a schema generated by TensorFlow Data Validation. You should be familiar with this file type already from previous courses. You will load it now so you can use it in the later parts of the notebook.
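One way to load a text-format schema is sketched below; the file name is an assumption and should be replaced with the actual path of the downloaded schema.

```python
import tensorflow as tf
from google.protobuf import text_format
from tensorflow_metadata.proto.v0 import schema_pb2

# Placeholder path to the downloaded schema file.
SCHEMA_PATH = 'schema.pbtxt'

# Parse the text-format file into a Schema protocol buffer.
schema = schema_pb2.Schema()
schema_text = tf.io.gfile.GFile(SCHEMA_PATH, 'r').read()
schema = text_format.Parse(schema_text, schema)
```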

Use the Schema to Create TFRecords

TFMA needs a TFRecord file as input, so you need to convert the CSVs in the data directory. If you've done the earlier labs, you will know that this can easily be done with ExampleGen. For this notebook, however, you will use the helper function below to demonstrate how the conversion can be done outside a TFX pipeline. You will pass in the schema you loaded in the previous step to determine the correct type of each feature.

The code below will do the conversion and we've defined some more global variables that you will use in later exercises.
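A sketch of such a helper is shown below. It reads each CSV row, uses the schema to decide whether each feature should be stored as a float, int, or bytes, and writes the resulting tf.train.Examples to a TFRecord file. The function name and paths are illustrative, not the notebook's actual globals.

```python
import csv
import tensorflow as tf
from tensorflow_metadata.proto.v0 import schema_pb2

def csv_to_tfrecord(csv_path, tfrecord_path, schema):
    """Converts a CSV file to TFRecord, typing each feature via the schema."""
    with open(csv_path, 'r') as f, tf.io.TFRecordWriter(tfrecord_path) as writer:
        for row in csv.DictReader(f):
            example = tf.train.Example()
            for feature in schema.feature:
                key = feature.name
                value = row[key]
                if feature.type == schema_pb2.FLOAT:
                    example.features.feature[key].float_list.value[:] = (
                        [float(value)] if value else [])
                elif feature.type == schema_pb2.INT:
                    example.features.feature[key].int64_list.value[:] = (
                        [int(value)] if value else [])
                elif feature.type == schema_pb2.BYTES:
                    example.features.feature[key].bytes_list.value[:] = (
                        [value.encode('utf-8')] if value else [])
            writer.write(example.SerializeToString())

# Placeholder paths for one of the CSV splits and its TFRecord output.
csv_to_tfrecord('data/csv/data.csv', 'data/tfrecord/data.tfrecord', schema)
```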

Pretrained models

Lastly, you also downloaded pretrained Keras models, which are stored in the models/ directory. TFMA supports a number of different model types including TF Keras models, models based on generic TF2 signature APIs, as well as TF Estimator-based models. The get_started guide has the full list of supported model types and any restrictions. You can also consult the FAQ for examples of how to configure these models.

We have included three models and you can choose to analyze any one of them in the later sections. These were saved in SavedModel format which is the default when saving with the Keras Models API.

As mentioned earlier, these models were trained on the Census Income dataset. The label is 1 if a person earns more than 50k USD and 0 if less than or equal. You can load one of the models and look at the summary to get a sense of its architecture. All three models use the same architecture but were trained with different epochs to simulate varying performance.
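For example, assuming one of the pretrained models lives in a directory like the placeholder below, you could load it and print its summary:

```python
import tensorflow as tf

# Placeholder path; replace with the actual subdirectory under models/.
MODEL_PATH = 'models/model_1'

model = tf.keras.models.load_model(MODEL_PATH)
model.summary()
```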

You can see the code to build these in the next lab. For now, you'll only need to take note of a few things. First, the output is a single dense unit with a sigmoid activation (i.e. dense_5 above). This is standard for binary classification problems.

Another is that the model is exported with a transformation layer. You can see this at the bottom of the summary above, in the row named transform_features_layer; note that it is not connected to the other layers. From previous labs, you will know that this layer is taken from the graph generated by the Transform component. It helps avoid training-serving skew by making sure that raw inputs are transformed in the same way the model expects. It is also available as the tft_layer property of the model object.

TFMA invokes this layer automatically for your raw inputs but we've included a short snippet below to demonstrate how the transformation works. You can see that the raw features are indeed reformatted to an acceptable input for the model. The raw numeric features are scaled and the raw categorical (string) features are encoded to one-hot vectors.
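The snippet below is a sketch of that demonstration. It assumes the schema from the earlier step, a TFRECORD_PATH placeholder for the converted data, and a label column named label; adjust these to match the notebook's actual globals.

```python
import tensorflow as tf
from tensorflow_transform.tf_metadata import schema_utils

# Placeholder names; the notebook defines its own globals for these.
TFRECORD_PATH = 'data/tfrecord/data.tfrecord'
LABEL_KEY = 'label'  # assumed name of the preprocessed label column

# Build a feature spec from the schema so the serialized examples can be parsed.
feature_spec = schema_utils.schema_as_feature_spec(schema).feature_spec

# Read a small batch of raw examples from the TFRecord file created earlier.
raw_dataset = tf.data.TFRecordDataset(TFRECORD_PATH).batch(3)
serialized_batch = next(iter(raw_dataset))
raw_features = tf.io.parse_example(serialized_batch, feature_spec)

# Keep the labels aside; the Transform layer only expects the input features.
labels = raw_features.pop(LABEL_KEY)

# Numeric features come out scaled and string features one-hot encoded.
transformed_features = model.tft_layer(raw_features)
print(transformed_features)
```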

The transformed features can be passed into the model to get the predictions. The snippet below demonstrates this and we used a low-threshold BinaryAccuracy metric to compare the true labels and model predictions.
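Continuing from the snippet above, a sketch of that comparison might look like this (the threshold value here is illustrative; the notebook may use a different one):

```python
import tensorflow as tf

# Feed the transformed features to the model to get probability predictions.
predictions = model(transformed_features)

# Compare predictions against the true labels. A low threshold means any
# prediction above it is counted as the positive class.
metric = tf.keras.metrics.BinaryAccuracy(threshold=0.1)
metric.update_state(y_true=labels, y_pred=predictions)
print('Binary accuracy: {}'.format(metric.result().numpy()))
```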

The last thing to note is that the model is also exported with a serving signature. You will learn more about this in the next lab and in later parts of the specialization, but for now, you can think of it as a configuration for when the model is deployed for inference. For this particular model, the default signature is configured to transform batches of serialized raw features before feeding them to the model input. That way, you don't have to explicitly code the transformations as shown in the snippet above. You can just pass in batches of data directly, as shown below.
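As a sketch, you can call the default serving signature directly on the serialized batch from the earlier snippet:

```python
# Inspect the inputs expected by the default serving signature.
serving_fn = model.signatures['serving_default']
print(serving_fn.structured_input_signature)

# Pass serialized raw examples directly; the signature applies the Transform
# graph internally before running the model.
outputs = serving_fn(serialized_batch)
print(outputs)
```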

TFMA accesses this model signature so it can work with the raw data and evaluate the model to get the metrics. Not only that, it can also extract specific features and domain values from your dataset before it computes these metrics. Let's see how this is done in the next section.

Setup and Run TFMA

With the dataset and model now available, you can move on to using TFMA. There are some additional steps needed:

Create EvalConfig

The tfma.EvalConfig() is a protocol message that sets up the analysis. Here, you will specify:

The eval config should be passed as a protocol message and you can use the google.protobuf.text_format module for that as shown below.
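A sketch of such a config is shown below. The label key, metric list, and slice definitions are assumptions chosen to match the discussion in this lab; the notebook's actual settings may differ.

```python
import tensorflow_model_analysis as tfma
from google.protobuf import text_format

eval_config = text_format.Parse("""
  ## Model information
  model_specs {
    label_key: "label"
  }

  ## Metrics to compute
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics { class_name: "BinaryAccuracy" }
    metrics { class_name: "AUC" }
  }

  ## Slices to analyze
  slicing_specs {}                                 # the Overall slice
  slicing_specs { feature_keys: ["sex"] }          # slice on a single feature
  slicing_specs { feature_keys: ["sex", "race"] }  # feature cross
  slicing_specs {                                  # specific feature values
    feature_values { key: "native-country" value: "United-States" }
  }
  slicing_specs {
    feature_values { key: "native-country" value: "Mexico" }
  }
""", tfma.EvalConfig())
```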

Create EvalSharedModel

TFMA also requires an EvalSharedModel instance that points to your model so it can be shared between multiple threads in the same process. This instance includes information about the type of model (keras, etc) and how to load and configure the model from its saved location on disk (e.g. tags, etc). The tfma.default_eval_shared_model() API can be used to create this given the model location and eval config.
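A minimal sketch, reusing the MODEL_PATH placeholder from earlier:

```python
import tensorflow_model_analysis as tfma

eval_shared_model = tfma.default_eval_shared_model(
    eval_saved_model_path=MODEL_PATH,
    eval_config=eval_config)
```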

Run TFMA

With the setup complete, you just need to declare an output directory then run TFMA. You will pass in the eval config, shared model, dataset, and output directory to tfma.run_model_analysis() as shown below. This will create a tfma.EvalResult which you can use later for rendering metrics and plots.
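A sketch of that call, with a placeholder output directory and the TFRECORD_PATH placeholder from earlier:

```python
import tensorflow_model_analysis as tfma

OUTPUT_DIR = 'tfma_output'  # placeholder output directory

eval_result = tfma.run_model_analysis(
    eval_shared_model=eval_shared_model,
    eval_config=eval_config,
    data_location=TFRECORD_PATH,
    output_path=OUTPUT_DIR,
    file_format='tfrecords')
```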

Visualizing Metrics and Plots

You can also visualize the results using TFMA methods. In this section, you will view the returned metrics and plots for the different slices you specified in the eval config.

Rendering Metrics

You can view the metrics with the tfma.view.render_slicing_metrics() method. By default, the views will display the Overall slice. To view a particular slice you can pass in a feature name to the slicing_column argument as shown below. You can visualize the different metrics through the Show dropdown menu and you can hover over the bar charts to show the exact value measured.
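For instance, a sketch of both views:

```python
import tensorflow_model_analysis as tfma

# Default view: metrics for the Overall slice.
tfma.view.render_slicing_metrics(eval_result)

# Metrics broken down by the values of a single feature.
tfma.view.render_slicing_metrics(eval_result, slicing_column='sex')
```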

We encourage you to try the different options you see and also modify the command. Here are some examples:

More Slices

If you haven't yet, you can also pass in native-country as the slicing column. The difference in this visualization is that we only specified two of its values in the eval config earlier. This is useful if you just want to study a subgroup of a particular feature rather than its entire domain.

TFMA also supports creating feature crosses to analyze combinations of features. Our original settings created a cross between sex and race and you can pass it in as a SlicingSpec as shown below.

In some cases, crossing the two columns creates a lot of combinations. You can narrow down the results to only look at specific values by specifying them in the slicing_spec argument. The example below shows the results for the sex feature within the Other race.
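Sketches of both views described above, the full sex-by-race cross and the cross narrowed to a single race value:

```python
import tensorflow_model_analysis as tfma

# Feature cross between sex and race.
tfma.view.render_slicing_metrics(
    eval_result,
    slicing_spec=tfma.SlicingSpec(feature_keys=['sex', 'race']))

# Narrow the cross down: sex slices within the Other race only.
tfma.view.render_slicing_metrics(
    eval_result,
    slicing_spec=tfma.SlicingSpec(
        feature_keys=['sex'], feature_values={'race': 'Other'}))
```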

Rendering Plots

Any plots that were added to the tfma.EvalConfig as post training metric_specs can be displayed using tfma.view.render_plot.

As with metrics, plots can be viewed by slice. Unlike metrics, only plots for a particular slice value can be displayed, so the tfma.SlicingSpec must specify both a slice feature name and value. If no slice is provided, then the plots for the Overall slice are displayed.

The example below displays the plots that were computed for the sex:Male slice. You can click on the names at the bottom of the graph to see a different plot type. Alternatively, you can tick the Show all plots checkbox to show all the plots in one screen.
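A sketch of rendering the plots for that slice:

```python
import tensorflow_model_analysis as tfma

# Both the feature name and value must be given when rendering plots per slice.
tfma.view.render_plot(
    eval_result,
    slicing_spec=tfma.SlicingSpec(feature_values={'sex': 'Male'}))
```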

Tracking Model Performance Over Time

Your training dataset will be used for training your model and will hopefully be representative of your test dataset and the data that will be sent to your model in production. However, while the data in inference requests may initially resemble your training data, it can start to change enough that the performance of your model changes. That means you need to monitor and measure your model's performance on an ongoing basis, so that you can be aware of changes and react to them.

Let's take a look at how TFMA can help. You will load three different datasets and compare the model analysis results using the render_time_series() method.
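The general pattern is to load the evaluation results that were written to separate output directories (one per run) and render them together. The directory names below are placeholders.

```python
import tensorflow_model_analysis as tfma

# Placeholder output directories, e.g. one evaluation run per day of new data.
output_paths = ['tfma_output_day1', 'tfma_output_day2', 'tfma_output_day3']

# Load the stored results and render them as a time series for comparison.
eval_results = tfma.load_eval_results(output_paths)
tfma.view.render_time_series(eval_results)
```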

First, imagine that you trained and deployed your model yesterday, and now you want to see how it's doing on the new data coming in today. The visualization will start by displaying AUC. From the UI, you can:

Note: In the metric series charts, the x-axis is just the model directory name of the model that you're examining.

Now imagine that another day has passed and you want to see how it's doing on the new data coming in today.

This type of investigation lets you see if your model is behaving poorly on new data, and you can decide whether to retrain your production model based on these results. Retraining might not always produce better results, though, and you need a way to detect that as well. You will see how TFMA helps in that regard in the next section.

Model Validation

TFMA can be configured to evaluate multiple models at the same time. Typically, this is done to compare a candidate model against a baseline (such as the currently serving model) to determine what the performance differences in metrics are. When thresholds are configured, TFMA will produce a tfma.ValidationResult record indicating whether the performance matches expectations.

Below, you will re-configure the EvalConfig settings to compare two models: a candidate and a baseline. You will also validate the candidate's performance against the baseline by setting a tfma.MetricThreshold on the BinaryAccuracy metric. This helps in determining if your new model can indeed replace your currently deployed model.
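A sketch of this setup is shown below. The model names, paths, label key, and threshold values are assumptions for illustration; note that the model_name passed to each EvalSharedModel must match the name in the corresponding model_specs entry.

```python
import tensorflow_model_analysis as tfma
from google.protobuf import text_format

eval_config_with_baseline = text_format.Parse("""
  ## Candidate and baseline model specs
  model_specs {
    name: "candidate"
    label_key: "label"
  }
  model_specs {
    name: "baseline"
    label_key: "label"
    is_baseline: true
  }

  ## Metrics, with a threshold on BinaryAccuracy
  metrics_specs {
    metrics { class_name: "ExampleCount" }
    metrics {
      class_name: "BinaryAccuracy"
      threshold {
        # Absolute requirement on the candidate model.
        value_threshold {
          lower_bound { value: 0.9 }
        }
        # The candidate must not do worse than the baseline.
        change_threshold {
          direction: HIGHER_IS_BETTER
          absolute { value: -1e-10 }
        }
      }
    }
  }

  slicing_specs {}
""", tfma.EvalConfig())

# One EvalSharedModel per model. CANDIDATE_MODEL_PATH and BASELINE_MODEL_PATH
# are placeholders for two of the pretrained model directories.
eval_shared_models = [
    tfma.default_eval_shared_model(
        model_name='candidate',
        eval_saved_model_path=CANDIDATE_MODEL_PATH,
        eval_config=eval_config_with_baseline),
    tfma.default_eval_shared_model(
        model_name='baseline',
        eval_saved_model_path=BASELINE_MODEL_PATH,
        eval_config=eval_config_with_baseline),
]

VALIDATION_OUTPUT_DIR = 'tfma_validation_output'  # placeholder

eval_result_vs_baseline = tfma.run_model_analysis(
    eval_shared_model=eval_shared_models,
    eval_config=eval_config_with_baseline,
    data_location=TFRECORD_PATH,
    output_path=VALIDATION_OUTPUT_DIR,
    file_format='tfrecords')
```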

When running evaluations of one or more models against a baseline, TFMA automatically adds difference metrics for all the metrics computed during the evaluation. These are named after the corresponding metric but with the string _diff appended to the metric name. A positive value for these _diff metrics indicates improved performance relative to the baseline.

Like in the previous section, you can view the results with render_time_series().

You can use tfma.load_validation_result to view the validation results for the thresholds you specified. For this example, the validation fails because BinaryAccuracy is below the threshold.
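A minimal sketch, reusing the placeholder output directory from the validation run above:

```python
import tensorflow_model_analysis as tfma

# Load the validation record written alongside the metrics and check whether
# the candidate met all of the configured thresholds.
validation_result = tfma.load_validation_result(VALIDATION_OUTPUT_DIR)
print(validation_result.validation_ok)
print(validation_result)
```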

Congratulations! You have now explored the different methods of model analysis using TFMA. In the next section, you will see how these can fit into a TFX pipeline so you can automate the process and store the results in your pipeline directory and metadata store.