Ungraded Lab: Fairness Indicators

In this Colab notebook, you will use Fairness Indicators to explore the Smiling attribute in a large-scale face image dataset. Fairness Indicators is a suite of tools built on top of TensorFlow Model Analysis (TFMA) that enables regular evaluation of fairness metrics in product pipelines. This introductory video provides more details and context on the real-world scenario presented here, which is one of the primary motivations for creating Fairness Indicators. This notebook will teach you to:

Credits: Some of the code and discussions are taken from this TensorFlow tutorial.

Install Fairness Indicators

This will install all related libraries such as TFMA and TFDV.

Note: In Google Colab, you need to restart the runtime at this point to finalize updating the packages you just installed. You can do so by clicking the Restart Runtime button at the end of the output cell above (after installation), or by selecting Runtime > Restart Runtime in the menu bar. Please do not proceed to the next section without restarting. You can also ignore the errors about version incompatibility of some of the bundled packages because you won't be using those in this notebook.

Import packages

Next, you will import the main packages and some utilities you will need in this notebook. Notice that you are not importing fairness-indicators directly. As mentioned in the intro, this suite of tools is built on top of TFMA so you can just import TFMA to access it.

The code below should not produce any errors. If it does, please repeat the installation step and restart the runtime.

Download and prepare the dataset

CelebA is a large-scale face attributes dataset with more than 200,000 celebrity images, each with 40 attribute annotations (such as hair type, fashion accessories, facial features, etc.) and 5 landmark locations (eyes, mouth and nose positions). For more details, see this paper.

With the permission of the owners, this dataset is stored on Google Cloud Storage (GCS) and mostly accessed via TensorFlow Datasets (tfds). To save on download time and disk space, you will use the GCS bucket specified below as your data directory. The bucket already contains the TFRecords. If you want to download the dataset to your workspace, you can pass a local directory to the data_dir argument. Just take note that the download will take some time to complete.

You can preview some of the images in the dataset.

You can also view the dataset as a dataframe to preview the other attributes in tabular format.

Let's list the column headers so you can see the attribute names in the dataset. For this notebook, you will just examine the attributes/Young and attributes/Smiling features, but feel free to pick other features once you've gone over the whole exercise.

In this notebook:


* While there is little information available about the labeling methodology for this dataset, you will assume that the "Smiling" attribute was determined by a pleased, kind, or amused expression on the subject's face. For the purpose of this example, you will take these labels as ground truth.

Caveats

Before moving forward, there are several considerations to keep in mind when using CelebA:

Setting Up Input Functions

Now, you will define the preprocessing functions to prepare your data as model inputs. These include resizing images, normalizing pixels, casting to the right data type, and grouping the features and labels.
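As a minimal, framework-free sketch of the normalizing and grouping steps, the snippet below assumes that pixels arrive as 8-bit integers and that each example is a dictionary with `image` and `attributes` keys (a hypothetical layout, chosen only for illustration). Resizing is omitted, and the notebook's own functions would typically use TensorFlow ops instead of plain lists.

```python
def normalize_pixels(image):
    """Scale 8-bit pixel values (0-255) to floats in the range [0.0, 1.0]."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row]
            for row in image]


def to_features_and_label(example):
    """Group the normalized image (feature) with the Smiling attribute (label)."""
    features = {"image": normalize_pixels(example["image"])}
    label = float(example["attributes"]["Smiling"])  # cast bool/int to float
    return features, label


# Hypothetical 1x2 RGB image, used only to illustrate the transformation.
example = {
    "image": [[[0, 255, 51], [255, 255, 0]]],
    "attributes": {"Smiling": True, "Young": False},
}
features, label = to_features_and_label(example)
print(features["image"][0][0], label)
```

Keeping the label as a float matches a model that outputs a single real-valued score.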

Prepare train and test splits

This next helper function will help split, shuffle, batch and preprocess your training data. For this notebook, you will just develop a model that accepts the image as input and outputs the Smiling attribute (i.e. label).

The test split does not need to be shuffled so you can just preprocess it like below.

As a sanity check, you can examine the contents of one example in the test data. You should see that the image has been reshaped and that the pixel values are normalized.

Build a simple DNN Model

With the dataset prepared, you will now assemble a simple tf.keras.Sequential model to classify your images. The model consists of:

  1. An input layer that represents the flattened 28x28x3 image.
  2. A fully connected layer with 64 units activated by a ReLU function.
  3. A single-unit readout layer that outputs real-valued scores instead of probabilities.
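To get a feel for how small this model is, the arithmetic below counts its trainable parameters from the layer sizes listed above. It assumes each dense layer carries a bias term (the Keras default); the variable names are only illustrative.

```python
# Layer sizes taken from the list above.
input_size = 28 * 28 * 3   # flattened image
hidden_units = 64
output_units = 1

# Each dense layer has one weight per (input, unit) pair plus one bias per unit
# (assumption: biases are enabled, which is the Keras default).
hidden_params = input_size * hidden_units + hidden_units
output_params = hidden_units * output_units + output_units
total_params = hidden_params + output_params

print(input_size, hidden_params, output_params, total_params)
# prints: 2352 150592 65 150657
```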

You may be able to greatly improve model performance by adding some complexity (e.g., more densely-connected layers, exploring different activation functions, increasing image size), but that may distract from the goal of demonstrating how easy it is to apply the indicators when working with Keras. For that reason, you will first keep the model simple, but feel free to explore this space later.

Train & Evaluate Model

You’re now ready to train your model. To cut back on the amount of execution time and memory, you will train the model by slicing the data into small batches with only a few repeated iterations.

Evaluating the model on the test data should result in a final accuracy score of just over 85%. Not bad for a simple model with no fine-tuning.

You will then save the model so you can analyze it in the next section.

Model Analysis

As you already know, it is usually not enough to just measure your model's performance on global metrics. For instance, performance evaluated across age groups may reveal some shortcomings.
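The toy example below illustrates the point with made-up records: a respectable overall accuracy can hide a slice on which the model fails completely. The data and helper names are hypothetical and are not taken from the CelebA results.

```python
from collections import defaultdict


def accuracy(pairs):
    """Fraction of (label, prediction) pairs that agree."""
    return sum(label == pred for label, pred in pairs) / len(pairs)


# Made-up (group, label, prediction) records, for illustration only.
records = [
    ("Young", 1, 1), ("Young", 0, 0), ("Young", 1, 1), ("Young", 0, 0),
    ("Young", 1, 1), ("Young", 0, 0), ("Young", 1, 1), ("Young", 0, 0),
    ("Not Young", 1, 0), ("Not Young", 0, 1),
]

# Global metric over all records.
overall = accuracy([(label, pred) for _, label, pred in records])

# The same metric computed separately for each slice.
by_group = defaultdict(list)
for group, label, pred in records:
    by_group[group].append((label, pred))
per_group = {group: accuracy(pairs) for group, pairs in by_group.items()}

print(overall)    # 0.8
print(per_group)  # {'Young': 1.0, 'Not Young': 0.0}
```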

To explore this further, you will evaluate the model with Fairness Indicators via TFMA. In particular, you will see whether there is a significant gap in performance between "Young" and "Not Young" categories when evaluated on false positive rate (FPR).

A false positive error occurs when the model incorrectly predicts the positive class. In this context, a false positive outcome occurs when the ground truth is an image of a celebrity 'Not Smiling' and the model predicts 'Smiling'. While this seems like a relatively mundane error, false positive errors can sometimes cause more problematic behaviors when deployed in a real-world application. For instance, a false positive error in a spam classifier could cause a user to miss an important email.
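The false positive rate is the share of truly negative examples that the model labels as positive, i.e. FPR = FP / (FP + TN). Here is a small sketch of that definition on made-up labels; the function name is illustrative and this is not how TFMA computes the metric internally.

```python
def false_positive_rate(labels, predictions):
    """FPR = FP / (FP + TN), computed over the truly negative examples."""
    false_positives = sum(
        1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    true_negatives = sum(
        1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    negatives = false_positives + true_negatives
    if negatives == 0:
        raise ValueError("FPR is undefined when there are no negative examples")
    return false_positives / negatives


# Made-up data: 1 = 'Smiling', 0 = 'Not Smiling'.
labels      = [0, 0, 0, 0, 1, 1]
predictions = [1, 0, 0, 0, 1, 0]
print(false_positive_rate(labels, predictions))  # 0.25
```

Note that the missed 'Smiling' example (a false negative) does not affect the FPR, since only the four 'Not Smiling' examples enter the denominator.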

You will mostly follow the same steps as you did in the first ungraded lab of this week. Namely, you will:

Create TFRecord

You will need to serialize the preprocessed test dataset so it can be read by TFMA. We've provided a helper function to do just that. Notice that the age group feature is transformed into a string ('Young' or 'Not Young'). This will come in handy in the visualization so the tags are easier to interpret (compared to just 1 or 0).
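The sketch below shows the kind of per-example transformation described here, without the actual TFRecord writing. The `to_record` helper and the `image`, `label` and `group` keys are assumptions made for illustration; the group is encoded as bytes because string features in a TFRecord are stored as byte strings.

```python
def to_record(image, smiling, young):
    """Flatten one example into plain values that are easy to serialize."""
    flat_pixels = [float(c) for row in image for pixel in row for c in pixel]
    return {
        "image": flat_pixels,
        "label": float(smiling),
        # Readable slice names instead of 1 / 0.
        "group": b"Young" if young else b"Not Young",
    }


# Hypothetical 1x1 RGB image with already-normalized pixel values.
record = to_record([[[0.0, 0.5, 1.0]]], smiling=1, young=0)
print(record)
# {'image': [0.0, 0.5, 1.0], 'label': 1.0, 'group': b'Not Young'}
```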

Write EvalConfig file

Next, you will define the model, metrics, and slicing specs in an eval config file. As mentioned, you will slice the data across age groups to see if there is an underlying problem. For metrics, you will include the FairnessIndicators class, which computes commonly identified fairness metrics for binary and multiclass classifiers. You will also configure a list of thresholds. These let you observe whether the model behaves better when the decision threshold between the two classes is changed (e.g. will the FPR be lower if the model predicts "Smiling" for outputs greater than 0.22?).
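One way to write such a config is as a protobuf text string. The sketch below only builds that string; the `label` and `group` feature names and the extra thresholds (0.5 and 0.75) are assumptions for illustration, and the `ExampleCount` metric is added so you can see how many examples fall in each slice.

```python
# Assumed feature names; the keys used in your serialized examples may differ.
LABEL_KEY = "label"
GROUP_KEY = "group"

eval_config_pbtxt = """
model_specs {
  label_key: "%s"
}
metrics_specs {
  metrics {
    class_name: "FairnessIndicators"
    # Illustrative thresholds; only 0.22 is mentioned in the text.
    config: '{ "thresholds": [0.22, 0.5, 0.75] }'
  }
  metrics {
    class_name: "ExampleCount"
  }
}
# An empty slicing spec selects the overall (unsliced) dataset.
slicing_specs {}
# Slice by the age group feature.
slicing_specs {
  feature_keys: "%s"
}
""" % (LABEL_KEY, GROUP_KEY)

print(eval_config_pbtxt)
```

With TFMA installed, a string like this can be turned into a config object with `google.protobuf.text_format.Parse(eval_config_pbtxt, tfma.EvalConfig())`.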

Create EvalSharedModel

This will be identical to the command you ran in an earlier lab. This is needed so TFMA will know how to load and configure your model from disk.

Create a Schema

This is an additional step compared to your previous TFMA workflow. It is needed because, unlike the TFMA ungraded lab, you didn't include a serving signature with the model. If you remember, the function called by that signature took care of parsing the TFRecords, converting them to the correct data type, and preprocessing. Since that part is not included in this lab, you will need to provide a schema so TFMA will know what data types are in the serialized examples when it parses each TFRecord into a dictionary of features. You will also need to define the dimensions of the image since that is expected by your model input. That is handled by the tensor_representation_group below.
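A schema of this kind can also be written as a protobuf text string. The sketch below is one plausible layout, assuming the serialized examples use the feature names `image`, `label` and `group` and that the image is a 28x28x3 float tensor; adjust the names and types to match your own records.

```python
# Assumed feature names; they must match the keys in the serialized examples.
IMAGE_KEY = "image"
LABEL_KEY = "label"
GROUP_KEY = "group"

schema_pbtxt = """
tensor_representation_group {
  key: ""
  value {
    tensor_representation {
      key: "%s"
      value {
        dense_tensor {
          column_name: "%s"
          # Shape expected by the model input.
          shape {
            dim { size: 28 }
            dim { size: 28 }
            dim { size: 3 }
          }
        }
      }
    }
  }
}
feature { name: "%s" type: FLOAT }
feature { name: "%s" type: FLOAT }
feature { name: "%s" type: BYTES }
""" % (IMAGE_KEY, IMAGE_KEY, IMAGE_KEY, LABEL_KEY, GROUP_KEY)

print(schema_pbtxt)
```

With TensorFlow Metadata installed, the string can be parsed with `text_format.Parse(schema_pbtxt, schema_pb2.Schema())`, where `schema_pb2` comes from `tensorflow_metadata.proto.v0` and `text_format` from `google.protobuf`.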

Run TFMA

You will pass the objects you created in the previous sections to tfma.run_model_analysis(). As you've done previously, this will take care of loading the model and data, and computing the metrics on the data slices you specified.

Now you can view the fairness metrics you specified. The FPR should already be selected and you can see that it is considerably higher for the Not Young age group. Try to explore the widget and see if you can make other findings. Here are some suggestions:

After studying the discrepancies in your predictions, you can investigate why they happen and plan how to remediate them. Aside from changing your model architecture, you can also look first at your training data. fairness-indicators is also packaged with TFDV, so you can use it to generate statistics from your data. Here is a short review of how to do that.

First, you will download the dataset from a GCS bucket into your local workspace. You can use the gsutil tool to help with that.

Now you can generate the statistics for a specific feature.

The statistics show that the Not Young age group (i.e. 0 in the attributes/Young column) has very few images compared to the Young age group. This imbalance may be why the model performs better on the Young images. You could try adding more Not Young images and see if your model performs better on this slice.
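The same kind of imbalance check can be sketched in a few lines of plain Python. The column values below are made up for illustration and do not reflect the actual CelebA proportions.

```python
from collections import Counter

# Made-up values for the attributes/Young column (1 = Young, 0 = Not Young).
young_column = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]

counts = Counter(young_column)
total = sum(counts.values())
shares = {value: count / total for value, count in counts.items()}

print(counts[1], counts[0])  # 8 2
print(shares[0])             # 0.2
```

A small share for one group is a hint, not proof, that the slice is under-represented in training; it is worth confirming against the per-slice metrics.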

Wrap Up

In this lab, you prepared an image dataset and trained a model to predict one of its attributes (i.e. Smiling). You then sliced the data based on age groups and computed fairness metrics from the Fairness Indicators package via TFMA. Though the outcome looks simple, it is an important step in production ML projects because not detecting these problems can greatly affect the experience of your users. Improving these metrics will help you commit to fairness in your applications. We encourage you to try exploring more slices of the dataset and see what findings you can come up with.

For more practice, here is an official tutorial that uses Fairness Indicators on text data. It uses the What-If Tool, which is another package that comes with Fairness Indicators. You will also get to explore that in this week's programming assignment.