Ungraded Lab: Model Analysis with TFX Evaluator

Now that you've used TFMA as a standalone library in the previous lab, you will see how TFX uses it through its Evaluator component. This component runs after your Trainer and checks whether your trained model meets the minimum required metrics; it also compares the model with previously generated models.

You will go through a TFX pipeline that prepares and trains the same model architecture you used in the previous lab. As a reminder, this is a binary classifier to be trained on the Census Income dataset. Since you're already familiar with the earlier TFX components, we will go over them quickly, but we've placed notes on where you can modify the code if you want to practice or produce a better result.

Let's begin!

Credits: Some of the code and discussions are based on the TensorFlow team's official tutorial.

Setup

Install TFX
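For reference, the installation cell is typically just a pip command (the exact version pin, if any, is whatever the lab specifies; this is only a sketch):

```python
# Install TFX inside the notebook. The lab may pin a specific version.
!pip install -U tfx
```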

Note: In Google Colab, you need to restart the runtime at this point to finalize updating the packages you just installed. You can do so by clicking the Restart Runtime button at the end of the output cell above (after installation), or by selecting Runtime > Restart Runtime in the menu bar. Please do not proceed to the next section without restarting. You can also ignore errors about version incompatibility of some of the bundled packages because we won't be using those in this notebook.

Imports

Set up pipeline paths

Download and prepare the dataset

Here, you will download the training split of the Census Income Dataset. This is twice as large as the test dataset you used in the previous lab.

Take a quick look at the first few rows.
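A rough sketch of these two steps, assuming the UCI-hosted copy of the dataset and hypothetical local paths (the lab's own cell may use a different URL and filenames):

```python
import os
import urllib.request
import pandas as pd

# Hypothetical locations -- the lab's own cell may use different paths/URLs.
_data_root = './data/census'
_data_filepath = os.path.join(_data_root, 'adult.data')
os.makedirs(_data_root, exist_ok=True)

# Training split of the Census Income (Adult) dataset from the UCI repository.
urllib.request.urlretrieve(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data',
    _data_filepath)

# Peek at the first few rows. The raw UCI file has no header row, so the
# lab's copy of the CSV may already have column names prepended.
pd.read_csv(_data_filepath, header=None).head()
```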

TFX Pipeline

Create the InteractiveContext

As usual, you will initialize the pipeline and use a local SQLite file for the metadata store.
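A minimal sketch, assuming a hypothetical _pipeline_root directory (with no explicit metadata config, the context creates the SQLite metadata store under the pipeline root):

```python
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# Hypothetical pipeline root; the SQLite metadata file is created under it
# because no explicit metadata connection config is passed.
_pipeline_root = './pipeline'
context = InteractiveContext(pipeline_root=_pipeline_root)
```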

ExampleGen

You will start by ingesting the data through CsvExampleGen. The code below uses the default 2:1 train-eval split (i.e. 33% of the data goes to eval), but feel free to modify it if you want. You can review splitting techniques here.
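A sketch of the component wiring with an explicit 2:1 split, assuming the CSV lives under a hypothetical _data_root directory:

```python
from tfx.components import CsvExampleGen
from tfx.proto import example_gen_pb2

# hash_buckets act as proportional weights: 2 buckets for train, 1 for eval.
output_config = example_gen_pb2.Output(
    split_config=example_gen_pb2.SplitConfig(splits=[
        example_gen_pb2.SplitConfig.Split(name='train', hash_buckets=2),
        example_gen_pb2.SplitConfig.Split(name='eval', hash_buckets=1),
    ]))

example_gen = CsvExampleGen(input_base=_data_root, output_config=output_config)
context.run(example_gen)
```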

StatisticsGen

You will then compute the statistics so they can be used by the next components.

You can look at the visualizations below if you want to explore the data some more.
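A minimal sketch of this step:

```python
from tfx.components import StatisticsGen

statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)

# Render the TFDV visualizations for the computed statistics.
context.show(statistics_gen.outputs['statistics'])
```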

SchemaGen

You can then infer the dataset schema with SchemaGen. This will be used to validate incoming data to ensure that it is formatted correctly.

For simplicity, you will just accept the inferred schema but feel free to modify with the TFDV API if you want.
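A minimal sketch (infer_feature_shape is set explicitly here; your notebook may use a different value):

```python
from tfx.components import SchemaGen

schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=False)
context.run(schema_gen)

# Display the inferred schema.
context.show(schema_gen.outputs['schema'])
```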

ExampleValidator

Next, run ExampleValidator to check if there are anomalies in the data.
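A minimal sketch:

```python
from tfx.components import ExampleValidator

example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
context.run(example_validator)

# Show any detected anomalies.
context.show(example_validator.outputs['anomalies'])
```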

If you just used the inferred schema, then there should not be any anomalies detected. If you modified the schema, then there might be some results here and you can again use TFDV to modify or relax constraints as needed.

In actual deployments, this component will also help you understand how your data evolves over time and identify data errors. For example, the first batches of data that you get from your users might conform to the schema but it might not be the case after several months. This component will detect that and let you know that your model might need to be updated.

Transform

Now you will perform feature engineering on the training data. As shown when you previewed the CSV earlier, the data is still in raw format and cannot be consumed by the model just yet. The transform code in the following cells will take care of scaling your numeric features and one-hot encoding your categorical variables.

Note: If you're running this exercise for the first time, we advise that you leave the transformation code as is. After you've gone through the entire notebook, then you can modify these for practice and see what results you get. Just make sure that your model builder code in the Trainer component will also reflect those changes if needed. For example, removing a feature here should also remove an input layer for that feature in the model.
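To give a flavor of what the transform module file contains, here is a sketch of a preprocessing_fn with hypothetical feature names; the lab's module defines its own feature lists, usually densifies sparse CSV-derived features first, and one-hot encodes the vocabulary ids:

```python
import tensorflow_transform as tft

# Hypothetical feature names -- the actual module file defines its own lists.
_NUMERIC_FEATURES = ['age', 'hours-per-week']
_CATEGORICAL_FEATURES = ['education', 'occupation']
_LABEL_KEY = 'label'


def preprocessing_fn(inputs):
    """Callback that tf.Transform invokes with a dict of raw feature tensors."""
    outputs = {}
    for key in _NUMERIC_FEATURES:
        # Scale numeric columns to zero mean and unit variance.
        outputs[key + '_xf'] = tft.scale_to_z_score(inputs[key])
    for key in _CATEGORICAL_FEATURES:
        # Map strings to integer vocabulary ids (the lab's module additionally
        # one-hot encodes these ids).
        outputs[key + '_xf'] = tft.compute_and_apply_vocabulary(inputs[key])
    outputs[_LABEL_KEY] = inputs[_LABEL_KEY]
    return outputs
```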

Now, we pass in this feature engineering code to the Transform component and run it to transform your data.
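The component wiring looks roughly like this, assuming the feature engineering code lives in a module file whose path is stored in a hypothetical _transform_module_file variable:

```python
from tfx.components import Transform

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=_transform_module_file)
context.run(transform)
```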

You can see a sample result for one row with the code below. Notice that the numeric features are indeed scaled and the categorical features are now one-hot encoded.
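One way to peek at a single transformed record (a sketch; the split subdirectory name, e.g. Split-train vs. train, depends on your TFX version):

```python
import os
import tensorflow as tf

# Locate the transformed training examples (gzipped TFRecord files).
transformed_uri = transform.outputs['transformed_examples'].get()[0].uri
train_dir = os.path.join(transformed_uri, 'Split-train')
tfrecord_files = [os.path.join(train_dir, name) for name in os.listdir(train_dir)]

# Parse and print one transformed example.
dataset = tf.data.TFRecordDataset(tfrecord_files, compression_type='GZIP')
for record in dataset.take(1):
    example = tf.train.Example()
    example.ParseFromString(record.numpy())
    print(example)
```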

As you already know, the Transform component outputs not only the transformed examples but also the transformation graph. This graph should be applied to all inputs when your model is deployed to ensure that they are transformed the same way as your training data. Otherwise, you can end up with training-serving skew, which leads to noisy predictions.

The Transform component stores related files in its transform_graph output and it would be good to quickly review its contents before we move on to the next component. As shown below, the URI of this output points to a directory containing three subdirectories.
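You can list that directory directly; it typically contains metadata/, transformed_metadata/, and transform_fn/:

```python
import os

transform_graph_uri = transform.outputs['transform_graph'].get()[0].uri
print(os.listdir(transform_graph_uri))
```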

Trainer

Next, you will build the model to make your predictions. As mentioned earlier, this is a binary classifier where the label is 1 if a person earns more than 50k USD and 0 if less than or equal. The model here uses the wide and deep architecture as a reference, but feel free to modify it after you've completed the exercise. Also, for simplicity, the hyperparameters (e.g. number of hidden units) have been hardcoded, but feel free to use a Tuner component as you did in Week 1 if you want to get some practice.

As a reminder, it is best to start from run_fn() when you're reviewing the module file below. The Trainer component looks for that function first. All other functions defined in the module are just utility functions for run_fn().

Another thing you will notice below is the _get_serve_tf_examples_fn() function. This is tied to the serving_default signature, which makes it possible to pass in raw features when the model is served for inference. You saw this in action in the previous lab. It works by decorating the enclosing function, serve_tf_examples_fn(), with tf.function, which indicates that the computation will be done by first tracing a TensorFlow graph. You will notice that this function uses model.tft_layer, which comes from the transform_graph output. When you call the .get_concrete_function() method on this tf.function in run_fn(), you create the graph that will be used in later computations. This graph is used whenever you pass in an examples argument pointing to a Tensor with tf.string dtype, which matches the format of the serialized batches of data you used in the previous lab.
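The pattern looks roughly like this (a sketch adapted from the official tutorial credited above; _LABEL_KEY and the variables used inside run_fn() are placeholders for whatever the lab's module file actually defines):

```python
import tensorflow as tf

_LABEL_KEY = 'label'  # placeholder label name


def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Returns a function that parses serialized tf.Examples and applies TFT."""
    # Attach the Transform graph as a Keras layer so it is saved with the model.
    model.tft_layer = tf_transform_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        feature_spec = tf_transform_output.raw_feature_spec()
        feature_spec.pop(_LABEL_KEY)  # the label is not available at serving time
        parsed_features = tf.io.parse_example(serialized_tf_examples, feature_spec)
        transformed_features = model.tft_layer(parsed_features)
        return model(transformed_features)

    return serve_tf_examples_fn


# Inside run_fn(), the concrete function is registered as the default signature
# (model, tf_transform_output, and fn_args are defined there):
signatures = {
    'serving_default':
        _get_serve_tf_examples_fn(model, tf_transform_output).get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name='examples')),
}
model.save(fn_args.serving_model_dir, save_format='tf', signatures=signatures)
```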

Now, we pass in this model code to the Trainer component and run it to train the model.
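Wiring the component looks roughly like this, assuming a hypothetical _trainer_module_file path and illustrative step counts (recent TFX versions use the generic, run_fn()-based executor by default; older versions need custom_executor_spec):

```python
from tfx.components import Trainer
from tfx.proto import trainer_pb2

trainer = Trainer(
    module_file=_trainer_module_file,                  # contains run_fn()
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=1000),  # illustrative values
    eval_args=trainer_pb2.EvalArgs(num_steps=500))
context.run(trainer)
```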

Let's review the outputs of this component. The model output points to the exported model itself.

The model_run output acts as the working directory and can be used to store non-model artifacts (e.g., TensorBoard logs).
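You can inspect both output URIs directly (exact subdirectory names vary by TFX version):

```python
import os

model_uri = trainer.outputs['model'].get()[0].uri
model_run_uri = trainer.outputs['model_run'].get()[0].uri

print(os.listdir(model_uri))      # the exported SavedModel lives here
print(os.listdir(model_run_uri))  # e.g., TensorBoard event files
```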

Evaluator

The Evaluator component computes model performance metrics over the evaluation set using the TensorFlow Model Analysis library. The Evaluator can also optionally validate that a newly trained model is better than the previous model. This is useful in a production pipeline setting where you may automatically train and validate a model every day.

There are a few steps needed to set up this component, and you will do them in the next cells.

Define EvalConfig

First, you will define the EvalConfig message as you did in the previous lab. You can also set thresholds so that subsequent models can be compared against them. The module below should look familiar. One minor difference is that you don't have to define the candidate and baseline model names in the model_specs; those are detected automatically.
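A sketch of such a config; the label key, slice column, and threshold values are placeholders for whatever your pipeline actually uses:

```python
import tensorflow_model_analysis as tfma

eval_config = tfma.EvalConfig(
    # No candidate/baseline names needed; the Evaluator detects them.
    model_specs=[tfma.ModelSpec(label_key='label')],
    metrics_specs=[
        tfma.MetricsSpec(metrics=[
            tfma.MetricConfig(class_name='ExampleCount'),
            tfma.MetricConfig(
                class_name='BinaryAccuracy',
                threshold=tfma.MetricThreshold(
                    # Absolute floor the candidate must clear...
                    value_threshold=tfma.GenericValueThreshold(
                        lower_bound={'value': 0.5}),
                    # ...and it must not be worse than the baseline.
                    change_threshold=tfma.GenericChangeThreshold(
                        direction=tfma.MetricDirection.HIGHER_IS_BETTER,
                        absolute={'value': -1e-10}))),
        ])
    ],
    slicing_specs=[
        tfma.SlicingSpec(),                      # overall (empty) slice
        tfma.SlicingSpec(feature_keys=['sex']),  # placeholder slice column
    ])
```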

Resolve latest blessed model

As you may remember from the last lab, you were able to validate a candidate model against a baseline by modifying the EvalConfig and EvalSharedModel definitions. That is also possible with the Evaluator component, and you will see how it is done in this section.

The first thing to note is that the Evaluator marks a model as BLESSED if it meets the metric thresholds you set in the eval config module. You can load the latest blessed model by using the LatestBlessedModelStrategy with the Resolver component. This component takes care of finding the latest blessed model for you, so you don't have to keep track of it manually. The syntax is shown below.
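A sketch of that wiring, following the official tutorial (import paths may differ slightly across TFX versions):

```python
from tfx import v1 as tfx

model_resolver = tfx.dsl.Resolver(
    strategy_class=tfx.dsl.experimental.LatestBlessedModelStrategy,
    model=tfx.dsl.Channel(type=tfx.types.standard_artifacts.Model),
    model_blessing=tfx.dsl.Channel(
        type=tfx.types.standard_artifacts.ModelBlessing)
).with_id('latest_blessed_model_resolver')

context.run(model_resolver)
```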

As expected, the search yielded 0 artifacts because you haven't evaluated any models yet. You will run this component again in later parts of this notebook and you will see a different result.

Run the Evaluator component

With the EvalConfig defined and code to load the latest blessed model available, you can proceed to run the Evaluator component.

You will notice that two models are passed into the component. The Trainer output serves as the candidate model, while the output of the Resolver serves as the baseline model. While you can run the Evaluator without comparing two models, that comparison will likely be required in production environments, so it's best to include it. Since the Resolver doesn't have any results yet, the Evaluator will just mark the candidate model as BLESSED in this run.

Aside from the eval config and models (i.e. the Trainer and Resolver outputs), you will also pass in the raw examples from ExampleGen. By default, the component looks for the eval split of these examples, and since you've defined the serving signature, they will be transformed internally before being fed to the model.
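Putting it together, a sketch of the component:

```python
from tfx.components import Evaluator

evaluator = Evaluator(
    examples=example_gen.outputs['examples'],       # raw examples; the eval split is used
    model=trainer.outputs['model'],                  # candidate model
    baseline_model=model_resolver.outputs['model'],  # baseline (empty on the first run)
    eval_config=eval_config)
context.run(evaluator)
```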

Now let's examine the output artifacts of Evaluator.

The blessing output simply states if the candidate model was blessed. The artifact URI will have a BLESSED or NOT_BLESSED file depending on the result. As mentioned earlier, this first run will pass the evaluation because there is no baseline model yet.

The evaluation output, on the other hand, contains the evaluation logs and can be used to visualize the global metrics on the entire evaluation set.
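A quick way to inspect both outputs (a sketch):

```python
import os

# The blessing URI contains a BLESSED or NOT_BLESSED marker file.
blessing_uri = evaluator.outputs['blessing'].get()[0].uri
print(os.listdir(blessing_uri))

# Render global metrics computed over the whole eval split.
context.show(evaluator.outputs['evaluation'])
```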

To see the individual slices, you will need to import TFMA and use the commands you learned in the previous lab.

You can also use TFMA to load the validation results as before by specifying the output URI of the evaluation output. This would be more useful if your model was not blessed because you can see the metric failure prompts. Try to simulate this later by training with fewer epochs (or raising the threshold) and see the results you get here.
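A sketch of both, assuming a hypothetical slice column:

```python
import tensorflow_model_analysis as tfma

evaluation_uri = evaluator.outputs['evaluation'].get()[0].uri

# Per-slice metrics, as in the previous lab ('sex' is a placeholder slice column).
eval_result = tfma.load_eval_result(evaluation_uri)
tfma.view.render_slicing_metrics(eval_result, slicing_column='sex')

# Validation results -- most useful when the model was NOT_BLESSED.
validation_result = tfma.load_validation_result(evaluation_uri)
print(validation_result.validation_ok)
```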

Now that your Evaluator has finished running, the Resolver component should be able to detect the latest blessed model. Let's run the component again.

You should now see an artifact in the component outputs. You can also get it programmatically as shown below.
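A sketch of re-running the Resolver and reading its output (the exact way to read resolved artifacts can vary by TFX version):

```python
# Re-run the resolver so it picks up the newly blessed model...
context.run(model_resolver)

# ...then fetch the resolved Model artifact(s) programmatically.
blessed_models = model_resolver.outputs['model'].get()
print(blessed_models)
```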

Comparing two models

Now let's see how Evaluator compares two models. You will train the same model with more epochs and this should hopefully result in higher accuracy and overall metrics.

You will re-run the evaluator but you will specify the latest trainer output as the candidate model. The baseline is automatically found with the Resolver node.
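Sketched end to end, reusing the imports and variables from the earlier cells (step counts are illustrative):

```python
# Train longer, producing a new candidate model.
trainer_v2 = Trainer(
    module_file=_trainer_module_file,
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=trainer_pb2.TrainArgs(num_steps=5000),
    eval_args=trainer_pb2.EvalArgs(num_steps=500))
context.run(trainer_v2)

# Evaluate the new candidate against the latest blessed model.
evaluator_v2 = Evaluator(
    examples=example_gen.outputs['examples'],
    model=trainer_v2.outputs['model'],               # new candidate
    baseline_model=model_resolver.outputs['model'],  # latest blessed baseline
    eval_config=eval_config)
context.run(evaluator_v2)
```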

Depending on the result, the Resolver should reflect the latest blessed model. Since you trained with more epochs, it is most likely that your candidate model will pass the thresholds you set in the eval config. If so, the artifact URI should be different here compared to your earlier runs.

Finally, the evaluation output of the Evaluator component will now be able to produce the diff results you saw in the previous lab. These signify whether the metrics of the candidate model have indeed improved compared to the baseline. Unlike when using TFMA as a standalone library, visualizing this will only show results for the candidate (i.e. one row instead of two in the tabular output of the visualization below).

Note: You can ignore the warning about failing to find plots.

Congratulations! You can now successfully evaluate your models in a TFX pipeline! This is a critical part of production ML because you want to make sure that subsequent deployments are indeed improving your results. Moreover, you can extract the evaluation results from your pipeline directory for further investigation with TFMA. In the next sections, you will continue to study techniques related to model evaluation and ensuring fairness.