Ungraded Lab: Iterative Schema with TFX and ML Metadata

In this notebook, you will review how to update an inferred schema and save the result to the metadata store used by TFX. As mentioned before, TFX components read information from this database before running executions, so if you curate a schema, you need to save it as an artifact in the metadata store. You will see how that is done in the following exercise.

Afterwards, you will also practice accessing the TFX metadata store and see how you can track the lineage of an artifact.

Setup

Imports

Define paths

For familiarity, you will again be using the Census Income dataset from the previous weeks' ungraded labs. You will use the same paths to your raw data and pipeline files as shown below.

Data Pipeline

Each TFX component you use accepts and generates artifacts, which are instances of the different artifact types TFX has configured in the metadata store. The properties of these instances are shown neatly in a table in the outputs of context.run(). TFX does all of this for you, so you only need to inspect the output of each component to know which property of the artifact you can pass on to the next component (e.g. the outputs['examples'] of ExampleGen can be passed to StatisticsGen).

Since you've already used this dataset before, we will just quickly go over ExampleGen, StatisticsGen, and SchemaGen. The new concepts will be discussed after those components.

Create the Interactive Context

ExampleGen

StatisticsGen

SchemaGen

Curating the Schema

Now that you have the inferred schema, you can proceed to revising it to be more robust. For instance, you can restrict the age as you did in Week 1. First, you have to load the Schema protocol buffer from the metadata store. You can do this by getting the schema URI from the output of SchemaGen and then using TFDV's load_schema_text() method.

With that, you can now make changes to the schema as before. For the purpose of this exercise, you will only modify the age domain but feel free to add more if you want.

Schema Environments

By default, your schema expects all of the features declared above, including the label. However, when the model is served for inference, it will receive datasets that do not have the label, because that is the feature the model is trying to predict. You need to configure the pipeline so it does not raise an alarm when this kind of dataset is received.

You can do that with schema environments. First, you will need to declare training and serving environments, then configure the serving environment to not expect the label. See how it is implemented below.

You can now freeze the curated schema and save it to a local directory.

ImportSchemaGen

Now that the schema has been saved, you need to create an artifact in the metadata store that will point to it. TFX provides the ImportSchemaGen component for importing a curated schema into ML Metadata. You simply need to specify the path of the revised schema file.

If you pass in the component output to context.show(), then you should see the schema.

ExampleValidator

You can then use this new artifact as input to the other components of the pipeline. See how it is used as the schema argument in ExampleValidator below.

Practice with ML Metadata

At this point, take some time to explore the contents of the metadata store saved by your component runs. This will let you practice tracking artifacts and how they relate to each other by looking at artifacts, executions, and events. This skill lets you recover related artifacts even without seeing the code of the training run; all you need is access to the metadata store.

See how the input artifact IDs to an instance of ExampleAnomalies are tracked in the following cells. Since you are working through this notebook, you already know that it uses the output of StatisticsGen for this run as well as the curated schema you imported. However, once you have hundreds of training runs and parameter iterations, it becomes hard to track which is which. That's where the metadata store is useful: since it records information about a specific pipeline run, you can trace the inputs and outputs of a particular artifact.

You will start by setting the connection config to the metadata store.

Next, let's see what artifact types are available in the metadata store.

If you get the artifacts of type Schema, you will see that there are two entries: one is the inferred schema and the other is the curated schema you imported. At the end of this exercise, you can verify that the curated schema is the one used for the ExampleValidator run we will be investigating.

Let's retrieve the first ExampleAnomalies instance, which is the output of ExampleValidator.

You will use the artifact ID to get events related to it. Let's just get the first instance.

As expected, the event type will be an OUTPUT, because this is the output of the ExampleValidator component. Since we want the inputs, we can trace them through the execution ID.

The artifacts marked as INPUT above represent the statistics and schema inputs. We can extract their IDs programmatically like this. You will see that you get the artifact ID of the curated schema you printed out earlier.

Congratulations! You have now completed this notebook on iterative schemas and seen how a curated schema can be used in a TFX pipeline. You were also able to track an artifact's lineage by looking at the artifacts, events, and executions in the metadata store. These skills will come in handy in this week's assignment!