Ungraded Lab: Feature Engineering Pipeline

In this lab, you will continue exploring TensorFlow Transform, this time in the context of a machine learning (ML) pipeline. In production-grade projects, you want to streamline tasks so you can more easily improve your model or find issues that may arise. TensorFlow Extended (TFX) provides components that work together to execute the most common steps in a machine learning project. If you want to dig deeper into the motivations behind TFX and the need for machine learning pipelines, you can read about them in this paper and in this blog post.

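You will build end-to-end pipelines in future courses, but in this one you will only build up to the feature engineering part. Specifically, you will ingest the data with ExampleGen, compute statistics with StatisticsGen, infer a schema with SchemaGen, detect anomalies with ExampleValidator, and engineer features with Transform.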

If several of the steps mentioned above sound familiar, it's because the TFX components that deal with data validation and analysis (i.e. StatisticsGen, SchemaGen, ExampleValidator) use TensorFlow Data Validation (TFDV) under the hood. You're already familiar with this library from the exercises in Week 1, and this week you'll see how it fits within an ML pipeline.

The components you will use are the orange boxes highlighted in the figure below:

Setup

Import packages

Let's begin by importing the required packages and modules. In case you want to replicate this on your local workstation, we used TensorFlow v2.6 and TFX v1.3.0.

Define paths

You will define a few global variables to indicate paths in the local workspace.

Preview the dataset

You will again be using the Census Income dataset from the Week 1 ungraded lab so you can compare the outputs of stand-alone TFDV with those of TFDV running under TFX. As a reminder, the data can be used to predict whether an individual earns more or less than 50,000 US dollars annually. Here is the description of the features again:

Create the Interactive Context

When pushing to production, you want to automate the pipeline execution using orchestrators such as Apache Beam and Kubeflow. You will not be doing that just yet and will instead execute the pipeline from this notebook. When experimenting in a notebook environment, you will be manually executing the pipeline components (i.e. you are the orchestrator). For that, TFX provides the Interactive Context so you can step through each component and inspect its outputs.

You will initialize the InteractiveContext below. This will create a database in the _pipeline_root directory which the different components will use to save or get the state of the component executions. You will learn more about this in Week 3 when we discuss ML Metadata. For now, you can think of it as the data store that makes it possible for the different pipeline components to work together.

Note: You can configure which database to connect to, but for this exercise we will use the default, which is a newly created local SQLite file. You will see a warning after running the cell below; you can safely ignore it.

Run TFX components interactively

With that, you can now run the pipeline interactively. You will see how to do that as you go through the different components below.

ExampleGen

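You will start the pipeline with the ExampleGen component. As the outputs examined later in this section show, it splits the data into training and evaluation sets, converts the rows to TFRecord format, and saves the result gzip-compressed under the pipeline root directory.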

Its constructor takes the path to your data source/directory. In our case, this is the _data_root path. The component supports several data sources such as CSV, TFRecord, and BigQuery. Since our data is a CSV file, we will use CsvExampleGen to ingest the data.

Run the cell below to instantiate CsvExampleGen.

You can execute the component by calling the run() method of the InteractiveContext.
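The steps described so far (creating the InteractiveContext, instantiating CsvExampleGen, and running it) can be sketched as one helper. This assumes TFX v1.3.0 as mentioned in the setup; the helper name run_example_gen is hypothetical, and the imports are placed inside the function only so that the sketch can be defined without TFX installed.

```python
def run_example_gen(pipeline_root, data_root):
    """Create an InteractiveContext and run CsvExampleGen inside it."""
    # Lazy imports: TFX is only needed when the helper is actually called.
    from tfx.components import CsvExampleGen
    from tfx.orchestration.experimental.interactive.interactive_context import (
        InteractiveContext,
    )

    # Without an explicit metadata connection config, the context creates
    # a local SQLite metadata store under pipeline_root.
    context = InteractiveContext(pipeline_root=pipeline_root)

    # input_base is the directory that holds the CSV file(s).
    example_gen = CsvExampleGen(input_base=data_root)

    # Execute the component; the run is recorded in the metadata store.
    context.run(example_gen)
    return context, example_gen
```

In the notebook this would be called as `context, example_gen = run_example_gen(_pipeline_root, _data_root)`.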

You will notice that an output cell with the execution results is displayed automatically. This metadata is recorded in the database created earlier, which allows you to keep track of your project runs. For example, if you run the component again, you will notice the .execution_id incrementing.

The outputs of the components are called artifacts, and you can see an example by navigating through .component.outputs > ['examples'] > Channel > ._artifacts > [0] above. It shows information such as where the converted data is stored (.uri) and the splits generated (.split_names).

You can also examine the output artifacts programmatically with the code below.
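One possible way to do this is sketched below. It assumes the channel exposes a get() method returning its list of artifacts, as in TFX v1.3.0 (other versions may differ); the helper name describe_examples_artifact is hypothetical.

```python
def describe_examples_artifact(component):
    """Return (split_names, uri) of the first 'examples' output artifact."""
    # outputs['examples'] is a channel; get() returns the artifacts it holds.
    artifact = component.outputs['examples'].get()[0]
    # split_names lists the generated splits; uri is the storage directory.
    return artifact.split_names, artifact.uri
```

For example, `split_names, uri = describe_examples_artifact(example_gen)` followed by `print(split_names, uri)`.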

If you're wondering, the number in ./pipeline/CsvExampleGen/examples/{number} is the execution ID associated with that dataset. If you restart the kernel of this workspace and re-run up to this cell, you will notice that a new folder with a different ID is created. This shows that TFX keeps versions of your data so you can roll back if you want to investigate a particular execution.
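To illustrate the versioned layout, the sketch below picks the folder with the highest numeric name from a directory such as ./pipeline/CsvExampleGen/examples. The helper name latest_execution_dir is hypothetical, and it assumes that execution folders are named with plain integers.

```python
import os

def latest_execution_dir(examples_root):
    """Return the subdirectory of examples_root with the highest numeric name."""
    ids = [
        int(name)
        for name in os.listdir(examples_root)
        if name.isdigit() and os.path.isdir(os.path.join(examples_root, name))
    ]
    if not ids:
        return None  # no execution folders found
    return os.path.join(examples_root, str(max(ids)))
```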

As mentioned, the ingested data is stored in the directory shown in the uri field. It is also compressed using gzip and you can verify by running the cell below.
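One simple check, not necessarily the one used in this lab's cell, is to look at the first two bytes of a file: gzip files start with the magic number 0x1f 0x8b. The helper name is_gzip_file is hypothetical.

```python
def is_gzip_file(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, 'rb') as f:
        return f.read(2) == b'\x1f\x8b'
```

You could call this on any file inside one of the split folders under the artifact's uri.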

In a notebook environment, it may be useful to examine a few examples of the data, especially if you're still experimenting. Since the data collection is saved in TFRecord format, you will need to use methods that work with that data type. You will need to unpack the individual examples from the TFRecord file and format them for printing. Let's do that in the following cells:
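A helper along these lines could do the unpacking. This is a sketch that assumes the records are serialized tf.train.Example protos stored with gzip compression; the function name get_records and its parameters are illustrative, and the imports are lazy so the sketch can be defined without TensorFlow installed.

```python
def get_records(tfrecord_paths, num_records):
    """Read up to num_records examples from gzip-compressed TFRecord files."""
    # Lazy imports: TensorFlow and protobuf are only needed when called.
    import tensorflow as tf
    from google.protobuf.json_format import MessageToDict

    dataset = tf.data.TFRecordDataset(tfrecord_paths, compression_type='GZIP')

    records = []
    for raw_record in dataset.take(num_records):
        # Each record is a serialized tf.train.Example proto.
        example = tf.train.Example()
        example.ParseFromString(raw_record.numpy())
        # Convert the proto to a plain dict for easier printing.
        records.append(MessageToDict(example))
    return records
```

The result can then be printed with `pprint.pprint(get_records(paths, 3))`, where paths is the list of TFRecord files in one split folder.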

Now that ExampleGen has finished ingesting the data, the next step is data analysis.

StatisticsGen

The StatisticsGen component computes statistics over your dataset for data analysis, as well as for use in downstream components (i.e. next steps in the pipeline). As mentioned earlier, this component uses TFDV under the hood so its output will be familiar to you.

StatisticsGen takes as input the dataset we just ingested using CsvExampleGen.

You can display the statistics with the show() method.

Note: You can safely ignore the warning shown when running the cell below.

SchemaGen

The SchemaGen component also uses TFDV to generate a schema based on your data statistics. As you've learned previously, a schema defines the expected bounds, types, and properties of the features in your dataset.

SchemaGen will take as input the statistics that we generated with StatisticsGen, looking at the training split by default.

You can then visualize the generated schema as a table.

Let's now move to the next step in the pipeline and see if there are any anomalies in the data.

ExampleValidator

The ExampleValidator component detects anomalies in your data based on the generated schema from the previous step. Like the previous two components, it also uses TFDV under the hood.

ExampleValidator will take as input the statistics from StatisticsGen and the schema from SchemaGen. By default, it compares the statistics from the evaluation split to the schema from the training split.

As with the previous component, you can also visualize the anomalies as a table.
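The wiring of the three TFDV-backed components described so far can be sketched as follows, assuming TFX v1.3.0. The helper name build_validation_components is hypothetical, and example_gen is assumed to be a CsvExampleGen instance that has already been run.

```python
def build_validation_components(example_gen):
    """Wire StatisticsGen, SchemaGen, and ExampleValidator together."""
    # Lazy import: TFX is only needed when the helper is actually called.
    from tfx.components import ExampleValidator, SchemaGen, StatisticsGen

    # Statistics are computed over the ingested examples.
    statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

    # The schema is inferred from the computed statistics.
    schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'])

    # Anomalies are detected by comparing statistics against the schema.
    example_validator = ExampleValidator(
        statistics=statistics_gen.outputs['statistics'],
        schema=schema_gen.outputs['schema'],
    )
    return statistics_gen, schema_gen, example_validator
```

Each component is then executed in order with context.run(...), and its output can be displayed with context.show(...), for example `context.show(example_validator.outputs['anomalies'])`.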

With no anomalies detected, you can proceed to the next step in the pipeline.

Transform

The Transform component performs feature engineering for both training and serving datasets. It uses the TensorFlow Transform library introduced in the first ungraded lab of this week.

Transform will take as input the data from ExampleGen, the schema from SchemaGen, as well as a module containing the preprocessing function.

In this section, you will work on an example of user-defined Transform code. The pipeline needs to load this as a module, so you need to use the magic command %%writefile to save the file to disk. Let's first define a few constants that group the data's attributes according to the transforms we will perform later. This file will also be saved locally.
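A constants module of this kind might look like the sketch below. The feature names follow the column names of the public Census Income dataset, but the constant names, the grouping, the bucket count, and the `_xf` suffix are all assumptions for illustration. In the notebook, the cell would start with a %%writefile line naming the target file.

```python
# Contents of a hypothetical constants module (for example census_constants.py).

# String-valued features to be converted to integer indices.
CATEGORICAL_FEATURE_KEYS = [
    'education', 'marital-status', 'occupation', 'race',
    'relationship', 'workclass', 'sex', 'native-country',
]

# Numeric features to be scaled.
NUMERIC_FEATURE_KEYS = [
    'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week',
]

# Numeric features to be bucketized, with the number of buckets for each.
BUCKET_FEATURE_KEYS = ['age']
FEATURE_BUCKET_COUNT = {'age': 4}

# Name of the label column.
LABEL_KEY = 'label'


def transformed_name(key):
    """Return the name used for the transformed version of a feature."""
    return key + '_xf'
```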

Next, you will work on the module that contains preprocessing_fn(). As you've seen in the previous lab, this function defines how you will transform the raw data into features that your model can train on (i.e. the next step in the pipeline). You will use the tft module functions to make these transformations.
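As a minimal sketch of such a function, the example below applies three tft functions to small, hypothetical feature groups. The key lists and the `_xf` suffix are assumptions, not this lab's actual grouping. A real module file would normally import tensorflow_transform at the top; it is imported inside the function here only so the sketch can be defined without the library installed.

```python
# Hypothetical feature groups; a real module would import these
# from the constants file.
SCALE_0_1_KEYS = ['hours-per-week']
Z_SCORE_KEYS = ['capital-gain']
VOCAB_KEYS = ['education']


def preprocessing_fn(inputs):
    """Map a dict of raw feature tensors to a dict of transformed features."""
    import tensorflow_transform as tft  # lazy import for this sketch only

    outputs = {}

    # Scale numeric features to the range [0, 1].
    for key in SCALE_0_1_KEYS:
        outputs[key + '_xf'] = tft.scale_to_0_1(inputs[key])

    # Standardize numeric features to zero mean and unit variance.
    for key in Z_SCORE_KEYS:
        outputs[key + '_xf'] = tft.scale_to_z_score(inputs[key])

    # Convert strings to integer indices via a vocabulary built over the data.
    for key in VOCAB_KEYS:
        outputs[key + '_xf'] = tft.compute_and_apply_vocabulary(inputs[key])

    return outputs
```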

Note: After completing the entire notebook, we encourage you to go back to this section and try different tft functions aside from the ones already provided below. You can also modify the grouping of the feature keys in the constants file if you want. For example, you may want to scale some features to [0, 1] while others are scaled to the z-score. This will be good practice for this week's assignment.
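To make the difference between the two scalings concrete, here is a plain-Python illustration of the arithmetic. It is not the tft implementation: tft computes the statistics over the whole dataset in a full pass, while this sketch works on a small in-memory list. The function names and sample values are made up, and the use of the population standard deviation is an assumption of the sketch.

```python
import statistics


def min_max_scale(values):
    """Scale values linearly to [0, 1]. Assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


hours = [20, 40, 60]  # made-up sample values
print(min_max_scale(hours))  # [0.0, 0.5, 1.0]
print(z_score_scale(hours))  # symmetric around 0.0
```

Min-max scaling bounds the output but is sensitive to outliers at the extremes, while z-scores are unbounded and centered on zero.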

You can now pass the training data, schema, and transform module to the Transform component. You can ignore the warning messages generated by Apache Beam regarding type hints.
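Instantiating the component could look like the sketch below, assuming TFX v1.3.0. The helper name build_transform is hypothetical; example_gen and schema_gen are assumed to be components that have already been run, and module_file is the path to the saved preprocessing module.

```python
def build_transform(example_gen, schema_gen, module_file):
    """Create a Transform component from upstream outputs and a module file."""
    # Lazy import: TFX is only needed when the helper is actually called.
    from tfx.components import Transform

    return Transform(
        examples=example_gen.outputs['examples'],  # data from ExampleGen
        schema=schema_gen.outputs['schema'],       # schema from SchemaGen
        module_file=module_file,                   # file with preprocessing_fn
    )
```

Passing an absolute path, for example via os.path.abspath, avoids surprises if the working directory changes. The component is then executed with context.run(...).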

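Let's examine the output artifacts of Transform (i.e. .component.outputs from the output cell above). This component produces several outputs, including the transform_graph artifact and the transformed examples, both of which you will look at next.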

Take a peek at the transform_graph artifact. It points to a directory containing three subdirectories.

You can also take a look at the first three transformed examples using the helper function defined earlier.

Congratulations! You have now executed all the components in our pipeline. You will get hands-on practice as well with training and model evaluation in future courses but for now, we encourage you to try exploring the different components we just discussed. As mentioned earlier, a useful exercise for the upcoming assignment is to be familiar with using different tft functions in your transform module. Try exploring the documentation and see what other functions you can use in the transform module. You can also do the optional challenge below for more practice.

Optional Challenge: Using this notebook as a reference, load the Seoul Bike Sharing Demand Dataset and run it through the five stages of the pipeline discussed here. You will first go through the data ingestion and validation components, then study the dataset's features and transform them into a format that a model can consume. Once you're done, you can visit this Discourse topic where one of your mentors, Fabio, has shared his solution. Feel free to discuss and share your solution as well!