Ungraded Lab: Feature Engineering with Images

In this optional notebook, you will look at how to prepare features for an image dataset, in this case CIFAR-10. You will mostly go through the same steps as in the previous labs, but you will need to add parser functions to your transform module to read and convert the data successfully. As with the previous notebooks, we will only go briefly over the early stages of the pipeline so you can focus on the Transform component.

Let's begin!

Imports

Set up pipeline paths
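The cell below is a minimal sketch of how the pipeline paths could be set up. The directory names are assumptions and may differ from the ones defined in your copy of the notebook.

```python
import os

# Assumed locations -- adjust to match the paths defined in your notebook.
_pipeline_root = './pipeline/'                  # where TFX stores artifacts and metadata
_data_root = './data/cifar10'                   # where the TFRecord file will be saved
_data_filepath = os.path.join(_data_root, 'train.tfrecord')

os.makedirs(_data_root, exist_ok=True)
```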

Download example data

We will download the training split of the CIFAR-10 dataset and save it to the _data_filepath. Take note that this is already in TFRecord format so we won't need to convert it when we use ExampleGen later.
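Here is a sketch of the download step. The URL below is only a placeholder; the notebook itself defines the actual hosting location of the TFRecord file.

```python
import urllib.request

# Placeholder URL -- replace with the location given in the notebook.
_DATA_URL = 'https://<hosting-location>/cifar10/train.tfrecord'

# Save the already-serialized TFRecord file to the data filepath.
urllib.request.urlretrieve(_DATA_URL, _data_filepath)
```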

Create the InteractiveContext
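A sketch of creating the context, assuming the _pipeline_root defined earlier:

```python
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

# The InteractiveContext lets you run TFX components one at a time inside the
# notebook and inspect their output artifacts.
context = InteractiveContext(pipeline_root=_pipeline_root)
```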

Run TFX components interactively

ExampleGen

As mentioned earlier, the dataset is already in TFRecord format so, unlike the previous TFX labs, there is no need to convert it when you ingest the data. You can simply import it with ImportExampleGen; the syntax and required modules are shown below.
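A sketch of the ingestion step, assuming the _data_root defined earlier (older TFX versions use a slightly different constructor, passing input=external_input(...) instead of input_base):

```python
from tfx.components import ImportExampleGen

# ImportExampleGen ingests data that is already serialized as TFRecords of
# tf.train.Example, so no conversion step is needed here.
example_gen = ImportExampleGen(input_base=_data_root)
context.run(example_gen)
```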

As usual, this component produces two artifacts, training examples and evaluation examples:
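Here is a sketch of inspecting the examples artifact; the split names and URI show where each split was written.

```python
# Inspect the examples artifact produced by ImportExampleGen.
artifact = example_gen.outputs['examples'].get()[0]
print('split names:', artifact.split_names)
print('artifact uri:', artifact.uri)
```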

You can also take a look at the first three training examples ingested by using the tf.io.parse_single_example() method from the tf.io module. See how it is set up in the cell below.
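Below is a sketch of that cell. The raw feature keys (image_raw and label), the split directory name, and the GZIP compression are assumptions that may need adjusting for your TFX version.

```python
import os
import tensorflow as tf

# Directory holding the ingested train split (named 'train' in older TFX versions).
train_uri = os.path.join(example_gen.outputs['examples'].get()[0].uri, 'Split-train')
tfrecord_filenames = [os.path.join(train_uri, name) for name in os.listdir(train_uri)]

# ExampleGen writes GZIP-compressed TFRecords by default.
dataset = tf.data.TFRecordDataset(tfrecord_filenames, compression_type='GZIP')

# Assumed raw feature keys for the CIFAR-10 TFRecords.
feature_description = {
    'image_raw': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

# Parse and print the first three examples.
for serialized in dataset.take(3):
    example = tf.io.parse_single_example(serialized, feature_description)
    print('label:', example['label'].numpy(),
          '| image bytes:', len(example['image_raw'].numpy()))
```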

StatisticsGen

Next, you will generate the statistics so you can infer a schema in the next step. You can also look at the visualization of the statistics. As you might expect with CIFAR-10, there is a column for the image and another column for the numeric label.
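A sketch of generating and visualizing the statistics:

```python
from tfx.components import StatisticsGen

# Compute statistics over the ingested examples...
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])
context.run(statistics_gen)

# ...and visualize them in the notebook.
context.show(statistics_gen.outputs['statistics'])
```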

SchemaGen

Here, you pass in the statistics to generate the schema. For the version of TFX you are using, you will have to explicitly set infer_feature_shape=True so that downstream TFX components (e.g. Transform) will parse the input as a Tensor and not a SparseTensor. If this is not set, you will run into compatibility issues later when you run the transform.
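A sketch of the schema generation step:

```python
from tfx.components import SchemaGen

# infer_feature_shape=True makes downstream components (e.g. Transform) treat
# the features as dense Tensors instead of SparseTensors.
schema_gen = SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)
context.run(schema_gen)
context.show(schema_gen.outputs['schema'])
```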

ExampleValidator

ExampleValidator is not required but you can still run it just to make sure that there are no anomalies.
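A sketch of running the validator:

```python
from tfx.components import ExampleValidator

# Check the example statistics against the inferred schema for anomalies.
example_validator = ExampleValidator(
    statistics=statistics_gen.outputs['statistics'],
    schema=schema_gen.outputs['schema'])
context.run(example_validator)
context.show(example_validator.outputs['anomalies'])
```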

Transform

To successfully transform the raw images, you need to parse the current bytes format and convert it to a tensor. For that, you can use the tf.image.decode_image() function. The transform module below uses this to convert each image to a (32, 32, 3) float tensor. It also scales the pixel values and converts the labels to one-hot tensors. The output features should then be ready to pass on to a model that accepts this format.
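Below is a minimal sketch of what such a transform module could look like. It assumes the raw feature keys are image_raw and label (matching the transformed names image_raw_xf and label_xf you will see later) and uses tft.scale_to_0_1() for the pixel scaling; the actual module in the lab may differ in detail.

```python
import tensorflow as tf
import tensorflow_transform as tft

# Assumed raw feature keys.
_IMAGE_KEY = 'image_raw'
_LABEL_KEY = 'label'


def _transformed_name(key):
    return key + '_xf'


def _image_parser(image_str):
    """Converts the raw image bytes into a (32, 32, 3) float tensor."""
    image = tf.image.decode_image(image_str, channels=3)
    image = tf.reshape(image, (32, 32, 3))
    return tf.cast(image, tf.float32)


def _label_parser(label_id):
    """Converts the integer label (0-9) into a one-hot tensor."""
    return tf.one_hot(label_id, 10)


def preprocessing_fn(inputs):
    """tf.Transform's callback for preprocessing the raw features."""
    # Inputs arrive with shape (batch, 1), so squeeze out the extra dimension
    # before mapping the per-example parser functions over the batch.
    outputs = {
        _transformed_name(_IMAGE_KEY):
            tf.map_fn(_image_parser,
                      tf.squeeze(inputs[_IMAGE_KEY], axis=1),
                      fn_output_signature=tf.float32),
        _transformed_name(_LABEL_KEY):
            tf.map_fn(_label_parser,
                      tf.squeeze(inputs[_LABEL_KEY], axis=1),
                      fn_output_signature=tf.float32),
    }

    # Scale the pixel values to the [0, 1] range.
    outputs[_transformed_name(_IMAGE_KEY)] = tft.scale_to_0_1(
        outputs[_transformed_name(_IMAGE_KEY)])

    return outputs
```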

Now, you pass this feature engineering code to the Transform component and run it to transform the data.
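A sketch of running the component, assuming the module above was written to a file named cifar10_transform.py (a hypothetical name):

```python
import os
from tfx.components import Transform

# Hypothetical module filename holding the preprocessing_fn above.
_transform_module_file = 'cifar10_transform.py'

transform = Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath(_transform_module_file))
context.run(transform)
```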

Preview the results

Now that the Transform component is finished, you can preview what the transformed images and labels look like. You can use the same sequence and helper function from the previous labs.

You should see from the output of the cell below that the transformed raw image (i.e. image_raw_xf) now has a float array that is scaled from 0 to 1. Similarly, you'll see that the transformed label (i.e. label_xf) is now one-hot encoded.
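Here is a sketch of that preview, assuming the helper simply parses the transformed TFRecords. The split directory name and GZIP compression are assumptions that may differ by TFX version.

```python
import os
import tensorflow as tf

# Locate the transformed train split written by the Transform component.
transformed_uri = os.path.join(
    transform.outputs['transformed_examples'].get()[0].uri, 'Split-train')
filenames = [os.path.join(transformed_uri, f) for f in os.listdir(transformed_uri)]
dataset = tf.data.TFRecordDataset(filenames, compression_type='GZIP')

# Print a few values from the first transformed example.
for serialized in dataset.take(1):
    example = tf.train.Example.FromString(serialized.numpy())
    features = example.features.feature
    print('image_raw_xf (first 5 values):',
          features['image_raw_xf'].float_list.value[:5])
    print('label_xf:', list(features['label_xf'].float_list.value))
```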

Wrap Up

This notebook demonstrates how to do feature engineering with image datasets as opposed to simple tabular data. This should come in handy in your computer vision projects, and you can also try replicating this process with other image datasets from TFDS.