Ungraded Lab: Feature Engineering with Weather Data

In this first exercise on feature engineering with time series data, you will practice data transformation on the Weather Dataset recorded by the Max Planck Institute for Biogeochemistry. You will use tf.Transform here instead of TFX because, as of this version (1.4), it is more straightforward to preserve the sequence of your records with this framework. As you may remember, TFX by default shuffles the data when ingesting through the ExampleGen component, and that is not ideal when preparing data for forecasting applications.

This dataset contains 14 different features, such as air temperature, atmospheric pressure, and humidity, recorded at 10-minute intervals. For this lab, you will use only the data collected between 2009 and 2016. This section of the dataset was prepared by François Chollet for his book Deep Learning with Python.

The table below shows the column names, their value formats, and their descriptions.

| Index | Feature | Format | Description |
|---|---|---|---|
| 1 | Date Time | 01.01.2009 00:10:00 | Date-time reference |
| 2 | p (mbar) | 996.52 | Atmospheric pressure in millibars, the unit typically used in meteorological reports |
| 3 | T (degC) | -8.02 | Air temperature in Celsius |
| 4 | Tpot (K) | 265.4 | Potential temperature in Kelvin |
| 5 | Tdew (degC) | -8.9 | Dew point temperature in Celsius. The dew point is a measure of the absolute amount of water in the air: the temperature at which the air can no longer hold all of its moisture and water condenses. |
| 6 | rh (%) | 93.3 | Relative humidity, a measure of how saturated the air is with water vapor |
| 7 | VPmax (mbar) | 3.33 | Saturation vapor pressure |
| 8 | VPact (mbar) | 3.11 | Actual vapor pressure |
| 9 | VPdef (mbar) | 0.22 | Vapor pressure deficit |
| 10 | sh (g/kg) | 1.94 | Specific humidity |
| 11 | H2OC (mmol/mol) | 3.12 | Water vapor concentration |
| 12 | rho (g/m**3) | 1307.75 | Air density |
| 13 | wv (m/s) | 1.03 | Wind speed |
| 14 | max. wv (m/s) | 1.75 | Maximum wind speed |
| 15 | wd (deg) | 152.3 | Wind direction in degrees |

You will perform data preprocessing so that the features can be used to train an LSTM using TensorFlow and Keras downstream. You will not be asked to train a model as the focus is on feature preprocessing.

Upon completion, you will have cleaned and transformed the weather data with tf.Transform and prepared batched windows of time series features that can be fed to an LSTM.

Install Packages

First, you will install the required packages for this lab. Most are included in the tensorflow_transform package.
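If you are running this in Colab, the cell below shows one way to do this. The version pin follows the tf.Transform version mentioned above (1.4); the exact pin in your copy of the notebook may differ.

```python
# Install tf.Transform; most other requirements (e.g. Apache Beam) come in as
# its dependencies. The version pin is an assumption based on the text above.
!pip install tensorflow_transform==1.4.0
```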

Note: In Google Colab, you need to restart the runtime at this point to finalize updating the packages you just installed. You can do so by clicking the Restart Runtime button at the end of the output cell above (after installation), or by selecting Runtime > Restart Runtime in the menu bar. Please do not proceed to the next section without restarting. You can also ignore the errors about version incompatibility of some of the bundled packages because you won't be using those in this notebook.

Imports

Running the imports below should not show any errors. If any appear, please restart the runtime and re-run the package installation cell above.
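The cell below shows a minimal set of imports covering everything used in this lab; the exact list in your copy of the notebook may differ slightly.

```python
# Core imports for this lab. All of these are either installed above or come
# bundled as dependencies of tensorflow_transform.
import datetime
import os
import tempfile

import apache_beam as beam
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
import tensorflow_transform as tft
import tensorflow_transform.beam as tft_beam

print(f'TensorFlow version: {tf.__version__}')
print(f'TFT version: {tft.__version__}')
```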

Download the Data

Next, you will download the data and put it in your workspace.
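A sketch of the download step is shown below. The URL points to the copy of the Jena climate dataset hosted for the TensorFlow tutorials; the variable names are assumptions.

```python
# Download and extract the Jena climate dataset (2009-2016 subset).
zip_path = tf.keras.utils.get_file(
    origin='https://storage.googleapis.com/tensorflow/tf-keras-datasets/jena_climate_2009_2016.csv.zip',
    fname='jena_climate_2009_2016.csv.zip',
    extract=True)

# The extracted CSV sits next to the zip, minus the .zip extension.
csv_path, _ = os.path.splitext(zip_path)
```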

You can now preview the dataset.
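For example, assuming the csv_path variable from the sketch above:

```python
# Load the raw CSV into a dataframe and peek at the first few rows.
df = pd.read_csv(csv_path)
df.head()
```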

As you can see above, an observation is recorded every 10 minutes. This means that, for a single hour, you will have 6 observations. Similarly, a single day will contain 144 (6x24) observations.

Inspect the Data

You can then inspect the data to see if there are any issues that you need to address before feeding it to your model. First, you will generate some descriptive statistics using the describe() method of the dataframe.
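Assuming the dataframe df loaded above, this is a one-liner; transposing makes the 15 columns easier to scan.

```python
# Summary statistics per column; transpose so each feature is a row.
df.describe().transpose()
```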

You can see that the min value for wv (m/s) and max. wv (m/s) is -9999.0. Those are pretty intense winds and, based on the other data points, these look like faulty measurements. This becomes more pronounced when you visualize the data using the utilities below.

As you can see, there's a very big downward spike towards -9999 for the two features mentioned. You may know of different methods for handling outliers but, for simplicity in this exercise, you will just set these values to 0. You can visualize the plots again after making this change.
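A minimal sketch of this cleanup on the dataframe (the same fix is applied again in clean_fn() later):

```python
# Replace the -9999.0 sentinel values in the wind velocity columns with 0.
df.loc[df['wv (m/s)'] == -9999.0, 'wv (m/s)'] = 0.0
df.loc[df['max. wv (m/s)'] == -9999.0, 'max. wv (m/s)'] = 0.0

# Re-plot to confirm the downward spikes are gone.
df[['wv (m/s)', 'max. wv (m/s)']].plot(subplots=True, figsize=(12, 4))
plt.show()
```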

Take note that you are only cleaning the Pandas dataframe used for visualization here. You will do this simple data cleaning again later when TensorFlow Transform consumes the raw CSV file.

Feature Engineering

Now you will be doing feature engineering. There are several things to note before doing the transformation:

Correlated features

You may want to drop redundant features to reduce the complexity of your model. Let's see which features are highly correlated with each other by plotting the correlation matrix.
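One way to plot it with Matplotlib, assuming the cleaned dataframe df from above (a seaborn heatmap would work just as well):

```python
# Compute pairwise correlations (non-numeric columns such as Date Time are
# ignored) and render the matrix with feature names on both axes.
corr = df.corr()

fig, ax = plt.subplots(figsize=(10, 8))
im = ax.matshow(corr)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
plt.show()
```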

You can observe that Tpot (K), Tdew (degC), VPmax (mbar), VPact (mbar), VPdef (mbar), sh (g/kg) and H2OC (mmol/mol) are highly positively correlated with the target T (degC). Likewise, rho (g/m**3) is highly negatively correlated with the target.

Among the positively correlated features, you can see that VPmax (mbar) is highly correlated with some of the others, such as VPact (mbar), Tdew (degC) and Tpot (K). Hence, for the sake of this exercise, you can drop those features and retain VPmax (mbar).

Distribution of Wind Data

The last column of the data, wd (deg), gives the wind direction in degrees. However, angles in this format do not make good model inputs: 360° and 0° should be close to each other and wrap around smoothly, and direction shouldn't matter when the wind is not blowing. The data will be easier for the model to interpret if you convert the wind direction and velocity columns to a wind vector. Observe how sine and cosine are used to generate the wind vector features (Wx and Wy) in the preprocessing_fn() later.
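To make the idea concrete, here is the same math applied directly to the dataframe with NumPy; the preprocessing_fn() does the equivalent with TensorFlow ops.

```python
# Convert wind direction to radians, then project the velocity onto x/y
# components. Calm winds (wv == 0) map to the zero vector, so direction no
# longer matters when the wind is not blowing, and 0 deg / 360 deg coincide.
wd_rad = df['wd (deg)'] * np.pi / 180.0
df['Wx'] = df['wv (m/s)'] * np.cos(wd_rad)
df['Wy'] = df['wv (m/s)'] * np.sin(wd_rad)
```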

Date Time Feature

When dealing with weather, you can expect patterns that depend on when the measurements were made. For example, temperatures are generally colder at night, and wind velocities might be higher during typhoon season. Thus, the Date Time column is a good input to your model for taking daily and yearly periodicity into account.

To do this, you will first use the datetime Python module to convert the current date time string format (i.e. day.month.year hour:minute:second) to a timestamp with units in seconds. Then, a simple approach to generate a periodic signal is to again use sine and cosine to convert the timestamp to clear "Time of day" (Day sin, Day cos) and "Time of year" (Year sin, Year cos) signals. You will see these conversions in the clean_fn() and preprocessing_fn().

You can see the clean_fn() utility function below that removes wind velocity outliers and converts the date time string to a Unix timestamp.
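As a rough sketch of what that function can look like (assuming each element is a dict keyed by column name, as produced by the CSV reading step later; the notebook's own clean_fn() may differ in detail):

```python
import datetime

def clean_fn(row):
    """Removes wind velocity outliers and converts 'Date Time' to a Unix timestamp."""
    row = dict(row)

    # Replace the -9999.0 sentinel in the wind velocity columns with 0.
    for key in ['wv (m/s)', 'max. wv (m/s)']:
        if row[key] == -9999.0:
            row[key] = 0.0

    # 'day.month.year hour:minute:second' string -> seconds since the epoch.
    dt = datetime.datetime.strptime(row['Date Time'], '%d.%m.%Y %H:%M:%S')
    row['Date Time'] = dt.timestamp()

    return row
```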

Create a tf.Transform preprocessing_fn

With the considerations above, you can now declare your preprocessing_fn(). This will be used by TensorFlow Transform to create a transformation graph that will preprocess model inputs. In a nutshell, your preprocessing function will perform the following steps (a sketch follows the list):

  1. Perform feature selection by deleting the unwanted features.
  2. Transform wind direction and velocity columns into a wind vector.
  3. Convert the timestamp into usable periodic signals, using sin and cos to generate clear "Time of day" and "Time of year" signals.
  4. Normalize the float features.
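Here is a sketch of such a function. The exact set of dropped features and output names below follows the discussion in this notebook, but treat them as assumptions rather than the canonical solution.

```python
import numpy as np
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    outputs = inputs.copy()

    # (1) Feature selection: drop the features highly correlated with VPmax
    # (the notebook may drop additional correlated features here).
    for key in ['Tpot (K)', 'Tdew (degC)', 'VPact (mbar)']:
        outputs.pop(key)

    # (2) Wind direction and velocity -> wind vector (Wx, Wy).
    wv = outputs.pop('wv (m/s)')
    wd_rad = outputs.pop('wd (deg)') * np.pi / 180.0
    outputs['Wx'] = wv * tf.cos(wd_rad)
    outputs['Wy'] = wv * tf.sin(wd_rad)

    # (3) Timestamp (seconds) -> periodic 'Time of day' / 'Time of year' signals.
    timestamp_s = outputs.pop('Date Time')
    day = 24 * 60 * 60          # seconds in a day
    year = 365.2425 * day       # seconds in a year
    outputs['Day sin'] = tf.sin(timestamp_s * (2 * np.pi / day))
    outputs['Day cos'] = tf.cos(timestamp_s * (2 * np.pi / day))
    outputs['Year sin'] = tf.sin(timestamp_s * (2 * np.pi / year))
    outputs['Year cos'] = tf.cos(timestamp_s * (2 * np.pi / year))

    # (4) Normalize the float features to z-scores (applied to every output
    # here for simplicity).
    for key in list(outputs):
        outputs[key] = tft.scale_to_z_score(outputs[key])

    return outputs
```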

Transform the data

You're almost ready to start transforming the data in an Apache Beam pipeline. Before doing so, you will declare just a few more utility functions and variables.

Train Test Split

First, you will define how your dataset will be split. You will use the first 300,000 observations for training and the rest for testing.

You will extract the date time of the 300,000th observation and use it to partition the dataset with beam.Partition(). This transform expects a partition_fn() that returns an integer indicating the partition number. Since you only need two partitions (i.e. train and test), you will make the function return 0 when an element belongs to the train split, and 1 for the test split. See how this is implemented in the cell below.
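A sketch of what this can look like; DATE_BOUNDARY stands for the timestamp of the 300,000th observation and is an assumed variable name.

```python
# DATE_BOUNDARY: Unix timestamp of the 300,000th observation (assumed name);
# everything recorded at or after this instant goes to the test split.
def partition_fn(row, num_partitions):
    # 0 -> train split, 1 -> test split.
    return int(row['Date Time'] >= DATE_BOUNDARY)

# Usage in the pipeline later:
# train_data, test_data = raw_data | beam.Partition(partition_fn, 2)
```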

Declare Schema for Cleaned Data

Just like in the previous labs with TFX, you will want to declare a schema to make sure that your data input is parsed correctly. You can do that with the cell below. Take note that this will be used later, after the data cleaning step. Thus, you can expect the date time feature to already be in seconds and declared as a float feature.
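A sketch of that declaration, assuming the column names of the raw CSV:

```python
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata, schema_utils

# Every column of the cleaned data, including the 'Date Time' timestamp
# (already converted to seconds by clean_fn), is declared as a float feature.
ordered_columns = [
    'Date Time', 'p (mbar)', 'T (degC)', 'Tpot (K)', 'Tdew (degC)', 'rh (%)',
    'VPmax (mbar)', 'VPact (mbar)', 'VPdef (mbar)', 'sh (g/kg)',
    'H2OC (mmol/mol)', 'rho (g/m**3)', 'wv (m/s)', 'max. wv (m/s)', 'wd (deg)',
]

RAW_DATA_FEATURE_SPEC = {
    name: tf.io.FixedLenFeature([], tf.float32) for name in ordered_columns
}

RAW_DATA_METADATA = dataset_metadata.DatasetMetadata(
    schema_utils.schema_from_feature_spec(RAW_DATA_FEATURE_SPEC))
```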

Create the tf.Transform pipeline

Now you can define the TF Transform pipeline. It will follow these major steps (a condensed sketch follows the list):

  1. Read in the data using the CSV reader
  2. Remove outliers and reformat timestamps using the clean_fn.
  3. Partition the dataset into train and test splits using the beam.Partition transform.
  4. Preprocess the data splits using the preprocessing_fn.
  5. Write the result as a TFRecord of Example protos.
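Here is the condensed sketch. Variable names like WORKING_DIR and the simple hand-rolled CSV decoding are assumptions; the notebook's pipeline may organize these steps differently.

```python
import os
import tempfile

WORKING_DIR = 'transform_output'  # assumed output location

def decode_csv(line):
    # Split a raw CSV line into a dict: the date stays a string until clean_fn
    # converts it, while the remaining columns are cast to float.
    values = line.split(',')
    row = {ordered_columns[0]: values[0].strip('"')}
    row.update({name: float(v) for name, v in zip(ordered_columns[1:], values[1:])})
    return row

with beam.Pipeline() as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # (1)-(2) Read the CSV (skipping the header), decode, and clean.
        raw_data = (
            pipeline
            | 'ReadCSV' >> beam.io.ReadFromText(csv_path, skip_header_lines=1)
            | 'DecodeCSV' >> beam.Map(decode_csv)
            | 'CleanData' >> beam.Map(clean_fn))

        # (3) Split into train and test partitions.
        train_data, test_data = (
            raw_data | 'TrainTestSplit' >> beam.Partition(partition_fn, 2))

        # (4) Analyze the train split, then transform both splits with the
        # resulting transform function.
        (transformed_train, transformed_metadata), transform_fn = (
            (train_data, RAW_DATA_METADATA)
            | 'AnalyzeAndTransform' >> tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
        transformed_test, _ = (
            ((test_data, RAW_DATA_METADATA), transform_fn)
            | 'TransformTest' >> tft_beam.TransformDataset())

        # (5) Write both splits as TFRecords of Example protos, plus the
        # transform graph for later use with tft.TFTransformOutput.
        coder = tft.coders.ExampleProtoCoder(transformed_metadata.schema)
        _ = (transformed_train
             | 'EncodeTrain' >> beam.Map(coder.encode)
             | 'WriteTrain' >> beam.io.WriteToTFRecord(
                 os.path.join(WORKING_DIR, 'train'), file_name_suffix='.tfrecord'))
        _ = (transformed_test
             | 'EncodeTest' >> beam.Map(coder.encode)
             | 'WriteTest' >> beam.io.WriteToTFRecord(
                 os.path.join(WORKING_DIR, 'test'), file_name_suffix='.tfrecord'))
        _ = (transform_fn
             | 'WriteTransformFn' >> tft_beam.WriteTransformFn(WORKING_DIR))
```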

Prepare Training and Test Datasets from TFTransformOutput

Now that you have the transformed dataset, you will need to map it to training and test datasets which can be used to train a model with TensorFlow. Since this is time series data, it makes sense to group a fixed-length series of measurements and map it to the label found in a future time step. For example, 3 days of data can be used to predict the next day's humidity. You can use the tf.data.Dataset.window() method to implement these groupings.

In this exercise, you will use data from the last 5 days to predict the temperature 12 hours into the future.
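These requirements translate into a small set of window parameters. The values below are implied by the description here and the batch shapes previewed at the end of the notebook; hourly sampling (one of every six 10-minute records) is an assumption.

```python
OBSERVATIONS_PER_HOUR = 6   # records per hour at 10-minute intervals
HISTORY_SIZE = 5 * 24       # 120 hourly measurements = 5 days of history
FUTURE_TARGET = 12          # predict the temperature 12 hours ahead
BATCH_SIZE = 72             # matches the batch shapes previewed later
SHIFT = 1                   # slide the window one record at a time
```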

Next, you will define several helper functions to extract the data from your transformed dataset and group it into windows. First, this parse_function() will help in getting the transformed features and rearranging them so that the label value (i.e. T (degC)) is at the end of the tensor.
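A sketch of that helper, assuming tf_transform_output is a tft.TFTransformOutput loaded from the pipeline's output directory:

```python
# e.g. tf_transform_output = tft.TFTransformOutput(WORKING_DIR)
feature_spec = tf_transform_output.transformed_feature_spec()
LABEL_KEY = 'T (degC)'

def parse_function(example_proto):
    # Parse one serialized Example using the transformed schema.
    parsed = tf.io.parse_single_example(example_proto, feature_spec)
    # Order the features deterministically and move the label to the end,
    # so it can later be sliced off as the last column.
    label = parsed.pop(LABEL_KEY)
    ordered = [parsed[key] for key in sorted(parsed)] + [label]
    return tf.squeeze(tf.stack(ordered, axis=-1))
```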

Next, you will separate the features and target values into a tuple with this utility function.
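For instance, with the window layout above (features first, label in the last column of the last row):

```python
def map_features_target(window):
    # First HISTORY_SIZE rows are the inputs; the target is the temperature
    # (last column) of the final row, FUTURE_TARGET hours past the history.
    features = window[:HISTORY_SIZE]
    target = window[-1:, -1]
    return features, target
```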

Finally, you can define the dataset window with the function below. It uses the parameters defined above and also the helper functions to produce the batched feature-target mappings.
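A sketch of how the pieces can fit together; the stride argument keeps one record per hour, so each window spans 5 days of history plus the 12-hour horizon.

```python
def get_dataset(path):
    window_size = HISTORY_SIZE + FUTURE_TARGET
    dataset = tf.data.TFRecordDataset(path)
    dataset = dataset.map(parse_function)
    dataset = dataset.window(window_size, shift=SHIFT,
                             stride=OBSERVATIONS_PER_HOUR, drop_remainder=True)
    # Each window is a nested dataset; flatten it into one dense tensor.
    dataset = dataset.flat_map(lambda w: w.batch(window_size))
    dataset = dataset.map(map_features_target)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    return dataset

# Example usage with the sharded TFRecords written by the pipeline:
# train_dataset = get_dataset(tf.io.gfile.glob(os.path.join(WORKING_DIR, 'train*')))
# test_dataset = get_dataset(tf.io.gfile.glob(os.path.join(WORKING_DIR, 'test*')))
```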

You can now use the get_dataset() function on your transformed examples to produce the dataset windows.

Let's preview the resulting shapes of the data windows. If you print the shapes of the tensors in a single batch, you'll notice that it indeed produced the required dimensions. Each batch has 72 examples, each containing 120 measurements of the 13 features in the transformed dataset. The target tensor shape only has one feature per example in the batch, as expected (i.e. only T (degC)).
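For example, assuming the train_dataset produced above:

```python
# Take one batch and confirm the shapes described above.
for features, target in train_dataset.take(1):
    print(f'features shape: {features.shape}')  # expected: (72, 120, 13)
    print(f'target shape: {target.shape}')      # expected: (72, 1)
```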

Wrap Up

In this notebook, you got to see how you may want to prepare seasonal data. It shows how you can handle periodicity and produce batches of dataset windows. You will be doing something similar in the next lab with sensor data. This time though, the measurements are taken at a much higher rate (20 Hz). The labels are also categorical so you will be handling that differently.

On to the next!