Ungraded lab: Manual Feature Engineering


Welcome! In this ungraded lab you will perform feature engineering using TensorFlow and Keras. By developing a deeper understanding of the problem and proposing transformations of the raw features, you will see how the predictive power of your model increases. In particular you will:

  1. Define the model using feature columns.
  2. Use Lambda layers to perform feature engineering on some of these features.
  3. Compare the training history and predictions of the model before and after feature engineering.

Note: This lab has some tweaks compared to the code you just saw in the lectures. The major one is that time-related variables are not used in the feature-engineered model.

Let's get started!

First, install and import the necessary packages, set up the working paths, and download the dataset.

Imports

Load taxifare dataset

For this lab you are going to use a tweaked version of the Taxi Fare dataset, which has been pre-processed and split beforehand.

First, create the directory where the data is going to be saved.
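A minimal sketch of this step, assuming a hypothetical directory name under the system temporary directory (the path the lab actually uses is not shown here):

```python
import os
import tempfile

# Hypothetical location for the CSV files; the lab's real path may differ.
DATA_DIR = os.path.join(tempfile.gettempdir(), "taxifare_data")

# exist_ok=True makes the call safe to re-run if the directory already exists.
os.makedirs(DATA_DIR, exist_ok=True)
```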

Now download the data in csv format from a cloud storage bucket.

Let's check that the files were copied correctly and look like we expect them to.

Everything looks fine. Notice that there are three files, one for each split of training, testing and validation.

Inspect the data

Now take a look at the training data.

The data contains a total of 8 variables.

The fare_amount is the target, the continuous value we’ll train a model to predict. This leaves you with 7 features.

However, this lab focuses on transforming the geospatial features, so the time features hourofday and dayofweek will be ignored.

Create an input pipeline

To load the data for the model you are going to use an experimental feature of TensorFlow that lets you load data directly from a CSV file.

For this you need to define some lists containing relevant information about the dataset, such as the names and types of the columns.
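The idea of separating the target from the features can be sketched without TensorFlow. The snippet below is only an illustration using the standard csv module, not the lab's experimental TensorFlow loader. The constant names LABEL_COLUMN and NUMERIC_COLS, the helper features_and_label, the column order and the sample row are all assumptions made for this example; only the column names come from the text.

```python
import csv
import io

# Assumed names for the bookkeeping lists; the lab may name them differently.
LABEL_COLUMN = "fare_amount"
NUMERIC_COLS = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
    "passenger_count", "hourofday", "dayofweek",
]


def features_and_label(row):
    """Split one parsed CSV row into a feature dict and the target value."""
    features = {name: float(row[name]) for name in NUMERIC_COLS}
    label = float(row[LABEL_COLUMN])
    return features, label


# Made-up sample data (header order and values are hypothetical).
SAMPLE_CSV = (
    "fare_amount,passenger_count,pickup_longitude,pickup_latitude,"
    "dropoff_longitude,dropoff_latitude,hourofday,dayofweek\n"
    "8.5,1,-73.98,40.75,-73.99,40.73,14,3\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, label = features_and_label(rows[0])
```

Here features holds the 7 input columns and label holds the fare, which mirrors the (features, label) pairs a Keras model expects during training.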

Create a DNN Model in Keras

Now you will build a simple neural network. The numerical features are fed in through a DenseFeatures layer (which produces a dense Tensor based on the given features), followed by two dense layers with ReLU activation functions and an output layer with a linear activation function (since this is a regression problem).

Since the model is defined using feature columns, the first layer might look different from what you are used to. It is built by declaring two dictionaries, one for the inputs (defined as Input layers) and one for the features (defined as feature columns).

The DenseFeatures tensor is then computed by passing the feature columns to the constructor of the DenseFeatures layer and calling the resulting layer on the inputs (this is easier to understand with code):

Now build the DNN model and inspect its architecture.

With the model architecture defined, it is time to train it!

Train the model

You are going to train the model for 20 epochs using a batch size of 32.
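To make these two numbers concrete, here is a small arithmetic sketch. It assumes that one epoch is one full pass over the training data, which the lab may define differently, and it uses a made-up training-set size. The helper batches_per_pass is hypothetical.

```python
import math

NUM_EPOCHS = 20  # from the text
BATCH_SIZE = 32  # from the text


def batches_per_pass(num_examples, batch_size=BATCH_SIZE):
    """Number of batches needed to see every example once."""
    return math.ceil(num_examples / batch_size)


# Hypothetical training-set size, chosen only to make the arithmetic concrete.
HYPOTHETICAL_NUM_EXAMPLES = 10_000

steps = batches_per_pass(HYPOTHETICAL_NUM_EXAMPLES)  # 313 batches per pass
total_updates = steps * NUM_EPOCHS                   # 6260 weight updates
```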

Use the previously defined function to load the datasets from the original csv files.

Visualize training curves

Now let's visualize the training history of the model with the raw features:

The training history doesn't look very promising: it shows erratic behaviour. It looks like the training process struggled to traverse the high-dimensional space that the current features create.

Nevertheless, let's use it for prediction.

Notice that the latitude and longitude values should fall within the ranges (37, 45) and (-70, -78) respectively, since these bound the coordinates of New York City.
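A quick sanity check of this kind can be sketched in plain Python. The helper name looks_like_nyc and the sample coordinates are assumptions made for illustration; only the two ranges come from the text.

```python
# Coordinate ranges quoted in the text.
LAT_RANGE = (37.0, 45.0)
LON_RANGE = (-78.0, -70.0)


def looks_like_nyc(lat, lon):
    """Return True if the point falls inside the expected coordinate ranges."""
    lat_ok = LAT_RANGE[0] <= lat <= LAT_RANGE[1]
    lon_ok = LON_RANGE[0] <= lon <= LON_RANGE[1]
    return lat_ok and lon_ok


# Hypothetical pickup point inside the ranges.
valid_point = looks_like_nyc(lat=40.75, lon=-73.98)

# The same numbers with latitude and longitude swapped, a common input mistake.
swapped_point = looks_like_nyc(lat=-73.98, lon=40.75)
```

A check like this catches swapped latitude and longitude values before they reach the model.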

The model predicted this particular ride to be around 12 USD. However, you know the model's performance is not the best, as the training history showed. Let's improve it by using feature engineering.

Improve Model Performance Using Feature Engineering

Going forward you will use only the geospatial features, since the fare depends mostly on the distance travelled:

Since you are dealing exclusively with geospatial data, you will create some transformations that are aware of this geospatial nature. This helps the model build a better representation of the problem at hand.

For instance, the model cannot magically understand what a coordinate is supposed to represent. Since the data is taken from New York only, the latitude and longitude fall within (37, 45) and (-70, -78) respectively, ranges that are arbitrary for the model. A good first step is to scale these values.

Notice all transformations are created by defining functions.
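One way to scale the coordinates is min-max scaling over the ranges quoted above, which maps each value into [0, 1]. The sketch below uses plain Python floats, whereas the lab applies its transformations to tensors. The function names are assumptions and the lab's own definitions may differ.

```python
def scale_latitude(lat):
    """Map a latitude from the range [37, 45] onto [0, 1]."""
    return (lat - 37.0) / 8.0


def scale_longitude(lon):
    """Map a longitude from the range [-78, -70] onto [0, 1]."""
    return (lon + 78.0) / 8.0


# Hypothetical coordinate in the middle of the latitude range.
midpoint = scale_latitude(41.0)  # 0.5
```

Both ranges are 8 degrees wide, so the two functions share the same divisor and differ only in their offset.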

Another important fact is that the fare of a taxi ride is proportional to the distance of the ride. But as the features currently are, there is no way for the model to infer that the pair (pickup_latitude, pickup_longitude) represents the point where the passenger started the ride and the pair (dropoff_latitude, dropoff_longitude) represents the point where the ride ended. More importantly, the model is not aware that the distance between these two points is crucial for predicting the fare.

To solve this, a new feature (which is a transformation of the other ones) that provides this information is required.
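A simple candidate for such a feature is the straight-line (Euclidean) distance between the pickup and dropoff points. The text does not name a specific distance measure, so treating it as Euclidean is an assumption, as are the function name and argument order in this plain-Python sketch.

```python
import math


def euclidean(lon1, lat1, lon2, lat2):
    """Straight-line distance between two points, measured in degrees."""
    lon_diff = lon2 - lon1
    lat_diff = lat2 - lat1
    return math.hypot(lon_diff, lat_diff)


# A 3-4-5 right triangle makes the result easy to verify by hand.
example_distance = euclidean(0.0, 0.0, 3.0, 4.0)  # 5.0
```

This ignores the curvature of the Earth and the street grid, so it is only a rough proxy for the distance actually driven, but over an area the size of one city it still gives the model a single number that grows with trip length.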

Applying transformations

Now you will define the transform function, which applies the previously defined transformation functions. To apply the actual transformations you will use Lambda layers, which apply a function to values (in this case the inputs).
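The shape of such a transform function can be sketched in plain Python as a mapping from a dictionary of raw inputs to a dictionary of engineered features. This is a framework-free stand-in: in the lab each assignment would be a Lambda layer wrapping the corresponding function. The helper definitions, the feature name "euclidean" and the sample values are assumptions made for this example.

```python
import math


def scale_latitude(lat):
    """Map a latitude from [37, 45] onto [0, 1]."""
    return (lat - 37.0) / 8.0


def scale_longitude(lon):
    """Map a longitude from [-78, -70] onto [0, 1]."""
    return (lon + 78.0) / 8.0


def euclidean(lon1, lat1, lon2, lat2):
    """Straight-line distance between two points, in degrees."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


def transform(inputs):
    """Build the engineered features from the four raw coordinates."""
    transformed = {}
    # Scale each coordinate with the function that matches its type.
    for name in ("pickup_latitude", "dropoff_latitude"):
        transformed[name] = scale_latitude(inputs[name])
    for name in ("pickup_longitude", "dropoff_longitude"):
        transformed[name] = scale_longitude(inputs[name])
    # The distance is computed here from the raw (unscaled) coordinates.
    transformed["euclidean"] = euclidean(
        inputs["pickup_longitude"], inputs["pickup_latitude"],
        inputs["dropoff_longitude"], inputs["dropoff_latitude"],
    )
    return transformed


# Hypothetical ride used only to exercise the function.
raw_ride = {
    "pickup_longitude": -73.98, "pickup_latitude": 40.75,
    "dropoff_longitude": -73.99, "dropoff_latitude": 40.73,
}
engineered = transform(raw_ride)
```

The result has five entries: four scaled coordinates plus the distance feature.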

Update the model

Next, you'll create the DNN model, this time with the engineered (transformed) features.

Let's see how the model architecture has changed.

This plot is very useful for understanding the relationships and dependencies between the original and the transformed features!

Notice that the input of the model now consists of 5 features instead of the original 7, thus reducing the dimensionality of the problem.

Let's now train the model that includes feature engineering.

Notice that the features passenger_count, hourofday and dayofweek were excluded since they were omitted when defining the input pipeline.

Now let's visualize the training history of the model with the engineered features.

This looks a lot better than the previous training history! Now the loss and error metrics are decreasing with each epoch and both curves (train and validation) are very close to each other. Nice job!

Let's do a prediction with this new model on the example we previously used.

Wow, now the model predicts a fare that is roughly half of what the previous model predicted! Looks like the model with the raw features was overestimating the fare by a great margin.

Notice that you get a warning because the taxi_ride dictionary contains information about the unused features. You can suppress it by redefining taxi_ride without these values, but it is useful to know that Keras is smart enough to handle it on its own.
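Redefining taxi_ride without the unused values amounts to keeping only the keys the model expects. The sketch below shows this with a dictionary comprehension. The values are made up and stored as plain floats, whereas the lab's taxi_ride may hold tensors, and the names USED_FEATURES and taxi_ride_filtered are assumptions.

```python
# Features the engineered model takes as input (the four raw coordinates).
USED_FEATURES = [
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
]

# Hypothetical example ride that still carries the unused features.
taxi_ride = {
    "pickup_longitude": -73.98, "pickup_latitude": 40.75,
    "dropoff_longitude": -73.99, "dropoff_latitude": 40.73,
    "passenger_count": 1.0, "hourofday": 14.0, "dayofweek": 3.0,
}

# Keep only the entries the model actually consumes.
taxi_ride_filtered = {name: taxi_ride[name] for name in USED_FEATURES}
```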

Congratulations on finishing this ungraded lab! Now you should have a clearer understanding of the importance and impact of performing feature engineering on your data.

This process is very domain-specific and requires a thorough understanding of the situation being modelled. Because of this, new techniques that move from manual to automatic feature engineering have been developed, and you will explore some of them in an upcoming lab.

Keep it up!