Ungraded lab: Algorithmic Dimensionality Reduction


Welcome! During this ungraded lab you are going to apply several algorithms that aim to reduce the dimensionality of data. This topic is very important because models trained on reduced data often outperform those trained on the raw data, since noise and redundant information are present in most datasets. Reducing dimensionality also allows your models to train and make predictions faster, which might be really important depending on the problem you are working on. In particular you will:

  1. Use Principal Component Analysis (PCA) to reduce the dimensionality of a dataset that classifies celestial bodies.
  2. Use Singular Value Decomposition (SVD) to create low-dimensional representations of images of handwritten digits.
  3. Use Non-negative Matrix Factorization (NMF) to segment text into topics.

Let's get started!

Principal Component Analysis - PCA

This is an unsupervised algorithm that creates linear combinations of the original features. PCA is a widely used technique for dimensionality reduction since it is fast and easy to implement. PCA aims to keep as much of the variance of the original data as possible in a lower dimensional space. It finds the axes onto which to project the data so that the variance of the projections is maximized.

In the lecture you saw PCA applied to the Iris dataset. That dataset has been used extensively to showcase PCA, so here you are going to do something different. You are going to use the HTRU_2 dataset, which describes several celestial objects, and the goal is to classify whether an object is a pulsar star or not.

Begin by downloading the dataset:

Load the data into a dataframe for easier inspection:
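
For reference, here is a minimal sketch of these two steps, assuming the dataset has already been downloaded as HTRU_2.csv; the path and the feature column names (everything except "pulsar") are placeholders, so the lab's actual code may differ:

```python
import pandas as pd

# Hypothetical local path and feature names -- the lab provides the actual file.
# The CSV has 8 numerical feature columns plus the binary "pulsar" label.
col_names = [
    "mean_profile", "std_profile", "kurtosis_profile", "skewness_profile",
    "mean_dmsnr", "std_dmsnr", "kurtosis_dmsnr", "skewness_dmsnr", "pulsar",
]
data = pd.read_csv("HTRU_2.csv", names=col_names)
data.head()
```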

This dataset has 8 numerical features (the "pulsar" column is the label). Now you are going to perform PCA to reduce this 8-dimensional input space to a lower-dimensional one.

But first, scale the data. If you do an exploratory analysis of the data you will see that this dataset has a lot of outliers. Because of this you are going to use a RobustScaler, which scales features using statistics that are robust to outliers.
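
A sketch of this step, assuming the dataframe from above is called data:

```python
from sklearn.preprocessing import RobustScaler

# Separate the features from the label, then scale with statistics
# (median and interquartile range) that are robust to outliers
X = data.drop(columns=["pulsar"]).values
y = data["pulsar"].values

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
```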

Now perform PCA using sklearn. In this first iteration you are going to create a principal component for each one of the features so there is no dimensionality reduction:
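
Something along these lines, keeping one component per feature:

```python
from sklearn.decomposition import PCA

# As many components as features, so no information is discarded yet
pca = PCA(n_components=X_scaled.shape[1])
X_pca = pca.fit_transform(X_scaled)

# Fraction of the original variance captured by each component
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```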

Wow! With just 3 components almost all of the variance of the original data is explained! This makes you think that there were some highly correlated features in the original data.

Let's plot the first 3 principal components:
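
A possible way to do this with matplotlib (the colors and figure size are assumptions):

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")

# Color each point by its class: 0 = non-pulsar, 1 = pulsar
ax.scatter(X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], c=y, cmap="coolwarm", alpha=0.5)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_zlabel("PC 3")
plt.show()
```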

You can almost visualize a plane that would be able to separate both classes, since non-pulsars tend to group on the edge of this surface while pulsars are mostly located on its inner side.

In this case it is reasonable to think that the dimension can be reduced even more since with 2 principal components more than 95% of the variance of the original data is explained. Now let's plot just the first two principal components:
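
For example:

```python
import matplotlib.pyplot as plt

# Scatter plot of the first two principal components, colored by class
plt.figure(figsize=(8, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap="coolwarm", alpha=0.5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("First two principal components")
plt.show()
```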

Even in 2D the two classes look linearly separable (not perfectly, of course), which is quite remarkable considering that the initial space was 8-dimensional.

Using PCA you've successfully reduced the dimensionality from 8 to 2 while maintaining a lot of the variance of the original data!

Singular Value Decomposition - SVD

SVD is one way to decompose matrices. Remember that matrices can be seen as linear transformations in space. PCA relies on eigendecomposition, which can only be done for square matrices. You might wonder why the first example worked with PCA if the data had far more observations than features. The reason is that when performing PCA you end up using the matrix product $X^{T}X$, which is a square matrix.

However, you don't always have square matrices, and sometimes you have really sparse ones.

To decompose these kinds of matrices, which can't be handled with eigendecomposition, you can use techniques such as Singular Value Decomposition. SVD factorizes the original data matrix into its constituent matrices, and keeping only the largest singular values yields a lower-dimensional approximation of the data, removing redundant information from the dataset.
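
To make this concrete, here is a small numpy aside (not part of the lab's own code) showing that SVD handles a non-square, sparse matrix and that keeping only the largest singular values gives a low-rank approximation:

```python
import numpy as np

# A non-square, mostly sparse matrix: 6 observations, 4 features
A = np.array([
    [1, 0, 0, 2],
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [0, 4, 0, 0],
    [5, 0, 0, 0],
    [0, 0, 0, 6],
], dtype=float)

# A = U @ diag(s) @ Vt, even though A is not square
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(U.shape, s.shape, Vt.shape)   # (6, 4) (4,) (4, 4)

# Keep only the k largest singular values -> low-rank approximation of A
k = 2
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
```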

To check SVD you are going to use the digits dataset, which is made up of 1797 8x8 images of handwritten digits:
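
Loading it from scikit-learn looks like this:

```python
from sklearn.datasets import load_digits

digits = load_digits()
X_digits = digits.data     # shape (1797, 64): each 8x8 image flattened
y_digits = digits.target
print(X_digits.shape)
```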

Let's continue by normalizing the data and checking its dimensions:
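
A sketch of this step, assuming simple scaling by the maximum pixel intensity (16 in this dataset); the lab's exact normalization may differ:

```python
# Pixel values range from 0 to 16, so dividing by 16 maps them to [0, 1]
X_norm = X_digits / 16.0
print(X_norm.shape)   # (1797, 64)
```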

Plot the first digit to check how normalization affects the images:
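
For example:

```python
import matplotlib.pyplot as plt

# Reshape the first flattened sample back into an 8x8 image
plt.imshow(X_norm[0].reshape(8, 8), cmap="gray")
plt.title(f"Label: {y_digits[0]}")
plt.axis("off")
plt.show()
```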

The image should be identical to the one without normalization. This is because the brightness of each pixel relative to the others is maintained. Normalization is usually done as a preprocessing step before feeding the data into a Neural Network; it is done here because it is a step that is almost always performed when working with image data.

Now perform SVD on the data and plot the cumulative amount of explained variance for every number of components. Note that TruncatedSVD needs a number of components strictly lower than the number of features.
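
A possible implementation, using the maximum allowed 63 components:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import TruncatedSVD

# 63 is the maximum allowed, since n_components must be strictly lower than 64 features
svd = TruncatedSVD(n_components=63)
svd.fit(X_norm)

# Cumulative explained variance as a function of the number of components
cumulative = np.cumsum(svd.explained_variance_ratio_)
plt.plot(range(1, 64), cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance")
plt.grid(True)
plt.show()
```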

Looking at the plot, you can see that with only 5 components nearly 50% of the variance of the original data is explained.

Let's double check this:
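
Using the svd fitted above:

```python
# Fraction of the original variance explained by the first 5 components
print(svd.explained_variance_ratio_[:5].sum())
```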

It is not a lot, but let's check what you get when performing SVD with only 5 components:
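
For instance:

```python
from sklearn.decomposition import TruncatedSVD

# Fit a new TruncatedSVD that keeps only 5 components
svd_5 = TruncatedSVD(n_components=5)
X_svd_5 = svd_5.fit_transform(X_norm)
print(X_svd_5.shape)   # (1797, 5): each digit is now a 5-dimensional vector
```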

By doing this you are now representing each digit using 5 dimensions instead of the original 64! Isn't that amazing?

Now check what this looks like visually:
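
One way to do this is to map the 5-dimensional vectors back to pixel space with inverse_transform:

```python
import matplotlib.pyplot as plt

# Project the compressed representation back into the original 64-dimensional space
X_reconstructed = svd_5.inverse_transform(X_svd_5)

plt.imshow(X_reconstructed[0].reshape(8, 8), cmap="gray")
plt.title("First digit reconstructed from 5 components")
plt.axis("off")
plt.show()
```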

It looks blurry but you can still tell this is a zero.

Using more components

Let's try again, but this time using half the number of features of the original data (32 components).

But first define a function that performs this process for any number of components:
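
A sketch of such a function; the name reconstruct_digit and its signature are assumptions, not the lab's actual helper:

```python
from sklearn.decomposition import TruncatedSVD

def reconstruct_digit(data, n_components, index=0):
    """Compress `data` to `n_components` dimensions with TruncatedSVD
    and return the 8x8 reconstruction of the image at `index`."""
    svd = TruncatedSVD(n_components=n_components)
    reduced = svd.fit_transform(data)
    reconstructed = svd.inverse_transform(reduced)
    return reconstructed[index].reshape(8, 8)
```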

Use the function to generate the image that uses 32 components:
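
For example:

```python
import matplotlib.pyplot as plt

plt.imshow(reconstruct_digit(X_norm, 32), cmap="gray")
plt.title("First digit reconstructed from 32 components")
plt.axis("off")
plt.show()
```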

Wow! This image looks very similar to the original one (no wonder, since more than 95% of the original variance is explained), but the dimensionality of the representations has been cut in half!

To better grasp what the images look like depending on the dimensionality of the representations, the next cell plots them side by side (the last figure has a parameter that you can tweak):
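
A possible version of that cell, reusing the helper defined above; the list of component counts (especially the last value) is the parameter to tweak:

```python
import matplotlib.pyplot as plt

component_counts = [32, 5, 1]   # tweak the last value and re-run

fig, axes = plt.subplots(1, len(component_counts) + 1, figsize=(12, 4))
axes[0].imshow(X_norm[0].reshape(8, 8), cmap="gray")
axes[0].set_title("Original (64)")
axes[0].axis("off")
for ax, n in zip(axes[1:], component_counts):
    ax.imshow(reconstruct_digit(X_norm, n), cmap="gray")
    ax.set_title(f"{n} components")
    ax.axis("off")
plt.show()
```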

Notice how with 1 component it is not possible to determine that the image is a zero. What is the minimum number of components needed for this? Be sure to try out different values and see what you get!

Non-negative Matrix Factorization - NMF

NMF expresses samples as combinations of interpretable parts. For example, it represents documents as combinations of topics, and images in terms of commonly occurring visual patterns. NMF, like PCA, is a dimensionality reduction technique. In contrast to PCA, however, NMF models are interpretable. This means NMF models are easier to understand and much easier to explain to others. NMF can't be applied to every dataset, however: it requires the sample features to be non-negative, i.e. greater than or equal to 0.

To test NMF you will use the 20newsgroups dataset which comprises around 12000 newsgroups posts on 20 topics.
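
A sketch of the download step; stripping headers, footers and quotes is an assumption (so that the topics come from the body of the posts rather than metadata), and the lab may fetch the data slightly differently:

```python
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(
    subset="train",
    remove=("headers", "footers", "quotes"),
)
documents = newsgroups.data   # a plain Python list of raw text posts
print(len(documents))
```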

At this point you have the data in a list format. Let's check it out:
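
For example:

```python
# Inspect the type of the container and the raw text of the first post
print(type(documents))
print(documents[0])
```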

Notice that you only have the actual text, without any information about the topic each post belongs to (i.e. no labels).

Now you need to represent the text as vectors; for this you will use a TfidfVectorizer with max_features set to 500. This will be the original dimensionality of the data (which you will reduce via NMF).
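
Something like the following (removing English stop words is an assumption):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only the 500 most frequent terms across the corpus
vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
X_tfidf = vectorizer.fit_transform(documents)
print(X_tfidf.shape)   # (n_documents, 500)
```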

Every one of the texts in the original data is represented as a 1x500 vector.

Now use NMF to reduce this dimensionality:
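
A sketch of this step; n_comp = 5 is an assumption, chosen to be consistent with the topics discussed below:

```python
from sklearn.decomposition import NMF

n_comp = 5   # number of topics to extract (an assumption -- tweak as you like)

nmf = NMF(n_components=n_comp, random_state=42)
X_nmf = nmf.fit_transform(X_tfidf)
print(X_nmf.shape)   # (n_documents, n_comp)
```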

Now every data point is being represented by a vector of n_comp dimensions rather than the original 500!

In this case every component represents a topic and each data point is represented as a combination of those topics. The value for each topic can be interpreted as how strong the relationship between the text and that particular topic is.

Check this for the 1st element of the text data:
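
For example:

```python
# Topic weights for the first document: larger values indicate a
# stronger relationship between the text and that topic
print(X_nmf[0])
```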

It looks like this text can be expressed as a combination of the first, fourth and fifth topics, especially the latter two.

At this point you might wonder what these topics are. Since no labels were provided, the topics arose from the data itself. To get a sense of what they represent, plot the top 20 words for each topic:
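
A possible way to produce that plot (on scikit-learn versions older than 1.0, use get_feature_names instead of get_feature_names_out):

```python
import numpy as np
import matplotlib.pyplot as plt

feature_names = np.array(vectorizer.get_feature_names_out())

fig, axes = plt.subplots(1, n_comp, figsize=(4 * n_comp, 6))
for topic_idx, (topic, ax) in enumerate(zip(nmf.components_, axes)):
    top = topic.argsort()[-20:]          # indices of the 20 heaviest words
    ax.barh(feature_names[top], topic[top])
    ax.set_title(f"Topic {topic_idx + 1}")
plt.tight_layout()
plt.show()
```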

Let's try to summarize each topic based on the top most common words for each one:

This makes sense considering the example with the first element of the text data. That text is mostly about cars (sports) and information.

Pretty cool, right?

The following function condenses the code used previously so you can play around with different numbers of components:
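
A sketch of such a function, reusing the assumptions made earlier (500 tf-idf features, English stop words removed); the name fit_topics is hypothetical:

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

def fit_topics(documents, n_comp, n_top_words=20):
    """Vectorize the documents, fit NMF with n_comp topics and return
    the document-topic matrix together with the top words per topic."""
    vectorizer = TfidfVectorizer(max_features=500, stop_words="english")
    tfidf = vectorizer.fit_transform(documents)

    nmf = NMF(n_components=n_comp, random_state=42)
    doc_topics = nmf.fit_transform(tfidf)

    words = np.array(vectorizer.get_feature_names_out())
    top_words = [words[topic.argsort()[-n_top_words:]][::-1] for topic in nmf.components_]
    return doc_topics, top_words

# Example: try a different number of topics
doc_topics, top_words = fit_topics(documents, n_comp=8)
```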

Congratulations on finishing this ungraded lab! Now you should have a clearer understanding of how to implement dimensionality reduction techniques.

The great thing about dimensionality reduction algorithms is that aside from making training and predicting faster, they perform some kind of automatic feature engineering by transforming the raw data into more meaningful representations.

Keep it up!