Ungraded Lab: Building ML Pipelines with Kubeflow

In this lab, you will have some hands-on practice with Kubeflow Pipelines. As mentioned in the lectures, modern ML engineering is moving towards pipeline automation for rapid iteration and experiment tracking. This is especially useful in production deployments where models need to be frequently retrained to catch trends in newer data.

Kubeflow Pipelines is one component of the Kubeflow suite of tools for machine learning workflows. It is deployed on top of a Kubernetes cluster and provides an infrastructure for orchestrating ML pipelines and monitoring the inputs and outputs of each component. You will use this tool on Google Cloud Platform in the first assignment this week, and this lab will help prepare you for that by exploring its features on a local deployment. In particular, you will set up a local Kubeflow Pipelines deployment, build pipeline components with the Python SDK, and compile, upload, and run a pipeline from the UI.

Let's begin!

Setup

You will need these tools installed on your local machine to complete the exercises:

  1. Docker - platform for building and running containerized applications. You should already have this installed from the previous ungraded labs. If not, you can see the instructions here. If you are using Docker Desktop (Mac or Windows), you may need to increase the resource limits to start Kubeflow Pipelines later. You can click on the Docker icon in your task bar, choose Preferences, and adjust the CPUs to 4, Storage to 50GB, and the memory to at least 4GB (8GB recommended). Just make sure you are not maxing out any of these limits (i.e. the slider should ideally be at the midpoint or less) since that can make your machine slow or unresponsive. If you're constrained on resources, don't worry. You can still use this notebook as a reference since we'll show the expected outputs at each step. The important thing is to become familiar with Kubeflow Pipelines before you get more hands-on in the assignment.

  2. kubectl - tool for running commands on Kubernetes clusters. This should also be installed from the previous labs. If not, please see the instructions here.

  3. kind - a tool for running local Kubernetes clusters using Docker. Please follow the instructions here to install kind and create a local cluster. (NOTE: This lab currently does not support Kubernetes v1.22 and above. Please check the default Kubernetes image used by the kind version you are about to download here. If it is using v1.22 or higher, consider downloading an older version or using the --image flag when creating the cluster (e.g. kind create cluster --image=kindest/node:v1.19.1). After creating the cluster, you can check the Kubernetes version with the command kubectl version. This lab was tested using kind v0.9 running Kubernetes v1.19.1.)

  4. Kubeflow Pipelines (KFP) - a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers. Once you've created a local cluster using kind, you can deploy Kubeflow Pipelines with these commands. (NOTE: This lab was tested using KFP v1.7.0).

export PIPELINE_VERSION=1.7.0
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION&timeout=300"
kubectl wait --for condition=established --timeout=300s crd/applications.app.k8s.io
kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION&timeout=300"

You can enter the commands above one line at a time. These will set up all the deployments and spin up the pods for the entire application, all of which will be found in the kubeflow namespace. After sending the last command, it will take some time (around 30 minutes) for all the deployments to be ready. You can run the command kubectl get deploy -n kubeflow a few times to check the status. You should see all deployments with the READY status before you proceed to the next section.

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
cache-deployer-deployment         1/1     1            1           21h
cache-server                      1/1     1            1           21h
metadata-envoy-deployment         1/1     1            1           21h
metadata-grpc-deployment          1/1     1            1           21h
metadata-writer                   1/1     1            1           21h
minio                             1/1     1            1           21h
ml-pipeline                       1/1     1            1           21h
ml-pipeline-persistenceagent      1/1     1            1           21h
ml-pipeline-scheduledworkflow     1/1     1            1           21h
ml-pipeline-ui                    1/1     1            1           21h
ml-pipeline-viewer-crd            1/1     1            1           21h
ml-pipeline-visualizationserver   1/1     1            1           21h
mysql                             1/1     1            1           21h
workflow-controller               1/1     1            1           21h

When everything is ready, you can run the following command to access the ml-pipeline-ui service.

kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

The terminal should respond with something like this:

Forwarding from 127.0.0.1:8080 -> 3000
Forwarding from [::1]:8080 -> 3000

You can then open your browser and go to http://localhost:8080 to see the user interface.

kfp ui

Operationalizing your ML Pipelines

As you know, generating a trained model involves executing a sequence of steps. Here is a high level overview of what these steps might look like:

highlevel.jpg

Think back to the very first model you ever built; more likely than not, your code followed a similar flow. In essence, building an ML pipeline mainly involves implementing these steps, but you will need to optimize your operations to deliver value to your team. Platforms such as Kubeflow help you build ML pipelines that are automated, reproducible, and easily monitored. You will see this as you build your pipeline in the sections below.

Pipeline components

The main building blocks of your ML pipeline are referred to as components. In the context of Kubeflow, these are containerized applications that run a specific task in the pipeline. Moreover, these components generate and consume artifacts from other components. For example, a download task will generate a dataset artifact and this will be consumed by a data splitting task. If you go back to the simple pipeline image above and describe it using tasks and artifacts, it will look something like this:

img/simple_dag.jpg

This relationship between tasks and their artifacts is what constitutes a pipeline, and it forms a directed acyclic graph (DAG).

Kubeflow Pipelines lets you create components either by building the component specification directly or through Python functions. For this lab, you will use the latter since it is more intuitive and allows for quick iteration. As you gain more experience, you can explore building the component specification directly, especially if you want to use languages other than Python.

You will begin by installing the Kubeflow Pipelines SDK. Remember to restart the runtime to load the newly installed modules in Colab.
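In a Colab cell, the install is a one-liner (shown here without a version pin; the lab notebook may pin a specific kfp release):

# Install the Kubeflow Pipelines SDK, then restart the runtime so the new modules are picked up
!pip install kfp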

Note: Please do not proceed to the next steps without restarting the Runtime after installing kfp. You can do that by either pressing the Restart Runtime button at the end of the cell output above, or going to the Runtime button at the Colab toolbar above and selecting Restart Runtime.

Now you will import the modules you will be using to construct the Kubeflow pipeline. You will learn more about what these are for in the next sections.
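If you are recreating the notebook yourself, the imports look roughly like this (a sketch based on the KFP v2 SDK syntax used throughout this lab; your notebook may import a few additional names):

# Kubeflow Pipelines SDK (v2 DSL)
import kfp
from kfp.v2 import dsl, compiler
from kfp.v2.dsl import component, Input, Output, Dataset, Model, Metrics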

In this lab, you will build a pipeline to train a multi-output model on the Energy Efficiency dataset from the UCI Machine Learning Repository. It uses the building features (e.g. wall area, roof area) as inputs and has two outputs: Cooling Load and Heating Load. You will follow the five-task graph above with some slight differences in the generated artifacts.

You will now build the component to load your data into the pipeline. The code is shown below and we will discuss the syntax in more detail after running it.

When building a component, it's good to determine first its inputs and outputs.
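Here is a sketch of what the data-loading component can look like. The url and output_csv parameters, the pandas/openpyxl packages, and the download_data name all come from the discussion in this section, but the output component filename and the CSV conversion details are illustrative and may differ from the lab's exact code.

from kfp.v2.dsl import component, Output, Dataset  # as imported above

@component(
    packages_to_install=['pandas', 'openpyxl'],
    output_component_file='download_data_component.yaml'
)
def download_data(url: str, output_csv: Output[Dataset]):
    """Downloads the Excel file from the given URL and saves it as a CSV artifact."""
    import pandas as pd

    # openpyxl is needed to parse the .xlsx file
    df = pd.read_excel(url)

    # write to the path Kubeflow allocates for the output artifact
    df.to_csv(output_csv.path, index=False)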

The inputs and outputs are declared as parameters in the function definition. As you can see in the code, we defined a url parameter with a str type and an output_csv parameter with an Output[Dataset] type.

Lastly, you'll need to use the component decorator to specify that this is a Kubeflow Pipeline component. The documentation shows several parameters you can set, and two of them are used in the code above. As the name suggests, the packages_to_install argument declares any extra packages outside the base image that are needed to run your code. As of this writing, the default base image is python:3.7, so you'll need pandas and openpyxl to load the Excel file.

The output_component_file argument names an output file that will contain the specification for your newly built component. You should see it in the Colab file explorer once you've run the cell above. You'll see your code there along with other settings that pertain to your component. You can use this file when building other pipelines if necessary, so you don't have to rewrite your code in a notebook for your next project as long as you have this YAML file. You can also pass it to your team members or use it on another machine. Kubeflow also hosts other reusable modules in their repo here. For example, if you want a file downloader component in one of your projects, you can load the component from that repo using the load_component_from_url function as shown below. The YAML file of that component should tell you the inputs and outputs so you can use it accordingly.

web_downloader_op = kfp.components.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/master/components/web/Download/component-sdk-v2.yaml')

Next, you will build the next component in the pipeline. Like in the previous step, you should design it first with inputs and outputs in mind. You know that the input of this component will come from the artifact generated by the download_data() function above. To declare input artifacts, you can annotate your parameter with the Input[Dataset] data type as shown below. For the outputs, you want to have two: train and test datasets. You can see the implementation below:
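A sketch of such a splitting component is shown here. The parameter names (input_csv, train_csv, test_csv), the 80/20 split, and the use of scikit-learn are illustrative assumptions, not necessarily the lab's exact code.

from kfp.v2.dsl import component, Input, Output, Dataset  # as imported above

@component(
    packages_to_install=['pandas', 'scikit-learn'],
    output_component_file='split_data_component.yaml'
)
def split_data(input_csv: Input[Dataset], train_csv: Output[Dataset], test_csv: Output[Dataset]):
    """Splits the downloaded dataset into train and test CSV artifacts."""
    import pandas as pd
    from sklearn.model_selection import train_test_split

    # consume the artifact produced by the download_data() component
    df = pd.read_csv(input_csv.path)

    # hold out 20% of the rows for testing
    train, test = train_test_split(df, test_size=0.2)

    train.to_csv(train_csv.path, index=False)
    test.to_csv(test_csv.path, index=False)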

Building and Running a Pipeline

Now that you have at least two components, you can try building a pipeline just to quickly see how it works. The code is shown below. Basically, you just define a function with the sequence of steps then use the dsl.pipeline decorator. Notice in the last line (i.e. split_data_task) that to get a particular artifact from a previous step, you will need to use the outputs dictionary and use the parameter name as the key.
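For reference, a sketch of such a pipeline function is shown here. The pipeline name, the my_pipeline function name, and the url pipeline parameter are illustrative; split_data_task and the outputs['output_csv'] lookup follow the naming discussed above.

from kfp.v2 import dsl  # uses the download_data and split_data components defined earlier

@dsl.pipeline(
    name='download-and-split-pipeline',
)
def my_pipeline(url: str):
    # each call to a component creates a task in the DAG
    download_data_task = download_data(url=url)

    # fetch the upstream artifact via the outputs dictionary, keyed by the parameter name
    split_data_task = split_data(input_csv=download_data_task.outputs['output_csv'])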

To generate your pipeline specification file, you need to compile your pipeline function using the Compiler class as shown below.
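A minimal sketch of that step, assuming the kfp.v2 compiler imported earlier (the exact compiler class and arguments can vary between SDK releases):

from kfp.v2 import compiler  # already imported above

# compile the pipeline function into a specification file that you can upload to the KFP UI
compiler.Compiler().compile(
    pipeline_func=my_pipeline,
    package_path='pipeline.yaml'
)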

After running the cell, you'll see a pipeline.yaml file in the Colab file explorer. Please download that because it will be needed in the next step.

You can run a pipeline programmatically or from the UI. For this exercise, you will do it from the UI and you will see how it is done programmatically in the Qwiklabs later this week.

Please go back to the Kubeflow Pipelines UI and click Upload Pipelines from the Pipelines page.

upload.png

Next, select Upload a file and choose the pipeline.yaml you downloaded earlier then click Create. This will open a screen showing your simple DAG (just two tasks).

dag_kfp.png

Click Create Run, then scroll to the bottom to input the URL of the Excel file: https://archive.ics.uci.edu/ml/machine-learning-databases/00242/ENB2012_data.xlsx . Then click Start.

url.png

Select the topmost entry in the Runs page and you should see the progress of your run. You can click on the download-data box to see more details about that particular task (i.e. the URL input and the container logs). After it turns green, you should also see the output artifact and you can download it if you want by clicking the minio link.

progress.png

Eventually, both tasks will turn green indicating that the run completed successfully. Nicely done!

Generate the rest of the components

Now that you've seen a sample workflow, you can build the rest of the components for preprocessing, model training, and model evaluation. The functions will be longer because the task is more complex. Nonetheless, it follows the same principles as before such as declaring inputs and outputs, and specifying the additional packages.

In the eval_model() function, you'll notice the use of log_metric() to record the results. You'll see these in the Visualizations tab of that task after it has completed.
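For reference, the evaluation component's use of the Metrics artifact can look something like the sketch below. The parameter names, the Y1/Y2 target columns, and the TensorFlow/Keras calls are illustrative assumptions; the lab's actual implementation may differ.

from kfp.v2.dsl import component, Input, Output, Dataset, Model, Metrics  # as imported above

@component(
    packages_to_install=['pandas', 'tensorflow'],
    output_component_file='eval_model_component.yaml'
)
def eval_model(test_data: Input[Dataset], model: Input[Model], metrics: Output[Metrics]):
    """Evaluates the trained model on the test split and logs the results."""
    import pandas as pd
    import tensorflow as tf

    # load the test split and the trained model produced by the upstream tasks
    df = pd.read_csv(test_data.path)
    keras_model = tf.keras.models.load_model(model.path)

    # separate the building features from the two regression targets
    x_test = df.drop(['Y1', 'Y2'], axis=1)
    y_test = df[['Y1', 'Y2']]

    # evaluate and record each result; these appear in the task's Visualizations tab
    results = keras_model.evaluate(x_test, y_test, return_dict=True)
    for name, value in results.items():
        metrics.log_metric(name, float(value))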

Build and run the complete pipeline

You can then build and run the entire pipeline as you did earlier. It will take around 20 minutes for all the tasks to complete and you can see the Logs tab of each task to see how it's going. For instance, you can see there the model training epochs as you normally see in a notebook environment.

After you've uploaded and run the entire pipeline, you should see all green boxes and the training metrics in the Visualizations tab of the eval-model task.

./img/complete_pipeline.png

Tear Down

If you're done experimenting with the software and want to free up resources, you can execute the commands below to delete Kubeflow Pipelines from your system:

export PIPELINE_VERSION=1.7.0
kubectl delete -k "github.com/kubeflow/pipelines/manifests/kustomize/env/platform-agnostic-pns?ref=$PIPELINE_VERSION"
kubectl delete -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"

You can delete the cluster for kind with the following:

kind delete cluster

Wrap Up

This lab demonstrated how you can use Kubeflow Pipelines to build and orchestrate your ML workflows. Having automated, shareable, and modular pipelines is a very useful feature in production deployments so you and your team can monitor and maintain your system more effectively. In the first Qwiklabs this week, you will use Kubeflow Pipelines as part of the Google Cloud AI Platform. You'll see more features implemented there such as integration with Tensorboard and more output visualizations from each component. If you want to know more, you can start with the Kubeflow Pipelines documentation and start conversations in Discourse.

Great job and on to the next part of the course!