Ungraded lab: Distributed Strategies with TF and Keras


Welcome! During this ungraded lab you are going to perform distributed training using TensorFlow and Keras, specifically with tf.distribute.MultiWorkerMirroredStrategy.

With the help of this strategy, a Keras model that was designed to run on a single worker can seamlessly work on multiple workers with minimal code changes. In particular you will:

  1. Perform training with a single worker.
  2. Understand the requirements for a multi-worker setup (the tf_config variable) and the use of context managers for implementing distributed strategies.
  3. Use magic commands to simulate different machines.
  4. Perform a multi-worker training strategy.

This notebook is based on the official Multi-worker training with Keras notebook, which covers some additional topics in case you want a deeper dive into the subject.

The Distributed Training with TensorFlow guide is also available for an overview of the distribution strategies TensorFlow supports, for those interested in a deeper understanding of the tf.distribute.Strategy APIs.

Let's get started!

Setup

First, some necessary imports.

Before importing TensorFlow, make a few changes to the environment.

The previous step is important since this notebook relies on writing files using the magic command %%writefile and then importing them as modules.

Now that the environment configuration is ready, import TensorFlow.
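The exact setup cells are not reproduced here, but following the official tutorial they typically disable GPUs (so the simulated workers all run on CPU), clear any stale TF_CONFIG from the environment, and make sure the current directory is importable before bringing in TensorFlow. A minimal sketch, under those assumptions:

    import json
    import os
    import sys

    # Disable GPUs so the simulated workers in this lab all run on CPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

    # Remove any TF_CONFIG left over from a previous run.
    os.environ.pop("TF_CONFIG", None)

    # Make the current directory importable, since mnist.py and main.py
    # will be written here with %%writefile and imported as modules.
    if "." not in sys.path:
      sys.path.insert(0, ".")

    import tensorflow as tf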

Dataset and model definition

Next, create an mnist.py file with a simple model and dataset setup. This Python file will be used by the worker processes in this tutorial.

The name of this file derives from the dataset you will be using, which is called MNIST and consists of 60,000 28x28 grayscale images of the digits 0 through 9.
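A minimal sketch of what mnist.py could contain, closely following the official tutorial (the layer sizes, learning rate, and the helper names mnist_dataset and build_and_compile_cnn_model are taken from that tutorial and are illustrative):

    %%writefile mnist.py

    import numpy as np
    import tensorflow as tf

    def mnist_dataset(batch_size):
      # Load the training split of MNIST (60,000 28x28 grayscale images).
      (x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
      # Scale pixel values from [0, 255] to [0, 1].
      x_train = x_train / np.float32(255)
      y_train = y_train.astype(np.int64)
      train_dataset = tf.data.Dataset.from_tensor_slices(
          (x_train, y_train)).shuffle(60000).repeat().batch(batch_size)
      return train_dataset

    def build_and_compile_cnn_model():
      # A small CNN classifier for the 10 digit classes.
      model = tf.keras.Sequential([
          tf.keras.layers.InputLayer(input_shape=(28, 28)),
          tf.keras.layers.Reshape(target_shape=(28, 28, 1)),
          tf.keras.layers.Conv2D(32, 3, activation='relu'),
          tf.keras.layers.Flatten(),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dense(10)
      ])
      model.compile(
          loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
          optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
          metrics=['accuracy'])
      return model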

Check that the file was successfully created:
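For example, by listing the Python files in the working directory:

    %%bash
    ls *.py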

Import the mnist module you just created and train the model for a small number of epochs to observe the results of a single worker and make sure everything works correctly.
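A minimal sketch of that check, assuming the helper names from the mnist.py sketch above (the batch size, epochs, and steps_per_epoch values are illustrative):

    import mnist

    batch_size = 64
    single_worker_dataset = mnist.mnist_dataset(batch_size)
    single_worker_model = mnist.build_and_compile_cnn_model()

    # A short single-worker training run to verify everything works.
    single_worker_model.fit(single_worker_dataset, epochs=3, steps_per_epoch=70)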

Everything is working as expected!

Now you will see how multiple workers can be used as a distributed strategy.

Multi-worker Configuration

Now let's enter the world of multi-worker training. In TensorFlow, the TF_CONFIG environment variable is required for training on multiple machines, each of which possibly has a different role. TF_CONFIG is a JSON string used to specify the cluster configuration on each worker that is part of the cluster.

There are two components of TF_CONFIG: cluster and task.

Let's dive into how they are used:

cluster: It is the same on all workers and provides information about the training cluster. It is a dict whose keys are job types (such as worker) and whose values are lists of the addresses of the machines performing that job.

task: It provides information about the current task and is different on each worker. It specifies the type and index of that worker.

Here is an example configuration:
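One possibility, with two workers on localhost (the first port matches the grpc://localhost:12345 target you will see later in the worker logs; the second port is an arbitrary choice):

    tf_config = {
        'cluster': {
            'worker': ['localhost:12345', 'localhost:23456']
        },
        'task': {'type': 'worker', 'index': 0}
    }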

Here is the same TF_CONFIG serialized as a JSON string:
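Serializing it is a single json.dumps call:

    import json

    # Returns the dictionary as a JSON-formatted string.
    json.dumps(tf_config)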

Explaining the TF_CONFIG example

In this example you set a TF_CONFIG with 2 workers on localhost. In practice, users would create multiple workers on external IP addresses/ports, and set TF_CONFIG on each worker appropriately.

Since you set the task type to "worker" and the task index to 0, this machine is the first worker and will be appointed as the chief worker.

Note that other machines will need to have the TF_CONFIG environment variable set as well, and it should have the same cluster dict, but different task type or task index depending on what the roles of those machines are. For instance, for the second worker you would set tf_config['task']['index']=1.

Quick note on environment variables and subprocesses in notebooks

Above, tf_config is just a local variable in Python. To actually use it to configure training, this dictionary needs to be serialized as JSON and placed in the TF_CONFIG environment variable.

In the next section, you'll spawn new subprocesses for each worker using the %%bash magic command. Subprocesses inherit environment variables from their parent, so they can access TF_CONFIG.
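As a quick illustration of that inheritance (the variable name GREETINGS is just an example), you can set an environment variable from the notebook's Python process:

    os.environ['GREETINGS'] = 'Hello TensorFlow!'

and then read it back from a bash subprocess in a separate cell:

    %%bash
    echo ${GREETINGS}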

You would never really launch your jobs this way (as subprocesses of an interactive Python runtime), but it's how you will do it for the purposes of this tutorial.

Choose the right strategy

In TensorFlow there are two main forms of distributed training:

  1. Synchronous training, where the steps of training are synced across the workers and replicas (this is the case for the mirrored strategies).
  2. Asynchronous training, where the training steps are not strictly synced (for example, parameter server training).

MultiWorkerMirroredStrategy, which is the recommended strategy for synchronous multi-worker training, is the one you will be using.

To train the model, use an instance of tf.distribute.MultiWorkerMirroredStrategy.
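Creating it takes a single line; the resulting strategy object is what the rest of this lab refers to:

    strategy = tf.distribute.MultiWorkerMirroredStrategy()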

MultiWorkerMirroredStrategy creates copies of all variables in the model's layers on each device across all workers. It uses CollectiveOps, a TensorFlow op for collective communication, to aggregate gradients and keep the variables in sync. The official TF distributed training guide has more details about this.

Implement Distributed Training via Context Managers

To distribute the training to multiple workers, all you need to do is enclose the model building and model.compile() call inside strategy.scope().
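A minimal sketch of that pattern, reusing the helper from the mnist.py sketch above (build_and_compile_cnn_model already calls model.compile() internally, so both steps happen inside the scope):

    with strategy.scope():
      # Model building/compiling needs to happen within the scope so that
      # the variables are created as MirroredVariables on every worker.
      multi_worker_model = mnist.build_and_compile_cnn_model()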

The distribution strategy's scope dictates how and where the variables are created, and in the case of MultiWorkerMirroredStrategy, the variables created are MirroredVariables, and they are replicated on each of the workers.

Note: TF_CONFIG is parsed and TensorFlow's GRPC servers are started at the time MultiWorkerMirroredStrategy() is called, so the TF_CONFIG environment variable must be set before a tf.distribute.Strategy instance is created.

Since TF_CONFIG is not set yet, the above strategy is effectively single-worker training.

Train the model

Create training script

To actually run with MultiWorkerMirroredStrategy you'll need to run worker processes and pass a TF_CONFIG to them.

Like the mnist.py file written earlier, here is the main.py that each of the workers will run:
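A minimal sketch of main.py, consistent with the mnist.py sketch above (batch size, epochs, and steps_per_epoch are illustrative values):

    %%writefile main.py

    import os
    import json

    import tensorflow as tf
    import mnist

    per_worker_batch_size = 64

    # Each worker reads the cluster description from TF_CONFIG.
    tf_config = json.loads(os.environ['TF_CONFIG'])
    num_workers = len(tf_config['cluster']['worker'])

    # TF_CONFIG must be set before the strategy is instantiated.
    strategy = tf.distribute.MultiWorkerMirroredStrategy()

    # The global batch is split across workers, so scale it by num_workers.
    global_batch_size = per_worker_batch_size * num_workers
    multi_worker_dataset = mnist.mnist_dataset(global_batch_size)

    with strategy.scope():
      # Model building/compiling needs to be within strategy.scope().
      multi_worker_model = mnist.build_and_compile_cnn_model()

    multi_worker_model.fit(multi_worker_dataset, epochs=3, steps_per_epoch=70)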

In the code snippet above note that the global_batch_size, which gets passed to Dataset.batch, is set to per_worker_batch_size * num_workers. This ensures that each worker processes batches of per_worker_batch_size examples regardless of the number of workers.

The current directory should now contain both Python files:
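For example:

    %%bash
    ls *.py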

Set TF_CONFIG environment variable

Now json-serialize the TF_CONFIG and add it to the environment variables:
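A one-liner, assuming the tf_config dictionary defined earlier:

    os.environ['TF_CONFIG'] = json.dumps(tf_config)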

And terminate all background processes:
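In a notebook this can be done with the %killbgscripts line magic, which stops scripts previously started in the background with %%bash --bg:

    # Kill any background worker processes left over from previous runs.
    %killbgscripts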

Launch the first worker

Now, you can launch a worker process that will run the main.py and use the TF_CONFIG:
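A sketch of that cell (the log file name job_0.log is an arbitrary choice, reused below when inspecting the worker's output):

    %%bash --bg
    python main.py &> job_0.log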

There are a few things to note about the above command:

  1. It uses %%bash, which is a notebook "magic" that runs some bash commands.
  2. It uses the --bg flag to run the bash process in the background, because this worker will not terminate on its own: it waits for all the other workers before it starts.

The backgrounded worker process won't print output to this notebook, so the &> redirects its output to a file so that you can see what happened.

So, wait a few seconds for the process to start up:
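For example (the exact delay is arbitrary):

    import time
    time.sleep(10)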

Now look what's been output to the worker's logfile so far using the cat command:
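Assuming the job_0.log name used when launching the worker:

    %%bash
    cat job_0.log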

The last line of the log file should say: Started server with target: grpc://localhost:12345. The first worker is now ready, and is waiting for all the other worker(s) to be ready to proceed.

Launch the second worker

Now update the tf_config for the second worker's process to pick up:
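Only the task index changes; the cluster dict stays the same:

    tf_config['task']['index'] = 1
    os.environ['TF_CONFIG'] = json.dumps(tf_config)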

Now launch the second worker. This will start the training since all the workers are active (so there's no need to background this process):
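A sketch of that cell (run in the foreground so the training output shows up in the notebook):

    %%bash
    python main.py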

Now if you recheck the logs written by the first worker, you'll see that it participated in training the model:
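Using the same cat command as before:

    %%bash
    cat job_0.log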

Unsurprisingly, this ran slower than the test run at the beginning of this tutorial. Running multiple workers on a single machine only adds overhead. The goal here was not to improve the training time, but only to give an example of multi-worker training.


Congratulations on finishing this ungraded lab! Now you should have a clearer understanding of how to implement distributed strategies with TensorFlow and Keras.

Although this tutorial didn't show the true power of a distributed strategy, since that would require multiple machines operating on the same network, you now know what this process looks like at a high level.

In practice, and especially with very big models, distributed strategies are commonly used because they provide a way of better managing resources to perform time-consuming tasks, such as training, in a fraction of the time it would take without the strategy.

Keep it up!