
  • PRIYA GUPTA: Let's begin with the obvious question.

  • Why should one care about distributed training?

  • Training complex neural networks with large amounts of data

  • can often take a long time.

  • In the graph here, you can see training

  • the ResNet 50 model on a single but powerful GPU

  • can take up to four days.

  • If you have some experience running complex machine

  • learning models, this may sound rather familiar to you.

  • Bringing down your training time from days to hours

  • can have a significant effect on your productivity

  • because you can try out new ideas faster.

  • In this talk, we're going to talk

  • about distributed training, that is running training in parallel

  • on multiple devices such as CPUs, GPUs, or TPUs

  • to bring down your training time.

  • With the techniques that we'll talk about in this talk,

  • you can bring down your training time from weeks or days

  • to hours with just a few lines of code changes

  • and some powerful hardware.

  • To achieve these goals, we're pleased to introduce

  • the new distribution strategy API.

  • This is an easy way to distribute your TensorFlow

  • training with very little modification to your code.

  • With the distribution strategy API, you no longer

  • need to place ops or parameters on specific devices,

  • and you don't need to restructure a model in a way

  • that the losses and gradients get aggregated correctly

  • across the devices.

  • Distribution strategy takes care of all of that for you.

  • So let's go over the key goals of distribution

  • strategy.

  • The first one is ease of use.

  • We want you to make minimal code changes in order

  • to distribute your training.

  • The second is to give great performance out of the box.

  • Ideally, the user shouldn't have to

  • change or configure any settings to get the most performance out

  • of their hardware.

  • And third we want distribution strategy

  • to work in a variety of different situations,

  • so whether you want to scale your training

  • on different hardware like GPUs or TPUs

  • or you want to use different APIs like Keras or Estimator

  • or if you want to use

  • different distribution architectures

  • like synchronous or asynchronous training,

  • we want distribution strategy to be useful for you

  • in all these situations.

  • So if you're just beginning with machine learning,

  • you might start your training with a multi-core CPU

  • on your desktop.

  • TensorFlow takes care of scaling onto a multi-core CPU

  • automatically.

  • Next, you may add a GPU to your desktop

  • to scale up your training.

  • As long as you build your program with the right CUDA

  • libraries, TensorFlow will automatically

  • run your training on the GPU and give you a nice performance

  • boost.

  • But what if you have multiple GPUs on your machine,

  • and you want to use all of them for your training?

  • This is where distribution strategy comes in.

  • In the next section, we're going to talk

  • about how you can use distribution strategy to scale

  • your training to multiple GPUs.

  • First, we'll look at some code to train the ResNet 50

  • model without any distribution.

  • We'll use the Keras API, which is the recommended TensorFlow

  • high-level API.

  • We begin by creating some datasets

  • for training and validation using the tf.data API.

  • For the model, we'll simply reuse

  • the ResNet 50 that's prepackaged with Keras and TensorFlow.

  • Then we create an optimizer that we'll be using in our training.

  • Once we have these pieces, we can compile the model providing

  • the loss and optimizer and maybe a few other things

  • like metrics, which I've omitted in the slide here.

  • Once the model's compiled, you can then begin your training

  • by calling model.fit, providing the training

  • dataset that you created earlier, along with how many

  • epochs you want to run the training for.

  • Fit will train your model and update the model's variables.

  • Then you can call evaluate with the validation dataset

  • to see how well your training did.
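
The flow described above might look roughly like the sketch below. The input pipeline is a placeholder (the talk doesn't show the preprocessing details, and make_dataset is a hypothetical helper), and the hyperparameters are illustrative only.

```python
import tensorflow as tf

# Placeholder input pipelines built with the tf.data API. The actual file
# paths, decoding, and augmentation are omitted; make_dataset is a
# hypothetical helper standing in for that code.
train_dataset = make_dataset("train")        # yields (image, label) batches
eval_dataset = make_dataset("validation")

# Reuse the ResNet-50 definition that ships with Keras in TensorFlow.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Create the optimizer used for training (SGD with momentum is illustrative).
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9)

# Compile the model, providing the loss, the optimizer, and optionally metrics.
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Train, then evaluate on the validation dataset.
model.fit(train_dataset, epochs=90)
model.evaluate(eval_dataset)
```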

  • So given this code to run your training

  • on a single machine or a single GPU,

  • let's see how we can use distribution strategy

  • to now run it on multiple GPUs.

  • It's actually very simple.

  • You need to make only two changes.

  • First, create an instance of something called

  • mirrored strategy, and second, pass the strategy instance

  • to the compile call with the distribute argument.

  • That's it.

  • That's all the code changes you need

  • to now run this code on multiple GPUs using distribution

  • strategy.
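
Concretely, the two changes described in the talk look roughly like this. Note this is the TF 1.11-era API, where the strategy lived under tf.contrib and was passed to compile via the distribute argument; in current TensorFlow 2.x you would instead create tf.distribute.MirroredStrategy() and build and compile the model inside strategy.scope().

```python
import tensorflow as tf

# Change 1: create a MirroredStrategy instance.
# (TF 1.11-era location; in TF 2.x this is tf.distribute.MirroredStrategy().)
strategy = tf.contrib.distribute.MirroredStrategy()

# Change 2: pass the strategy to compile via the distribute argument.
# (In TF 2.x, wrap model creation and compile in `with strategy.scope():`
#  instead of passing distribute= here.)
model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              distribute=strategy)

# The rest of the training code (fit/evaluate) is unchanged.
model.fit(train_dataset, epochs=90)
model.evaluate(eval_dataset)
```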

  • Mirrored strategy is a type of distribution strategy API

  • that we introduced earlier.

  • This API is available in the TensorFlow 1.11 release,

  • which will be out very shortly.

  • And in the bottom of the slide, we've

  • linked to a complete example of training [INAUDIBLE]

  • with Keras and multiple GPUs that you can try out.

  • With mirrored strategy, you don't need

  • to make any changes to your model code or your training

  • loop, which makes it very easy to use.

  • This is because we've changed many underlying components

  • of TensorFlow to be distribution aware.

  • This includes the optimizer, batch norm layers, metrics,

  • and summaries, which are all now distribution aware.

  • You don't need to make any changes to your input pipeline

  • either, as long as you're using the recommended tf.data APIs.

  • And finally saving and checkpointing work

  • seamlessly as well.

  • So you can save with no distribution strategy or with one

  • strategy and restore with another seamlessly.

  • Now that you've seen some code on how

  • to use mirrored strategy to scale to multiple GPUs,

  • let's look under the hood a little bit

  • and see what mirrored strategy does.

  • In a nutshell, mirrored strategy implements a data parallelism

  • architecture.

  • It mirrors the variables on each device, e.g. each GPU,

  • and hence the name mirrored strategy,

  • and it uses AllReduce to keep these variables in sync.

  • And using these techniques, it implements

  • synchronous training.

  • So that's a lot of terminology.

  • Let's unpack each of these a bit.

  • What is data parallelism?

  • Let's say you have N workers or N devices.

  • In data parallelism, each device runs the same model

  • and computation but on a different subset

  • of the input data.

  • Each device computes the loss and gradients

  • based on the training samples that it sees.

  • And then we combine these gradients

  • and update the model's parameters.

  • The updated model is then used in the next round

  • of computation.
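
As a toy illustration of data parallelism, here is a sketch in plain NumPy. The "devices" are just loop iterations, and the model and numbers are made up for the example; it only shows the combine-gradients-then-update-identically pattern just described.

```python
import numpy as np

np.random.seed(0)
w = np.array([2.0])                     # one parameter, mirrored on every device
x = np.random.randn(8, 1)               # a global batch of 8 examples
y = 3.0 * x                             # targets for a simple linear regression

num_devices = 4
x_shards = np.split(x, num_devices)     # each device sees a different subset
y_shards = np.split(y, num_devices)

# Each device computes the loss gradient on its own shard of the data.
local_grads = []
for xs, ys in zip(x_shards, y_shards):
    pred = xs * w
    grad = (2.0 * (pred - ys) * xs).mean()   # d/dw of mean squared error
    local_grads.append(grad)

# Combine the per-device gradients (this is what all-reduce does for real
# devices) and apply the identical update everywhere, keeping copies in sync.
global_grad = np.mean(local_grads)
w -= 0.1 * global_grad
```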

  • As I mentioned before, mirrored strategy mirrors the variables

  • across the different devices.

  • So let's say you have a variable A in your model.

  • It'll be replicated as A0, A1, A2, and A3

  • across the four different devices.

  • And together these four variables

  • form a single conceptual variable

  • called a mirrored variable.

  • These variables are kept in sync by applying identical updates.

  • A class of algorithms called AllReduce

  • can be used to keep variables in sync

  • by applying identical gradient updates.

  • AllReduce algorithms can be used to aggregate the gradients

  • across the different devices, for example,

  • by adding them up and making them available on each device.

  • It's a fused algorithm that can be very efficient

  • and reduce the overhead of synchronization by quite a bit.

  • There are many versions of

  • AllReduce algorithms available based

  • on the communication available between the different devices.

  • One common algorithm is what is known as ring all-reduce.

  • In ring all-reduce, each device sends a chunk of its gradients

  • to its successor on the ring and receives another chunk

  • from its predecessor.

  • There are a few more such rounds of gradient exchanges,

  • and at the end of these exchanges,

  • each device has received a combined

  • copy of all the gradients.

  • Ring all-reduce also uses network bandwidth optimally

  • because it ensures that both the upload and download bandwidth

  • at each host is fully utilized.
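
Here is a toy simulation of the ring all-reduce pattern just described, in pure NumPy with the "sends" of each round applied synchronously. It is only a sketch: real implementations such as vendor collective libraries run these exchanges over actual device links and overlap them with computation.

```python
import numpy as np

def ring_allreduce_sum(tensors):
    """Simulate ring all-reduce (sum) over equal-length 1-D gradients,
    one per device. Returns the fully reduced tensor for every device."""
    n = len(tensors)
    # Each device splits its gradient into n chunks.
    chunks = [np.array_split(t.astype(float), n) for t in tensors]

    # Phase 1 (scatter-reduce): in each of n-1 rounds, device i sends one
    # chunk to its successor, which adds it to its own copy. Afterwards,
    # device i holds the fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, chunks[i][(i - step) % n].copy())
                 for i in range(n)]
        for i, idx, data in sends:
            chunks[(i + 1) % n][idx] += data

    # Phase 2 (all-gather): in n-1 more rounds, each device forwards a fully
    # reduced chunk to its successor until every device has every chunk.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, chunks[i][(i + 1 - step) % n].copy())
                 for i in range(n)]
        for i, idx, data in sends:
            chunks[(i + 1) % n][idx] = data

    return [np.concatenate(c) for c in chunks]

# Example: four devices, each with its own gradient vector of length 8.
grads = [np.arange(8) * (d + 1) for d in range(4)]
reduced = ring_allreduce_sum(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```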

  • We have a team working on fast implementations of all

  • reduce for various network topologies.

  • Some hardware vendors such as NVIDIA

  • provide specialized implementations

  • of all-reduce for their hardware, for example,

  • NVIDIA [INAUDIBLE].

  • The bottom line is that AllReduce can be fast

  • when you have multiple devices on a single machine

  • or a small number of machines with strong connectivity.

  • Putting all these pieces together,

  • mirrored strategy uses mirrored variables and all

  • reduce to implement synchronous training.

  • So let's see how that works.

  • Let's say you have two devices, device 0 and 1,

  • and your model has two layers, A and B. Each layer has

  • a single variable.

  • And as you can see, the variables

  • are replicated across the two devices.

  • Each device receives one subset of the input data,

  • and it computes the forward pass using its local copy

  • of the variables.

  • It then computes a backward pass and computes the gradients.

  • Once the gradients are computed on each device,

  • the devices communicate with each other

  • using all reduce to aggregate the gradients.

  • And once the gradients are aggregated,

  • each device updates its local copy of the variables.

  • So in this way, the devices are always kept in sync.

  • The next forward pass doesn't begin

  • until each device has received a copy of the combined gradients

  • and updated its variables.
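
For reference, the per-step synchronization described above maps onto the custom-training-loop API of later TensorFlow 2.x releases roughly as follows. This is not the TF 1.11 code from the talk; the model choice, batch size, and hyperparameters are placeholders, and the dataset wiring is only noted in a comment.

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()      # one replica per local GPU
GLOBAL_BATCH = 128 * strategy.num_replicas_in_sync

# Variables created under the scope are mirrored across the devices.
with strategy.scope():
    model = tf.keras.applications.ResNet50(weights=None)
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)  # reduce manually below

@tf.function
def train_step(images, labels):
    def step_fn(images, labels):
        # Forward pass on this replica's shard of the global batch.
        with tf.GradientTape() as tape:
            logits = model(images, training=True)
            per_example_loss = loss_fn(labels, logits)
            # Scale by the global batch size so the summed gradients
            # across replicas match a single large-batch update.
            loss = tf.nn.compute_average_loss(
                per_example_loss, global_batch_size=GLOBAL_BATCH)
        # Backward pass; apply_gradients triggers the all-reduce so every
        # replica applies the same combined update to its variable copies.
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # Run the step on every replica, then combine the per-replica losses.
    per_replica_loss = strategy.run(step_fn, args=(images, labels))
    return strategy.reduce(
        tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)

# train_step would be driven by a dataset distributed with
# strategy.experimental_distribute_dataset(train_dataset).
```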

  • All-reduce can further optimize things and bring down

  • your training time by overlapping computation

  • of gradients at lower layers in the network with transmission

  • of gradients at the higher layers.

  • So in this case,

  • you can compute the gradients of layer A

  • while you're transmitting the gradients for layer B.

  • And this can further reduce your training time.

  • So now that we've seen how mirrored strategy looks

  • under the hood, let's look at what type of performance

  • and scaling you can expect when using

  • mirrored strategy for multiple GPUs.

  • We use the ResNet 50 model with the ImageNet dataset

  • for our benchmarking.

  • It's a very popular benchmark for performance measurement.

  • And we use NVIDIA Tesla V100 GPUs on Google Cloud.

  • And we use a batch size of 128 per GPU.

  • On the x-axis here, you can see the number of GPUs,

  • and on the y-axis, you can see the images per second processed

  • during training.

  • As you can see, as we increase the number of GPUs

  • from one to two to four to eight,

  • the images per second processed is

  • close to doubling every time.

  • In fact, we're able to achieve 90% to 95% scaling out

  • of the box.

  • Note that these numbers were obtained by using the ResNet 50

  • model that's available in our official model garden repo,

  • and currently it uses the estimator API.

  • We're working on Keras performance actively.

  • So far, we've talked a lot about scaling onto multiple GPUs.

  • What about cloud TPUs?

  • TPU stands for Tensor Processing Unit.

  • These are custom ASICs designed and built by Google

  • especially for accelerating machine learning workloads.

  • In the picture here, you can see the various generations

  • of TPUs.

  • On the top left, you can see TPU v1.

  • In the middle, you can see Cloud TPU v2,

  • which is now generally available in Google Cloud.

  • And on the right side you can see

  • TPU v3, which was just announced at Google I/O a few months ago

  • and is now available in alpha.