PRIYA GUPTA: Let's begin with the obvious question. Why should one care about distributed training? Training complex neural networks with large amounts of data can often take a long time. In the graph here, you can see that training the ResNet 50 model on a single but powerful GPU can take up to four days. If you have some experience running complex machine learning models, this may sound rather familiar to you. Bringing down your training time from days to hours can have a significant effect on your productivity, because you can try out new ideas faster.

In this talk, we're going to talk about distributed training, that is, running training in parallel on multiple devices such as CPUs, GPUs, or TPUs to bring down your training time. With the techniques we'll talk about in this talk, you can bring your training time down from weeks or days to hours, with just a few lines of code changed and some powerful hardware.

To achieve these goals, we're pleased to introduce the new distribution strategy API. This is an easy way to distribute your TensorFlow training with very little modification to your code. With the distribution strategy API, you no longer need to place ops or parameters on specific devices, and you don't need to restructure your model in a way that the losses and gradients get aggregated correctly across the devices. Distribution strategy takes care of all of that for you.

So let's go over the key goals of distribution strategy. The first one is ease of use: we want you to make minimal code changes in order to distribute your training. The second is to give great performance out of the box: ideally, the user shouldn't have to change or configure any settings to get the most performance out of their hardware. And third, we want distribution strategy to work in a variety of different situations. Whether you want to scale your training on different hardware like GPUs or TPUs, use different APIs like Keras or Estimator, or run different distribution architectures like synchronous or asynchronous training, we want the one distribution strategy API to be useful for you in all these situations.

So if you're just beginning with machine learning, you might start your training with a multi-core CPU on your desktop. TensorFlow takes care of scaling onto a multi-core CPU automatically. Next, you may add a GPU to your desktop to scale up your training. As long as you build your program with the right CUDA libraries, TensorFlow will automatically run your training on the GPU and give you a nice performance boost. But what if you have multiple GPUs on your machine, and you want to use all of them for your training? This is where distribution strategy comes in. In the next section, we're going to talk about how you can use distribution strategy to scale your training to multiple GPUs.

First, we'll look at some code to train the ResNet 50 model without any distribution. We'll use the Keras API, which is the recommended TensorFlow high-level API. We begin by creating some datasets for training and validation using the tf.data API. For the model, we'll simply reuse the ResNet 50 that's prepackaged with Keras in TensorFlow. Then we create an optimizer that we'll be using in our training. Once we have these pieces, we can compile the model, providing the loss and optimizer and maybe a few other things like metrics, which I've omitted in the slide here.
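The setup described on the slide can be sketched roughly as follows. This is a minimal, self-contained sketch rather than the exact code from the talk: the tiny synthetic dataset stands in for a real tf.data input pipeline over ImageNet, and the optimizer settings are illustrative.

```python
import numpy as np
import tensorflow as tf

# Tiny synthetic stand-in data so the sketch runs on its own; in practice the
# training and validation datasets come from a real tf.data input pipeline.
images = np.random.rand(32, 224, 224, 3).astype("float32")
labels = np.random.randint(0, 1000, size=(32,))

train_dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
                 .shuffle(32).repeat().batch(8))
eval_dataset = tf.data.Dataset.from_tensor_slices((images, labels)).batch(8)

# Reuse the ResNet 50 model prepackaged with Keras.
model = tf.keras.applications.ResNet50(weights=None, classes=1000)

# Create the optimizer we'll use in training (values here are illustrative).
optimizer = tf.keras.optimizers.SGD(0.1, momentum=0.9)

# Compile the model with the loss and optimizer, plus metrics.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
```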
Once the model's compiled, you can then begin your training by calling model.fit, providing the training dataset that you created earlier, along with how many epochs you want to run the training for. Fit will train your model and update the model's variables. Then you can call evaluate with the validation dataset to see how well your training did.

So given this code to run your training on a single machine or a single GPU, let's see how we can use distribution strategy to now run it on multiple GPUs. It's actually very simple. You need to make only two changes: first, create an instance of something called mirrored strategy, and second, pass the strategy instance to the compile call with the distribute argument. That's it. That's all the code changes you need to now run this code on multiple GPUs using distribution strategy (the code sketch at the end of this section shows these two changes). Mirrored strategy is one type of the distribution strategy API that we introduced earlier. This API is available in the TensorFlow 1.11 release, which will be out very shortly. And at the bottom of the slide, we've linked to a complete example of training [INAUDIBLE] with Keras and multiple GPUs that you can try out.

With mirrored strategy, you don't need to make any changes to your model code or your training loop, so it's very easy to use. This is because we've changed many underlying components of TensorFlow to be distribution aware: the optimizer, batch norm layers, metrics, and summaries are all now distribution aware. You also don't need to make any changes to your input pipeline, as long as you're using the recommended tf.data APIs. And finally, saving and checkpointing work seamlessly as well, so you can save with no distribution strategy, or with one, and restore with another seamlessly.

Now that you've seen some code on how to use mirrored strategy to scale to multiple GPUs, let's look under the hood a little bit and see what mirrored strategy does. In a nutshell, mirrored strategy implements a data parallelism architecture. It mirrors the variables on each device, e.g., each GPU, hence the name mirrored strategy, and it uses AllReduce to keep these variables in sync. Using these techniques, it implements synchronous training. So that's a lot of terminology. Let's unpack each of these a bit.

What is data parallelism? Let's say you have N workers or N devices. In data parallelism, each device runs the same model and computation, but on a different subset of the input data. Each device computes the loss and gradients based on the training samples that it sees, and then we combine these gradients and update the model's parameters. The updated model is then used in the next round of computation.

As I mentioned before, mirrored strategy mirrors the variables across the different devices. So let's say you have a variable A in your model. It'll be replicated as A0, A1, A2, and A3 across the four different devices, and together these four replicas conceptually form a single variable called a mirrored variable. These variables are kept in sync by applying identical updates. A class of algorithms called AllReduce can be used to keep the variables in sync by applying identical gradient updates. AllReduce algorithms aggregate the gradients across the different devices, for example by adding them up, and make them available on each device. It's a fused algorithm that can be very efficient and reduce the overhead of synchronization by quite a bit.
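Before going deeper into AllReduce, here is a rough sketch of the two code changes just described, continuing the single-GPU sketch above. It follows the 1.x-era API described in the talk (tf.contrib.distribute.MirroredStrategy and the distribute argument to compile); in later TensorFlow releases the strategy moved to tf.distribute.MirroredStrategy and is applied with strategy.scope() instead, and exactly which optimizer types were supported varied across early releases. The step counts are illustrative.

```python
import tensorflow as tf

# Change 1: create a MirroredStrategy instance. It will mirror the model's
# variables across the available GPUs and keep them in sync with AllReduce.
strategy = tf.contrib.distribute.MirroredStrategy()

# Change 2: pass the strategy to compile via the `distribute` argument.
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"],
              distribute=strategy)

# The training loop itself is unchanged: fit trains and updates the variables,
# evaluate checks the model against the validation dataset.
model.fit(train_dataset, epochs=10, steps_per_epoch=1000)
model.evaluate(eval_dataset, steps=4)
```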
There are many AllReduce algorithms available, depending on the kind of communication available between the different devices. One common algorithm is what is known as ring all-reduce. In ring all-reduce, each device sends a chunk of its gradients to its successor on the ring and receives another chunk from its predecessor. There are a few more such rounds of gradient exchanges, and at the end of these exchanges, each device has received a copy of all the combined gradients (there's a small sketch of this exchange at the end of this section). Ring all-reduce also uses network bandwidth optimally, because it ensures that both the upload and download bandwidth at each host is fully utilized. We have a team working on fast implementations of AllReduce for various network topologies. Some hardware vendors, such as NVIDIA, provide specialized implementations of all-reduce for their hardware, for example, NVIDIA [INAUDIBLE]. The bottom line is that AllReduce can be fast when you have multiple devices on a single machine or a small number of machines with strong connectivity.

Putting all these pieces together, mirrored strategy uses mirrored variables and AllReduce to implement synchronous training. So let's see how that works. Let's say you have two devices, device 0 and device 1, and your model has two layers, A and B. Each layer has a single variable, and as you can see, the variables are replicated across the two devices. Each device receives one subset of the input data and computes the forward pass using its local copy of the variables. It then runs the backward pass and computes the gradients. Once the gradients are computed on each device, the devices communicate with each other using AllReduce to aggregate the gradients. And once the gradients are aggregated, each device updates its local copy of the variables. So in this way, the devices are always kept in sync. The next forward pass doesn't begin until each device has received a copy of the combined gradients and updated its variables.

AllReduce can further optimize things and bring down your training time by overlapping the computation of gradients at the lower layers of the network with the transmission of gradients at the higher layers. So in this case, you can compute the gradients of layer A while you're transmitting the gradients for layer B, and this can further reduce your training time.

So now that we've seen how mirrored strategy looks under the hood, let's look at what type of performance and scaling you can expect when using mirrored strategy with multiple GPUs. We use the ResNet 50 model with the ImageNet dataset for our benchmarking. It's a very popular benchmark for performance measurement. We use NVIDIA Tesla V100 GPUs on Google Cloud, and we use a batch size of 128 per GPU. On the x-axis here, you can see the number of GPUs, and on the y-axis, you can see the images per second processed during training. As you can see, as we increase the number of GPUs from one to two to four to eight, the images per second processed is close to doubling every time. In fact, we're able to achieve 90% to 95% scaling out of the box. Note that these numbers were obtained using the ResNet 50 model that's available in our official model garden repo, and currently it uses the Estimator API. We're actively working on Keras performance.

So far, we've talked a lot about scaling onto multiple GPUs. What about Cloud TPUs? TPU stands for tensor processing unit. These are custom ASICs designed and built by Google especially for accelerating machine learning workloads.
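Before moving on to the TPU hardware, here is the toy ring all-reduce sketch mentioned earlier: a single-process NumPy model of the two phases, reduce-scatter followed by all-gather. It only simulates the chunk bookkeeping; real multi-device implementations run these exchanges in parallel over the device interconnect, and this is not TensorFlow's actual implementation.

```python
import numpy as np

def ring_allreduce(grads):
    """Toy simulation of ring all-reduce over per-device 1-D gradient arrays.

    After the call, every "device" holds the element-wise sum of all gradients.
    This is a single-process model of the communication pattern only.
    """
    n = len(grads)
    # Each device splits its gradient vector into n chunks.
    chunks = [list(np.array_split(np.asarray(g, dtype=np.float64), n))
              for g in grads]

    # Phase 1: reduce-scatter. In step s, device i sends chunk (i - s) mod n to
    # its successor on the ring, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i - s) % n
            chunks[(i + 1) % n][c] = chunks[(i + 1) % n][c] + chunks[i][c]

    # After phase 1, device i owns the fully reduced chunk (i + 1) mod n.
    # Phase 2: all-gather. The reduced chunks travel around the ring until
    # every device has a copy of every reduced chunk.
    for s in range(n - 1):
        for i in range(n):
            c = (i + 1 - s) % n
            chunks[(i + 1) % n][c] = chunks[i][c]

    return [np.concatenate(device_chunks) for device_chunks in chunks]

# Example: four "devices", each with an 8-element gradient vector.
grads = [np.arange(8.0) * (d + 1) for d in range(4)]
reduced = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in reduced)
```

After the call, every device holds the element-wise sum of the per-device gradients, which is exactly the property mirrored strategy relies on to apply identical updates to each replica of the variables.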
In the picture here, you can see the various generations of TPUs. On the top left, you can see TPU v1. In the middle, you can see Cloud TPU v2, which is now generally available in Google Cloud. And on the right side, you can see TPU v3, which was just announced at Google I/O a few months ago and is now available in alpha.