JIRI SIMSA: Hi, everyone. My name is Jiri. I'm a software engineer on the TensorFlow team. And today, I'm going to be talking to you about tf.data and tf.distribute, which are TensorFlow's APIs for input pipelines and distribution strategies, respectively.

To set the stage for what I'm going to be talking about, let's think about the basic building blocks of a machine learning workflow. Machine learning operates over data. It runs some computation. And it uses some sort of hardware to do this task. This hardware can be a single CPU on your laptop. Or it can be a workstation that has one or multiple accelerators, either GPUs or TPUs, attached to it. But you can also run the computation across a large number of machines that each have one or multiple accelerators attached to them.

Now, let's talk about how these machine learning building blocks are reflected in the APIs that TensorFlow provides. For the data handling part of the machine learning task, TensorFlow provides the tf.data API. It's the input pipeline API for TensorFlow. For the computation itself, such as supervised learning, TensorFlow offers a number of different high level and low level APIs. You might be familiar with Keras or Estimators -- they've been mentioned in earlier talks today -- as well as lower level APIs for building custom training loops. And finally, to hide the hardware details of your computation, TensorFlow provides the tf.distribute API, which allows you to create your input pipeline and model in a way that's agnostic to the environment in which it's going to execute. So you can write your program as if it were going to run on a single device, and then, with minimal changes, deploy it on a large set of different devices, possibly with different machine architectures.

In this talk, I'm going to talk about the tf.data input pipeline API. And then, in the second part, I'm also going to talk about tf.distribute, the distribution strategy API. I'm not going to talk about Keras, Estimator, and other APIs for the modeling itself, as that has been covered in previous talks.

So without further ado, let's get started with tf.data, TensorFlow's input pipeline API. Let's ask ourselves a question. Why do we need an input pipeline API in the first place? Why don't we just load the data into memory, maybe in our Python program as a NumPy array, and pass it into a Keras model? Well, there is actually a number of good reasons why using an input pipeline API will benefit us. First of all, the data might not fit into memory. For example, the ImageNet dataset is 140 gigabytes of data, which does not necessarily fit into memory on every laptop or workstation. The data itself might also require randomized preprocessing, which means that we cannot preprocess everything ahead of time offline and then have the data ready for training. We actually need an input pipeline that performs the preprocessing -- such as, in the case of ImageNet, image cropping or randomized image distortions or transformations -- on the fly as we're running the machine learning computation. Having an input pipeline API as an abstraction also allows the runtime behind that API to implement things in a way that lets the computation efficiently utilize the underlying hardware.
And I'm actually going to spend a fair amount of the first part of my talk talking about how to efficiently utilize the hardware through the tf.data input pipeline abstraction. Last, but not least -- and this is something that ties the tf.data API to the tf.distribute API -- using an input pipeline abstraction allows us to decouple the task of loading and preprocessing the data from the task of distributing the computation. You create your input pipeline through the abstraction, assuming it's going to run in one place. And then the distribution strategy will distribute the data without you having to worry about the fact that the input pipeline might actually be evaluated in multiple places in parallel.

So for those reasons, we created tf.data, TensorFlow's input pipeline API. The way I like to think about an input pipeline created through tf.data is as an ETL process. What I mean by that is that E, T, and L stand for the different stages of the input pipeline. E stands for Extract. This is the stage in which we read the data, either from memory or from local or remote storage, and possibly parse the file format that the data is stored in. Perhaps it's compressed. Then T, the Transform stage. In this stage, we perform either domain specific or domain agnostic transformations. The domain specific transformations are specific to the type of data we're dealing with. For instance, text vectorization, image transformation, or temporal video sampling are examples of domain specific transformations. Domain agnostic transformations include things like shuffling your data during training or batching -- that is, combining multiple elements into a single higher dimensional element. And, finally, the last stage of the input pipeline, Load, pertains to efficiently transferring the data onto the accelerator, which is either a GPU or a TPU.

What I should point out here is that, traditionally, the input pipeline portion of your machine learning computation happens on a CPU, because some of the operations are naturally only possible on the CPU, which leaves the GPU and TPU resources available for your machine learning specific computations, such as your ML models. This puts extra pressure on the efficiency with which the input pipeline performs. The reason for that -- which is what I'm trying to illustrate here with the graph -- is that over time the rate at which CPU performance improves has plateaued, while the computational power of GPUs and TPUs, thanks to recent hardware advances, continues to grow at an exponential rate. This opens up a performance gap between the raw CPU and GPU/TPU processing power available in a single machine. The consequence of this could be that the CPU part of your machine learning computation, namely the input pipeline, becomes the bottleneck of your computation. So it's really important that the CPU input pipeline performs as efficiently as it can.

So let's take a look at an example of what a tf.data-based input pipeline actually looks like. Here, I'm using an example of what a common image processing input pipeline would look like. We first create a dataset using the TFRecordDataset operation. It's a dataset constructor that takes a set of file names or file patterns and produces the elements that are stored in those files in a sequence-like manner.
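As a rough sketch of what this Extract stage can look like in code (the file pattern below is a placeholder of my own, not something from the talk):

```python
import tensorflow as tf

# Extract: read serialized records from a set of TFRecord files.
# The file pattern is an illustrative placeholder.
filenames = tf.data.Dataset.list_files("/path/to/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(filenames)
```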
And once you create a dataset, you can chain transformations onto it, thus creating new datasets. A very common and very powerful one is the map transformation, which allows you to apply arbitrary preprocessing to the elements of the dataset. This preprocessing is expressed as a function that ends up being traced using the mechanisms available in TensorFlow, meaning the function used to transform elements of the dataset is executed as a dataflow graph, which has important implications for performance and for how the runtime can execute this function. And the last thing that I illustrate here is the batch transformation, which combines multiple elements of the input dataset and produces a single, higher dimensional element as output, which is a common practice for training efficiency.

Now, one thing that's not illustrated here, but actually does happen under the hood inside the tf.data runtime, is that for certain combinations of transformations, tf.data provides more efficient fused implementations. For instance, if a map transformation is followed by a batch transformation, we actually have a highly efficient C++ based implementation for the combination of the two that can give you up to a 2x speedup in the performance of your input pipeline. And that happens kind of magically behind the scenes. The important bit that I want to highlight here is that the user doesn't need to worry about it. The user doesn't really need to do anything with respect to optimizing the performance. They focus on creating an input pipeline with the functional preprocessing in mind. And once you have created the dataset that you would like, you can pass it into TensorFlow's high level APIs, such as Keras or Estimator, which all support the dataset abstraction as an input for the data.

So let's talk a bit more about input pipeline performance. If you were to implement the input pipeline in a naive fashion, using the CPU for the input pipeline processing or data preparation and the GPU or TPU for the training computation, you might end up in a situation like the one illustrated on the slide, where at any given point in time you're only utilizing one of the two resources available to you. And you can probably tell that this seems rather inefficient. A common technique that can be used to make this style of computation more efficient is called software pipelining. The idea is that while you're working on the current training step on the GPU or TPU, you've already started preprocessing the data for the next training step on the CPU. And thus, you overlap the computation that happens on the two devices or two resources available to you.

Achieving the effect of software pipelining in tf.data is pretty straightforward. All you do is chain a prefetch transformation onto a particular point in your input pipeline. The effect of doing that is that the producer of the data up to that point will be decoupled from the consumer of the data -- in this case, the Keras model -- and the two will operate independently, coordinating through an internal buffer. This has the desired effect of software pipelining that I illustrated on the previous slide. Another opportunity for improving the performance of your input pipeline is to parallelize the transformations.
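Before getting to parallelism, here is a minimal sketch of the kind of pipeline described above, with prefetch chained at the end. The feature names, parsing and cropping logic, file name, and batch size are illustrative assumptions rather than the exact code from the slide:

```python
import tensorflow as tf

def parse_and_preprocess(serialized_example):
    # Parse a serialized tf.train.Example; these feature names are assumptions.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    # Randomized, on-the-fly preprocessing, as discussed above.
    image = tf.image.random_crop(image, size=[224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return image, features["label"]

dataset = tf.data.TFRecordDataset(["train-00000.tfrecord"])  # placeholder file name
dataset = dataset.map(parse_and_preprocess)  # Transform: traced into a dataflow graph
dataset = dataset.batch(32)                  # combine 32 elements into one batched element
dataset = dataset.prefetch(1)                # software pipelining: decouple producer and consumer

model.fit(dataset)  # `model` stands in for an existing Keras model
```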
So the top part of this diagram illustrates that we're using sequential processing to apply the map transformation to the individual elements of the batch that we are then going to create. But there is no reason you need to do that, unless there is some sort of data or control dependency between the elements. Commonly, there is not. And in that case, you can parallelize and overlap the preprocessing of all the individual elements out of which we're going to create the batch. So let's take a look at how we would do that using the tf.data API. Similar to the software pipelining idea, this is pretty straightforward.
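Concretely, the parallelism is expressed by passing a num_parallel_calls argument to map. A minimal sketch, reusing the hypothetical parse_and_preprocess function from the earlier snippet:

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train-00000.tfrecord"])  # placeholder file name

# Apply the preprocessing function to multiple elements in parallel instead of
# one at a time; tf.data.AUTOTUNE lets the runtime pick the level of parallelism.
dataset = dataset.map(parse_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE)
dataset = dataset.batch(32)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
```

Passing tf.data.AUTOTUNE instead of a fixed number delegates the choice of parallelism to the tf.data runtime.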