  • JIRI SIMSA: Hi, everyone.

  • My name is Jiri.

  • I'm a software engineer on the TensorFlow team.

  • And today, I'm going to be talking to you about tf.data

  • and tf.distribute, which are TensorFlow's APIs for input

  • pipeline and distribution strategy, respectively.

  • To set the stage for what I'm going to be talking about,

  • let's think about the basic building blocks of a machine learning workflow.

  • Machine learning operates over data.

  • It runs some computation.

  • And it uses some sort of hardware to do this task.

  • This hardware can be a single CPU on your laptop,

  • or it can be a workstation that

  • has one or multiple accelerators,

  • either GPUs or TPUs, attached to it.

  • But you can also run the computation

  • across a large number of machines

  • that each have one or multiple accelerators attached to it.

  • Now, let's talk about how the machine learning building

  • blocks are reflected in the APIs

  • that TensorFlow provides.

  • So for the data handling part of the machine learning task,

  • TensorFlow provides the tf.data API.

  • It's the input pipeline API for TensorFlow.

  • For the computation itself, such as supervised learning,

  • TensorFlow offers a number of different APIs,

  • both high level and low level.

  • You might be familiar with Keras or Estimators--

  • they've been mentioned in earlier talks today--

  • as well as lower level APIs for building custom training loops.

  • And finally, to hide the hardware details

  • of your computation, TensorFlow provides a tf.distribute API,

  • which allows you to create your input pipeline

  • and model in a way that's agnostic to the environment

  • in which it's going to execute.

  • So you write your program as if it's

  • going to run, perhaps, on a single device,

  • and with minimal changes you're able to deploy it

  • on a large set of different devices,

  • possibly with different machine architectures.

  • In this talk, I'm going to talk about the tf.data input

  • pipeline API.

  • And then, in the second part, I'm

  • also going to talk about tf.distribute, the distribution

  • strategy API.

  • I'm not going to talk about Keras, and Estimator,

  • and other APIs for the modeling itself,

  • as that has been covered in previous talks.

  • So without further ado, let's get

  • started with tf.data, which is TensorFlow's input pipeline API.

  • So let's ask ourselves a question.

  • Why do we need an input pipeline API in the first place?

  • Why don't we just load the data in memory,

  • maybe in our Python program as a NumPy array,

  • and pass it into a Keras model?

  • Well, there is actually a number of good reasons

  • why we need an API or why using one will benefit us.

  • First of all, the data might not fit into memory.

  • For example, the ImageNet data set

  • is 140 gigabytes of data, which does not necessarily

  • fit into memory on every laptop or workstation.

  • The data itself might also require randomized

  • preprocessing, which means that we cannot preprocess everything

  • ahead of time offline and then have the data ready

  • for training.

  • We actually need to have an input pipeline that

  • performs the preprocessing, such as, in the case of ImageNet,

  • perhaps image cropping or randomized image distortions

  • or transformations, on the fly, as we're

  • running the machine learning computation.

  • Having an input pipeline API as an abstraction also

  • allows us, in the runtime of this API,

  • to implement things in a way that lets the computation

  • efficiently utilize the underlying hardware.

  • And I'm actually going to spend a fair amount of the first part

  • of my talk talking about how to efficiently utilize

  • the hardware through the tf.data input pipeline abstraction.

  • Last, but not least, which is something that ties the tf.data

  • API to the tf.distribute API, using an input pipeline API

  • abstraction allows us to decouple

  • the task of loading and preprocessing of the data

  • from the task of distributing the computation.

  • By using the abstraction, you

  • can create your input pipeline assuming

  • it's going to run in one place.

  • And then the distribution strategy

  • will somehow distribute the data without you

  • having to worry about the fact that the input pipeline might

  • actually be evaluated in multiple places in parallel.
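
As a minimal sketch of that decoupling, assuming TensorFlow 2.x (MirroredStrategy here is just one example of a strategy, and the data set contents are placeholders):

```python
import tensorflow as tf

# The input pipeline is written as if it runs in one place.
def make_dataset():
    # Placeholder pipeline; any tf.data pipeline works here.
    return tf.data.Dataset.range(1000).batch(32)

# The strategy takes care of evaluating the pipeline, possibly
# in multiple places in parallel, without changes to make_dataset.
strategy = tf.distribute.MirroredStrategy()
dist_dataset = strategy.experimental_distribute_dataset(make_dataset())
```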

  • So for those reasons, we created tf.data, TensorFlow's input

  • pipeline API.

  • And the way I like to think about an input

  • pipeline created through tf.data is that

  • it's an ETL process.

  • What I mean by that is that E, T, and L

  • stand for the different stages of the input pipeline.

  • E stands for Extract.

  • This is the stage in which we read the data,

  • either from memory or from local or remote storage.

  • And we possibly parse the file format

  • that the data is stored in.

  • Perhaps it's compressed.

  • Then there is the T, the Transform stage. In this stage,

  • we perform either domain specific or domain

  • agnostic transformations.

  • So the domain specific transformations

  • are specific to the type of data we're dealing with.

  • So, for instance, text vectorization,

  • image transformation, or temporal video sampling

  • are examples of domain specific transformations.

  • While domain agnostic transformations

  • include things like shuffling of your data

  • during training or batching.

  • That is, combining multiple elements

  • into a single higher dimensional element.

  • And, finally, the last stage of the input pipeline, Loading,

  • pertains to efficiently transferring

  • the data onto the accelerator, which is either a GPU or TPU.

  • What I should point out here is that, traditionally, the input

  • pipeline portion of your machine learning computation

  • happens on a CPU.

  • Because some of the operations are naturally

  • only possible on the CPU, which leaves the GPU and TPU

  • resources available for your machine

  • learning specific computations, such as your ML models.

  • This puts extra pressure

  • on the efficiency with which the input pipeline performs.

  • And the reason for that,

  • which is what I'm trying to illustrate here

  • with the graph,

  • is that over time the rate at which CPUs perform

  • has plateaued, while the computational power of GPUs

  • and TPUs, thanks to recent hardware advances,

  • continues to accelerate at an exponential rate, which

  • opens up a performance gap between the raw CPU

  • and GPU/TPU processing power available in a single machine.

  • The consequence of this

  • could be that the CPU part of your machine learning

  • computation, namely the input pipeline,

  • becomes the bottleneck of your computation.

  • So it's really important that the CPU input pipeline performs

  • as efficiently as it can.

  • So let's take a look at an example of what a tf.data-based

  • input pipeline actually looks like.

  • Here, I'm using an example of what a common image

  • processing input pipeline looks like, sketched below.
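
The slide code itself isn't in the transcript; a minimal sketch of the pipeline being described might look like this (the file pattern, feature schema, and crop size are placeholder assumptions):

```python
import tensorflow as tf

# Extract: read serialized examples from TFRecord files.
files = tf.data.Dataset.list_files("/data/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files)

# Transform: this function is traced into a TensorFlow data flow
# graph, so it runs as graph ops rather than Python.
def preprocess(serialized_example):
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.random_crop(image, size=[224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    return image

dataset = dataset.map(preprocess)

# Batch: combine multiple elements into one higher dimensional element.
dataset = dataset.batch(32)
```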

  • We're first creating a data set using the TFRecordDataset

  • operation.

  • It's a data set constructor that takes a set of file names

  • or a set of file patterns and produces

  • elements that are stored in those files

  • in a sequence-like manner.

  • And once you create a data set, you

  • can chain transformations onto the data set,

  • thus creating new types of data sets.

  • A very common and very powerful

  • one is the map transformation, which

  • allows you to apply an arbitrary processing on the elements

  • of the data set.

  • And this preprocessing can be expressed

  • as a function that ends up being traced

  • using the mechanisms available in TensorFlow,

  • meaning this function that is being used to transform

  • elements of the data set is executed as a data flow

  • graph, which has important implications

  • for the performance and how the runtime can actually

  • execute this function.

  • And the last thing that I illustrate here

  • is the batch transformation, which

  • combines multiple elements of the input data set

  • and produces a single element as an output that

  • has a higher dimension, which is a common practice for training

  • efficiency.

  • Now one thing that's not illustrated here,

  • but actually does happen under the hood inside

  • the tf.data runtime, is that for certain combinations

  • of transformations, tf.data provides more efficient

  • fused implementations.

  • For instance, if a map transformation is followed

  • by a batch transformation, we actually have a highly

  • efficient C++ based implementation

  • for the combination of the two that can give you up to a 2x

  • speedup in the performance of your input pipeline.

  • And that happens kind of magically behind the scenes.
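
For illustration only, that fused kernel was also exposed directly as tf.data.experimental.map_and_batch around the time of this talk (since deprecated, because the fusion is applied automatically); the file name and parse_fn below are placeholders:

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train-0.tfrecord"])  # placeholder file

def parse_fn(record):
    # Placeholder per-element work.
    return tf.strings.length(record)

# Behaves like dataset.map(parse_fn).batch(32), using the fused
# C++ kernel explicitly; tf.data normally does this rewrite for you.
dataset = dataset.apply(
    tf.data.experimental.map_and_batch(parse_fn, batch_size=32))
```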

  • And the important bit that I want to highlight here

  • is that the user doesn't need to worry about it.

  • The user here doesn't really need

  • to do anything with respect to optimizing the performance.

  • They focus on creating an input pipeline with the functional

  • preprocessing in mind.

  • And once you create the data set that you would like,

  • you can pass it into TensorFlow high level API such as Keras

  • or Estimator, which all support the data set abstraction

  • as an input for the data.
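
For instance, with Keras (a minimal sketch, assuming the data set yields (image, label) pairs; the model architecture is a placeholder):

```python
import tensorflow as tf

# Placeholder model; any Keras model works the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(224, 224, 3)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Keras consumes the tf.data.Dataset directly; each element
# should be a (features, labels) batch.
model.fit(dataset, epochs=10)
```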

  • So let's talk a bit more about the input pipeline performance.

  • If you were to implement the input pipeline

  • in a naive fashion, using the CPU for the input pipeline processing,

  • or data preparation, and the GPU or TPU for the training

  • computation, you might end up in a situation

  • like the one illustrated on the slide, where

  • at any given point in time you're

  • only utilizing one of the two resources available to you.

  • And you could probably tell that this seems rather inefficient.

  • Well, a common technique that can

  • be used to make this style of computation more efficient

  • is called software pipelining.

  • And the idea is that while you're

  • working on the current element, or training step,

  • on a GPU or a TPU, you've already

  • started preprocessing data for the next training

  • step on the CPU.

  • And thus, you overlap the computation

  • that happens on the two devices or two

  • resources available to you.

  • Achieving the effect of software pipelining in tf.data

  • is pretty straightforward.

  • All you do is you chain a .prefetch transformation

  • to a particular point in your input pipeline.

  • And the effect of doing that will

  • be that the producer of the data up to that point

  • will be decoupled from the consumer of the data,

  • in this case, the Keras model.

  • And the two will be operating independently,

  • coordinating through an internal buffer.

  • And this will have the desired effect of software

  • pipelining that I illustrated in the previous slide.
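
In code, that is a single extra line at the end of the pipeline (AUTOTUNE, under tf.data.experimental in the TensorFlow version of this talk, lets the runtime pick the buffer size; a fixed integer works too, and the file name is a placeholder):

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train-0.tfrecord"])  # placeholder

# While the consumer (the model) works on element N, the producer
# (the input pipeline) is already preparing element N+1 into a buffer.
dataset = dataset.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
```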

  • Another opportunity for improving

  • the performance of your input pipeline

  • is to parallelize the transformation.

  • So the top part of this diagram illustrates

  • that we're using sequential processing to apply the map

  • transformation to the individual elements of the batch

  • that we are then going to create.

  • But there is no reason that you need

  • to do that unless there would, in effect, be some sort of data

  • or control dependency.

  • But commonly, there is not.

  • And in that case, you can parallelize

  • and overlap the preprocessing of all the individual elements

  • that we're going to create the batch out of.

  • So let's take a look at how we would

  • do that using the tf.data API.

  • And similar to the software pipelining idea,

  • this is pretty straightforward.

  • You simply add a single argument,

  • num_parallel_calls, to the map transformation,

  • as shown in the sketch below.
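
A minimal sketch (the file name and preprocess function are placeholders; AUTOTUNE again delegates picking the parallelism level to the runtime):

```python
import tensorflow as tf

dataset = tf.data.TFRecordDataset(["train-0.tfrecord"])  # placeholder

def preprocess(record):
    # Placeholder per-element work with no cross-element dependency,
    # which is what makes it safe to run the calls in parallel.
    return tf.strings.length(record)

# num_parallel_calls overlaps the preprocessing of individual elements.
dataset = dataset.map(preprocess,
                      num_parallel_calls=tf.data.experimental.AUTOTUNE)
dataset = dataset.batch(32)
```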