ROHAN JAIN: Hi, all. I'm Rohan, and I'm here to talk to you about how you can scale up your input data processing with tf.data.

So let's start with a high-level view of your ML training job. Typically, your ML training step has two phases. The first is data preprocessing, where you take the input files and do all kinds of transformations on them to make them ready for the next phase, which is model computation. Data preprocessing happens on the CPU and covers things like cropping images or sampling frames from videos. So if your training is slow, you could have a bottleneck in either of these two places, and I hope the talk on profiling gave you an indication of how to figure out which of the two phases is slowing you down. I'm here to talk to you about the first kind of bottleneck: the data preprocessing bottleneck.

So let's look into what this bottleneck really is. In the last few years we've done a fantastic job building accelerators that do ML operations really fast, so the time it takes to do matrix operations and all the linear algebra is a lot smaller. But the hosts and CPUs that feed data to these accelerators have not been able to keep up with them, and so there ends up being a bottleneck.

We thought we could mitigate this by making the models more complex, but accelerators have constraints on how much RAM they have, and, more importantly, the places where you deploy these models tend to be something like a mobile device, which restricts the amount of complexity you can introduce into your model. So that hasn't really panned out. The second approach people take is to turn to larger batch sizes, but larger batch sizes require a larger amount of preprocessing to assemble each batch, which puts further pressure on the input pipeline. That's why this is becoming an increasingly larger problem within Alphabet and even externally, and I'm going to talk to you about how you can solve it using tf.data.

tf.data is TensorFlow's data preprocessing framework. It's fast, it's flexible, and it's easy to use, and you can learn more about it in our guide. As background for the rest of the talk, I'm going to go through a typical tf.data pipeline, and that will help us in the later stages.

So suppose you have your training data in some TFRecord files. You can start off with a TFRecord dataset over those files. After that, you do your preprocessing, which is typically the bulk of the logic: if it's images, you're doing cropping, maybe flipping, all sorts of things there. After that, you shuffle the data so that you don't train on the order in which the examples appear in the input, which helps with training accuracy. Then you batch it so that the accelerator can make use of vectorized computation. Finally, you want some software pipelining, so that while the model is off working on one batch of data, the preprocessing side can produce the next batch and everything works very efficiently. You can then feed this tf.data dataset to a Keras model and start your training.
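
For concreteness, here is a minimal sketch of such a pipeline. The file pattern, feature names, and `parse_and_preprocess` function are hypothetical, and `model` is assumed to be a compiled Keras model built elsewhere.

```python
import tensorflow as tf

# Hypothetical list of TFRecord files containing the training data.
filenames = tf.io.gfile.glob("/path/to/train-*.tfrecord")

def parse_and_preprocess(serialized_example):
    # Parse the serialized tf.train.Example and apply per-record transformations.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [256, 256])
    image = tf.image.random_crop(image, size=[224, 224, 3])  # random crop augmentation
    image = tf.image.random_flip_left_right(image)           # random flip augmentation
    return image, features["label"]

dataset = tf.data.TFRecordDataset(filenames)   # 1. read the TFRecord files
dataset = dataset.map(parse_and_preprocess)    # 2. preprocessing: the bulk of the logic
dataset = dataset.shuffle(buffer_size=10_000)  # 3. avoid training on the file order
dataset = dataset.batch(32)                    # 4. enable vectorized computation
dataset = dataset.prefetch(1)                  # 5. software pipelining: overlap with the model step

# `model` is assumed to be a compiled tf.keras.Model defined elsewhere.
model.fit(dataset, epochs=10)
```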

Given that sort of basic pipeline, suppose you have a bottleneck. The first thing I'd recommend is to go through our single-host performance guide and try to utilize every trick and transformation available in tf.data to extract the maximum possible performance, so that you're using all the [INAUDIBLE] and whatever. There's excellent information in the guide that we have here, and [INAUDIBLE] gave a great talk at the ML Tokyo Summit, which you can take a look at to learn more about this. So that's the first thing I'd recommend you do.
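
As an illustration of the kind of tuning that guide covers, here is the earlier pipeline with parallel file reads, a parallel map, and autotuned prefetching. This is a hedged sketch: on releases before TensorFlow 2.4 the autotune constant lives at `tf.data.experimental.AUTOTUNE` rather than `tf.data.AUTOTUNE`.

```python
AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older releases

dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=4)        # read several files at once
dataset = dataset.map(parse_and_preprocess, num_parallel_calls=AUTOTUNE)  # parallelize the preprocessing
dataset = dataset.shuffle(buffer_size=10_000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(AUTOTUNE)  # let tf.data pick how far ahead to run
```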

But suppose you've done that and tried all the different recommendations we have there, and you're still bottlenecked on the data preprocessing part. Don't worry, you're not alone; this is very common, and we've increasingly seen it with a lot of internal customers. So now I'm very pleased to present a couple of solutions that we've been working on in the team to help you solve that problem.

The first idea is: why don't we just reuse the computation? Suppose you're playing around with different model architectures; your input preprocessing part remains pretty much the same. If it's expensive and time-consuming, why don't we just do it once, save it, and then every subsequent time just read it back quickly? We noticed a bunch of internal customers, teams within Alphabet, who were trying to do this on their own outside of tf.data, and we decided to bring it into tf.data and make it incredibly fast, flexible, and easy to use. This is what we call Snapshot. The idea is what I explained: you materialize the output of your data preprocessing once, and then you can use it many, many times. This is incredibly useful for playing around with different model architectures and, once you settle on an architecture, for hyperparameter tuning. So you can get that speedup using Snapshot.

Next, I'm going to go through the pipeline we talked about before and see how you can add Snapshot to it to make it faster. That's the original pipeline we had, and notice that there's this preprocessing step, which is expensive. With Snapshot, you just add a snapshot transformation right after that step with a directory [INAUDIBLE]. With this, everything before the snapshot will be written to disk the first time the pipeline runs, and every subsequent time we just read it back and go through the rest of the steps as usual. One thing I'd like to point out is that we place the snapshot at a particular location, before the shuffle, because if it were after the shuffle, everything would get frozen: all the randomization you get from shuffle would be lost, because every subsequent time you'd be reading the same exact order again and again. That's why we introduce it at that stage in the pipeline.
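
In code, that placement looks roughly like the sketch below. It assumes the `tf.data.experimental.snapshot` transformation from the TensorFlow 2.3 timeframe (newer releases also expose it as a `Dataset.snapshot` method), and the snapshot directory and `expensive_preprocess` function are hypothetical.

```python
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(expensive_preprocess)  # expensive but deterministic preprocessing

# Materialize everything above to disk on the first run; later runs read it back.
dataset = dataset.apply(tf.data.experimental.snapshot("/path/to/snapshot_dir"))

# Shuffle, batch, and prefetch stay after the snapshot so the per-epoch
# randomization from shuffle is not frozen into the materialized data.
dataset = dataset.shuffle(buffer_size=10_000)
dataset = dataset.batch(32)
dataset = dataset.prefetch(1)
```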

We developed Snapshot internally; there are internal users and teams that are using it and deriving benefit from it, and now we're bringing it to the open-source world. We published an RFC, which has more information about it and some other technical details. This will be available in TensorFlow 2.3, but I believe it will be available in the [INAUDIBLE] shortly.

Remember, I talked about two ideas. The second idea starts from the fact that not all computation is reusable: suppose you had some randomized crops in there. If you wrote them to disk and read them back, you'd again lose that randomization, so Snapshot is probably not applicable in that scenario. The second idea is to distribute the computation. The initial setup is that you have one host CPU driving a bunch of accelerators, but now you can offload this computation from the host to a cluster. You can then utilize the computational power of all those workers to feed the host, so that you're not bottlenecked on input preprocessing anymore and things move fast.

This is the tf.data service. It's a tf.data feature that allows you to scale your workload horizontally: if you're seeing slowness in your input preprocessing, you can start adding workers and it'll just scale up. It has a master-worker architecture, where the master drives the work for the different workers, and it gives you fault tolerance: if one of the workers fails, you're still good and you can still make progress.

So let's see how you can use the tf.data service for the example that we have. Here, instead of an expensive preprocessing step, let's say you have some randomized preprocessing. This is not snapshottable, because if you snapshot it, you lose the randomization. We'll provide you a binary that allows you to run the data service on whatever cluster manager you like, whether it's Kubernetes or Cloud or something like that.
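
For local experimentation you can also bring the service up in-process, as in the sketch below. This assumes the `tf.data.experimental.service` API of more recent TensorFlow releases (the server constructors have changed across versions); a real deployment would instead run the dispatcher and workers as separate binaries on your cluster manager.

```python
import tensorflow as tf

# Start a dispatcher (the "master") and one worker in this process.
dispatcher = tf.data.experimental.service.DispatchServer()
dispatcher_address = dispatcher.target.split("://")[1]  # strip the grpc:// prefix
worker = tf.data.experimental.service.WorkerServer(
    tf.data.experimental.service.WorkerConfig(dispatcher_address=dispatcher_address))

print("tf.data service running at", dispatcher.target)
```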

Once you have that up and running, you can just add a distribute transformation to your tf.data pipeline and provide the master address. Anything before the distribute transformation will now run on the cluster that you have set up, and everything after it will run on the host. So this allows you to scale up. Again, note that because we are not doing any kind of freezing of the data, we can put this transformation as late as possible in the pipeline; notice that I've put it after the shuffle transformation.
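
Roughly, that pipeline looks like the following sketch. The dispatcher address is hypothetical, and the `processing_mode` string reflects the API from around TensorFlow 2.3; newer releases offer additional modes.

```python
dataset = tf.data.TFRecordDataset(filenames)
dataset = dataset.map(randomized_preprocess)  # random augmentations: must be rerun every epoch
dataset = dataset.shuffle(buffer_size=10_000)

# Everything above runs on the tf.data service workers;
# everything below runs on the training host.
dataset = dataset.apply(tf.data.experimental.service.distribute(
    processing_mode="parallel_epochs",
    service="grpc://dispatcher-address:5000"))  # hypothetical dispatcher (master) address

dataset = dataset.batch(32)
dataset = dataset.prefetch(1)
```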

Like Snapshot, the service has been developed with internal users. They've been using it, and it's been a game-changer in terms of [INAUDIBLE] utilization. Now, again, we're bringing it to you: we published an RFC, which was well received, and this should be available in 2.3 for you to play around with.

To summarize, what did I talk about today? With various trends in hardware and software, we've ended up in a scenario where a lot of machine learning jobs are getting bottlenecked on input preprocessing. I've told you about two solutions that the tf.data team has been working on to help you solve this bottleneck. The first is Snapshot, which allows you to reuse your preprocessing so that you don't have to do it multiple times. The second is the tf.data service, which allows you to distribute this computation to a cluster, so that you get the scale-up you need. I hope you play around with these and give us feedback. And thank you for your time.

[MUSIC PLAYING]
