Research with TensorFlow (TF Dev Summit '20)

  • [MUSIC PLAYING]

  • ALEXANDRE PASSOS: Hello, my name is Alex,

  • and I work on TensorFlow.

  • I am here today to tell you all a little bit about how

  • you can use TensorFlow to do deep learning research more

  • effectively.

  • What we're going to do today is we're

  • going to take a little tour of a few TensorFlow features that

  • show you how controllable, flexible, and composable

  • TensorFlow is.

  • We'll take a quick look at those features, some old

  • and some new.

  • And these are not, by far, all the features

  • that are useful for research.

  • But these features let you accelerate

  • your research using TensorFlow in ways

  • that perhaps you're not aware of.

  • And I want to start by helping you control how TensorFlow

  • represents state.

  • If you've used TensorFlow before,

  • and I am sure you have at this point,

  • you know that a lot of our libraries

  • use TF variables to represent state,

  • like your model parameters.

  • And for example, a Keras dense layer

  • has one kernel matrix and an optional bias

  • vector stored in it.

  • And these parameters are updated when you train your model.

  • And part of the whole point of training models

  • is so that we find out what value those parameters should

  • have had in the first place.

  • And if you're making your own layers library,

  • you can control absolutely everything about how

  • that state is represented.

  • But you can also crack open the black box

  • and control how state is represented,

  • even inside the libraries that we give you.

  • So for example, we're going to use this little running example

  • of what if I wanted to re-parametrize a Keras

  • layer so it does some computation to generate

  • the kernel matrix, say to save space

  • or to get the correct inductive bias.

  • The way to do this is to use tf.variable_creator_scope.

  • It is a tool we have that lets you take control of the state

  • creation process in TensorFlow.

  • It's a context manager, and all variables created under it

  • go through a function you specify.

  • And this function can choose to do nothing.

  • It can delegate.

  • Or it can modify how variables are created.

  • Under the hood, this is what a DistributionStrategy's scope

  • usually implies.

  • So it's the same tool that we use

  • to build TensorFlow that we make available to you,

  • so you can extend it.

  • And here, if I wanted to do this re-parametrization of the Keras

  • layer, it's actually pretty simple.

  • First, I define what type I want to use to store those things.

  • Here, I'm using this factorized variable type,

  • which is a tf.Module.

  • tf.Modules are a very convenient type.

  • You can have variables as members,

  • and we can track them automatically

  • for you and all sorts of nice things.

  • And once we define this type, it's

  • really just a left half and right half.

  • I can tell TensorFlow how to use

  • objects of this type as part of TensorFlow computations.

  • And what we do here is we do a matrix multiplication

  • of the left component and the right component.

  • And now that I know how to use this object, I can create it.

  • And this is all that I need to make

  • my own little variable_creator_scope.

  • In this case, I want to peek at the shape.

  • And if I'm not creating a matrix,

  • just delegate to whatever TensorFlow

  • would have done, normally.

  • And if I am creating a matrix, instead

  • of creating a single matrix, I'm going

  • to create this factorized variable that

  • has the left half and the right half.

  • And finally, I now get to just use it.

  • And here, I create a little Keras layer.

  • I apply it.

  • And I can check that it is indeed using

  • my factorized representation.
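
A minimal sketch of the pattern just described, assuming illustrative names such as FactorizedVariable and factorized_creator (these are not the talk's actual class or function names), and whether a given Keras release accepts a non-Variable kernel this way may vary:

    import tensorflow as tf

    # Illustrative names; a sketch of the pattern, not the talk's exact code.
    class FactorizedVariable(tf.Module):
        """Stores a matrix as a left half times a right half."""
        def __init__(self, left, right):
            super().__init__()
            self.left = left      # shape [m, k]
            self.right = right    # shape [k, n]

    # Tell TensorFlow how to use this object in ordinary computations:
    # converting it to a tensor multiplies the two halves back together.
    def _to_tensor(value, dtype=None, name=None, as_ref=False):
        del name, as_ref
        return tf.cast(value.left @ value.right, dtype or value.left.dtype)

    tf.register_tensor_conversion_function(FactorizedVariable, _to_tensor)

    def factorized_creator(next_creator, **kwargs):
        # Peek at the shape; only intercept the rank-2 kernel matrix.
        # Checking the name also keeps the two halves we create below
        # from being intercepted recursively.
        name = kwargs.get("name") or ""
        init = kwargs["initial_value"]
        init = init() if callable(init) else init
        if "kernel" not in name or len(init.shape) != 2:
            return next_creator(**kwargs)   # delegate to the normal behavior
        m, n = init.shape
        k = 4  # assumed small inner dimension
        return FactorizedVariable(tf.Variable(tf.random.normal([m, k])),
                                  tf.Variable(tf.random.normal([k, n])))

    with tf.variable_creator_scope(factorized_creator):
        layer = tf.keras.layers.Dense(10)
        out = layer(tf.zeros([1, 100]))

    print(type(layer.kernel))  # FactorizedVariable, not an ordinary tf.Variable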

  • This gives you a lot of power.

  • Because now, you can take large libraries of code

  • that you did not write and do dependency injection

  • to change how they behave.

  • Probably if you're going to do this at scale,

  • you might want to implement your own layer

  • so you can have full control.

  • But it's also very valuable for you

  • to be able to extend the ones that we provide you.

  • So use tf.variable_creator_scope to control the state.

  • A big part of TensorFlow and why we

  • use these libraries to do research at all,

  • as opposed to just writing plain Python code,

  • is that deep learning is really dependent on very

  • fast computation.

  • And one thing that we're making more and more

  • easy to use in TensorFlow is our underlying compiler, XLA, which

  • we've always used for TPUs.

  • But now, we're making it easier for you to use

  • for CPUs and GPUs, as well.

  • And the way we're doing this is using tf.function with

  • the experimental_compile=True annotation.

  • What this means is if you mark a function as a function

  • that you want to compile, we will compile it,

  • or we'll raise an error.

  • So you can trust the code you write

  • inside a block is going to run as quickly as if you had

  • handwritten your own fused TensorFlow kernel for CPUs,

  • or a fused CUDA kernel for GPUs, and all the machinery, yourself.

  • But you get to write high level, fast, Python TensorFlow code.

  • One example where you might easily

  • find yourself writing your own little custom kernel

  • is if you want to do research on activation functions, which

  • is something that people want to do.

  • Activation functions-- this is a terrible one,

  • but they tend to look a little like this.

  • They have a bunch of nonlinear operations

  • and a bunch of element-wise things.

  • But in general, they apply lots of

  • little element-wise operations to each element of your vector.

  • And these things, if you try to run them

  • in the normal TensorFlow interpreter,

  • they're going to be rather slow, because they're

  • going to do a new memory allocation and a copy of things

  • around for every single one of these little operations.

  • Whereas if you were to make a single, fused kernel,

  • you just write a single thing for each coordinate that

  • does the exponentiation, and logarithm, and addition,

  • and all the things like that.

  • But what we can see here is that if I take this function,

  • and I wrap it with experimental_compile=True,

  • and I benchmark running a compiled version versus running

  • a non-compiled version, on this tiny benchmark,

  • I can already see a 25% speedup.
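
A minimal sketch of this pattern with a made-up activation (not the one from the talk); note that the flag was later renamed jit_compile in newer TensorFlow releases:

    import tensorflow as tf

    # A made-up element-wise activation: several small ops that XLA can fuse
    # into one kernel instead of allocating an intermediate buffer per op.
    @tf.function(experimental_compile=True)
    def my_activation(x):
        return tf.math.log1p(tf.exp(-tf.abs(x))) + tf.maximum(x, 0.0) * tf.tanh(x)

    x = tf.random.normal([1024, 1024])
    y = my_activation(x)   # compiled with XLA on first call, fused thereafter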

  • And it's even better than this, because we

  • see speedups of this sort of magnitude or larger,

  • even on fairly large models, including BERT.

  • Because in large models, we can fuse more computation

  • into the linear operations, and your reductions,

  • and things like that.

  • And this can get you compounding wins.

  • So try using experimental_compile=True

  • for automatic compilation in TensorFlow.

  • You should be able to apply it to small pieces of code

  • and replace what you'd normally have to do with fused kernels.

  • So, do you know what type of research code that a lot

  • of people rely on has lots of very small element-wise

  • operations and would greatly benefit from the fusion

  • powers of a compiler?

  • I think it's optimizers.

  • And a nice thing about doing your optimizer research

  • in TensorFlow is that Keras makes it very easy

  • for you to implement your own stochastic-gradient-descent-style

  • optimizer.

  • You can make a class that subclasses

  • the Keras optimizer class and overrides three methods.

  • You can define your initialization

  • where you compute your learning rate or whatever,

  • in your __init__.

  • You can create any accumulator variables, like your momentum,

  • or higher order powers of gradients, or anything else

  • you need, in _create_slots.

  • And you can define how to apply this optimizer

  • update to a single variable.

  • Once you've defined those three things,

  • you have everything TensorFlow needs

  • to be able to run your custom optimizer.

  • And normally, TensorFlow optimizers

  • are written with hand-fused kernels, which

  • can make the code very complicated to read,

  • but ensures that they run very quickly.

  • What I'm going to show here is an example

  • of a very simple optimizer-- again, not

  • a particularly good one.

  • This is a weird variation that has

  • some momentum and some higher order powers,

  • but it doesn't train very well.

  • However, it has the same sorts of operations that you

  • would have on a real optimizer.

  • And I can just write them as regular TensorFlow operations

  • in my model.

  • And by just adding this line with experimental_compile=True,

  • I can get it to run just as fast as a hand-fused kernel.
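
A minimal sketch of that recipe, assuming an illustrative momentum-style optimizer (not the talk's exact update rule), with the per-variable apply step compiled:

    import tensorflow as tf

    # Illustrative momentum-style optimizer; the update rule and names are
    # assumptions, not the optimizer shown on the talk's slides.
    class MyMomentumOptimizer(tf.keras.optimizers.Optimizer):
        def __init__(self, learning_rate=0.01, momentum=0.9,
                     name="MyMomentumOptimizer", **kwargs):
            super().__init__(name, **kwargs)
            # 1) Initialization: declare hyperparameters.
            self._set_hyper("learning_rate", learning_rate)
            self._set_hyper("momentum", momentum)

        def _create_slots(self, var_list):
            # 2) One accumulator ("slot") per variable, like classic momentum.
            for var in var_list:
                self.add_slot(var, "momentum")

        @tf.function(experimental_compile=True)  # fuse the element-wise update
        def _resource_apply_dense(self, grad, var, apply_state=None):
            # 3) How to apply the update to a single variable.
            lr = self._get_hyper("learning_rate", var.dtype)
            mom = self._get_hyper("momentum", var.dtype)
            acc = self.get_slot(var, "momentum")
            acc.assign(mom * acc + grad)
            var.assign_sub(lr * acc)

        def get_config(self):
            config = super().get_config()
            config.update({
                "learning_rate": self._serialize_hyperparameter("learning_rate"),
                "momentum": self._serialize_hyperparameter("momentum"),
            })
            return config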

  • And the benchmarks are written here.

  • It was over a 2x speed up.

  • So this can really matter when you're

  • doing a lot of research that looks like this.

  • So, Keras optimizers plus compilation

  • let you experiment really fast with fairly intricate things,

  • and I hope you will use this to accelerate your research.

  • The next thing I want to talk about is vectorization.

  • It's, again, super important for performance.

  • I'm sure you've heard, at this point,

  • that Moore's Law is over, and we're no longer

  • going to get a free lunch in terms

  • of processors getting faster.

  • The way we're making our machine learning models faster

  • is by doing more and more things in parallel.

  • And this is great, because we get to unlock

  • the potential of GPUs and TPUs.

  • This is also a little scary, because now,

  • even though we know what we want to do to a single, little data

  • point, we have to write these batched operations, which

  • can be fairly complicated.

  • In TensorFlow, we've been developing,

  • recently, automatic vectorization for you,

  • where you can write the element-wise code that you want

  • to write and get the performance of the batched computation

  • that you want.

  • So the working example I'm going to use here is Jacobians.

  • If you're familiar with TensorFlow's gradient tape,

  • you know that tape.gradient computes

  • the gradient of a scalar, not the gradient

  • of a vector-valued or a matrix-valued function.

  • And if you want the Jacobian of a vector-valued

  • or matrix-valued function, you can just

  • call tape.gradient many, many times.

  • And here, I have a very, very simple function

  • that is just the exponential of the square of a matrix.

  • And I want to compute the Jacobian.

  • And I do this by writing this double

  • for loop, where for every row, for every column,

  • I compute the gradient with respect to the row and column

  • output, and then stack the results together

  • to get my higher-order Jacobian tensor.

  • This is fine.

  • This has always worked.

  • However, you can replace these explicit loops with

  • tf.vectorized_map.

  • And one, you get a small readability win.

  • Because now we're saying that, yes, you're

  • just applying this operation everywhere.

  • But also, you get a very big performance win.

  • And this version that uses tf.vectorized_map is

  • substantially faster than the version that doesn't use it.
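
A hedged sketch of the loop-to-vectorized_map pattern with a generic per-element function (not the exact Jacobian code from the talk):

    import tensorflow as tf

    # Illustrative per-element function, written as if it received one example.
    def per_element(x):
        return tf.reduce_sum(tf.exp(x) * tf.tanh(x))

    batch = tf.random.normal([1000, 64])

    # Slow: a Python loop dispatching ops once per element.
    looped = tf.stack([per_element(e) for e in tf.unstack(batch)])

    # Fast: the same element-wise code, batched automatically.
    vectorized = tf.vectorized_map(per_element, batch)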

  • But of course, you don't want to have

  • to write this all the time, which

  • is why, really, for Jacobians, we implemented it directly

  • in the gradient tape.

  • And you can call tape.jacobian to get the Jacobian computed

  • for you.

  • And if you do this, it's over 10 times faster on this example

  • than doing the manual loop yourself because we can

  • do the automatic vectorization.
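
A minimal sketch of the tape.jacobian path, using the running example of the element-wise exponential of the square of a matrix:

    import tensorflow as tf

    x = tf.random.normal([5, 5])

    with tf.GradientTape() as tape:
        tape.watch(x)
        y = tf.exp(tf.square(x))   # element-wise exp of the square, as above

    # tape.jacobian vectorizes the per-entry gradients internally,
    # replacing the manual double loop over rows and columns.
    j = tape.jacobian(y, x)        # shape [5, 5, 5, 5]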

  • But the reason why I opened this black box

  • and showed you the previous slide

  • is so you can know how to implement something that is not

  • a Jacobian but is like a Jacobian, yourself--

  • and how you can use TensorFlow's automatic vectorization

  • capabilities together with the other tools you

  • have in your research to make you more productive.

  • So remember to use automatic vectorization,

  • so you can write short code that actually runs really fast.

  • And let us add the batch dimensions ourselves.

  • And here is another interesting performance point.

  • Because with TensorFlow, we have always

  • had the big, rectangular array or hyper-array, the tensor,

  • as the core data structure.

  • And tensors are great.

  • In the world we live in today,

  • where we need to leverage as much parallelism

  • as we can to make our models go fast, operations on tensors

  • tend to be naturally highly parallel by default.

  • It's a very intuitive API to program the capabilities

  • of these supercomputers we have today, with many GPUs

  • and TPUs wired together.

  • And as long as you can stay within this tensor box,

  • you are happy.

  • You get peak performance.

  • And everything's great.

  • However, as deep learning becomes

  • more and more successful, and as we

  • want to do research on more and more different types of data,

  • we start to want to work with things that don't really

  • look like these big, rectangular arrays--

  • a structure that is ragged and has a different shape.

  • And in TensorFlow, we've been recently working

  • really hard at adding native support for ragged data.

  • So here's an example.

  • Pretend it's 10 years ago and you have a sentence.

  • You have a bunch of sentences.

  • They all have different lengths.

  • And you want to turn them into embedding so you can feed them

  • into a neural network.

  • So what you want to do here is you're

  • going to start with all the words in that sentence.

  • You're going to look up their index in your vocabulary table.

  • Then you're going to use the index to look up

  • a row in an embedding table.

  • And finally, you want to average the embeddings of all

  • the words in a sentence to get an embedding for each sentence,

  • which you can then use in the rest of your model.

  • And even though we're working with ragged data

  • here, because all the sentences have different lengths, if you

  • think about the underlying operations that we're

  • doing here, most of them don't actually

  • have to care about this raggedness.

  • So we can make this run very efficiently

  • by decomposing this representation

  • into two things--

  • a tensor that concatenates across the ragged dimension

  • and a separate tensor that tells you

  • how to find the individual ragged elements in there.

  • And once you have this representation,

  • it's very easy and efficient to do all the computations

  • that we wanted to do to solve the task

  • from the previous slide.

  • You have always been able to do this manually in TensorFlow.

  • We've always had the features and capabilities for you

  • to do this.

  • Now, with tf.RaggedTensor, we're taking over

  • the management of this from you and just giving you an object,

  • a ragged tensor, that looks like a tensor.

  • It can be manipulated like a tensor,

  • but is represented like this.
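
A small sketch of that representation, with made-up values:

    import tensorflow as tf

    # Flat values concatenated across the ragged dimension, plus row boundaries.
    rt = tf.RaggedTensor.from_row_splits(
        values=[3, 1, 4, 1, 5, 9, 2],
        row_splits=[0, 4, 4, 7])
    print(rt)  # <tf.RaggedTensor [[3, 1, 4, 1], [], [5, 9, 2]]>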

  • And so it has ragged shapes and can

  • represent much more flexible data structures

  • than you could otherwise.

  • So let's go over a little bit of a code example, here.

  • Here is my data, same one from the previous slides.

  • It's just a Python list.

  • And I can take this Python list and turn it

  • into a ragged tensor by using tf.ragged.constant.

  • And the right thing is going to happen.

  • TensorFlow is going to automatically concatenate

  • across the ragged dimension and keep this array of indices

  • under the hood.

  • Then I can define my vocabulary table and do my lookup.

  • And here, I'm showing you how to do your lookup or any operation

  • on a ragged tensor where that operation hasn't actually

  • been rewritten to support raggedness.

  • You can always use tf.ragged.map_flat_values

  • to access the underlying values of your ragged tensor,

  • and apply operations in them.

  • Then we declare an embedding matrix.

  • Also, many of the TensorFlow core operations

  • have been adapted to work with ragged tensors.

  • So in this case, if you want to do a tf.gather

  • to find out the correct rows of the embedding

  • matrix for each word, you can just

  • apply your tf.gather on the ragged tensor,

  • and the right thing will happen.

  • And similarly, if you want to reduce and average out

  • the ragged dimension, it's very easy to do.

  • You can just use the standard tf.reduce_mean.

  • And the nice thing is that, at this point, because we've

  • reduced out the ragged dimension,

  • we have no ragged dimension.

  • And we just have a dense tensor that

  • has the original shape you expected to have.
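
Putting those steps together, a hedged sketch with a made-up vocabulary and embedding size (the lookup-table setup here is an assumption, not necessarily how the talk's slides built it):

    import tensorflow as tf

    sentences = [["deep", "learning", "is", "fun"],
                 ["tensorflow", "is", "fast"],
                 ["hello"]]

    # Concatenates across the ragged dimension and keeps the row indices for you.
    words = tf.ragged.constant(sentences)

    # Illustrative vocabulary table: string -> integer id.
    keys = ["deep", "learning", "is", "fun", "tensorflow", "fast", "hello"]
    vocab = tf.lookup.StaticVocabularyTable(
        tf.lookup.KeyValueTensorInitializer(keys, tf.range(len(keys), dtype=tf.int64)),
        num_oov_buckets=1)

    # The lookup op doesn't know about raggedness, so map it over the flat values.
    word_ids = tf.ragged.map_flat_values(vocab.lookup, words)

    embedding_matrix = tf.Variable(tf.random.normal([len(keys) + 1, 16]))

    # tf.gather has been adapted to ragged indices: the result is ragged too.
    word_embeddings = tf.gather(embedding_matrix, word_ids)

    # Reducing out the ragged dimension returns an ordinary dense tensor.
    sentence_embeddings = tf.reduce_mean(word_embeddings, axis=1)  # shape [3, 16]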

  • And I think this is really important, because now, it's

  • much easier, much more intuitive and affordable for you

  • to work with data that doesn't necessarily

  • look like the big, rectangular data

  • that TensorFlow is optimized for.

  • And yet, it lets you get most of the performance

  • that you'd get with the big, rectangular data.

  • It's a win-win situation, and I'm really looking forward

  • to seeing what interesting applications you all

  • are going to work on that use and exploit

  • this notion of raggedness.

  • So please, play with tf.ragged.

  • Try it out.

  • It's very exciting.

  • So next up, we're going to go over

  • a particular, interesting example of research

  • done with TensorFlow.

  • And Akshay here, who is a PhD student at Stanford University,

  • is going to come and tell us all about convex optimization

  • layers in TensorFlow.

  • Thank you.

  • [MUSIC PLAYING]
