[MUSIC PLAYING]
JIAN LI: Hello, everyone. My name's Jian. I'm a software engineer on the TensorFlow team. Today, my colleague Pulkit and I will be talking about the TensorFlow Model Optimization Toolkit.

Model optimization means transforming your machine learning models to make them efficient to execute. That means faster computation as well as lower memory, storage, and battery usage.
It is focused on inference rather than training. Because of these benefits, optimization can unlock use cases that are otherwise impossible. Examples include speech recognition, face unlock, object detection, music recognition, and many more.
The Model Optimization Toolkit is a suite of TensorFlow and TensorFlow Lite tools that make it simple to optimize your model. Optimization is an active research area and there are many techniques. Our goal is to prioritize the ones that are general across model architectures and across various hardware accelerators.
There are two major techniques in the toolkit: quantization and pruning. Quantization simulates float calculations in lower bits, and pruning forces some of the connections to zero. Today we are going to focus on quantization, and we'll briefly talk about pruning.
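For reference, a minimal sketch of what magnitude-based pruning looks like with the tensorflow_model_optimization package; "model" stands in for an existing Keras model, and the 50% sparsity target is just an illustrative choice:

    import tensorflow_model_optimization as tfmot

    # Wrap an existing Keras model ("model" is a placeholder) so that
    # low-magnitude weights are driven to zero during fine-tuning.
    pruning_schedule = tfmot.sparsity.keras.ConstantSparsity(
        target_sparsity=0.5, begin_step=0)
    pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
        model, pruning_schedule=pruning_schedule)
    # The pruned model is then compiled and fine-tuned as usual, with
    # tfmot.sparsity.keras.UpdatePruningStep() added to the fit() callbacks.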
Now let's take a closer look at quantization. Quantization is a general term for techniques that reduce the numerical precision of static parameters and execute the operations in lower precision. Precision reduction makes the model smaller, and lower-precision execution makes the model faster.

Now let's dig a bit more into how we perform quantization. As a concrete example, imagine we have a tensor with float values. In most cases, we are wasting most of the representation space on the float number line. If we can find a linear transformation that maps the float values onto int8, we can reduce the model size by a factor of four. Then computations can be carried out between int8 values, and that is where the speedup comes from.
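To make the mapping concrete, here is a small NumPy sketch of the kind of affine transformation involved; the tensor values are made up, and the range handling is illustrative rather than the exact TensorFlow Lite scheme:

    import numpy as np

    # An illustrative float tensor.
    x = np.array([-4.2, -1.0, 0.0, 2.5, 6.3], dtype=np.float32)

    # Map the observed [min, max] range linearly onto the int8 range [-128, 127].
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0
    zero_point = int(round(-128 - x_min / scale))

    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)

    # Dequantizing recovers only an approximation of the original values;
    # the difference is the quantization error.
    x_hat = (q.astype(np.float32) - zero_point) * scale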
There are two main approaches to quantization: post-training and during training. Post-training quantization operates on an already trained model and is built on top of the TensorFlow Lite converter. During-training quantization performs additional weight fine-tuning, and since training is required, it is built on top of the TensorFlow Keras API. The different techniques offer a trade-off between ease of use and model accuracy.

The easiest technique to use is dynamic range quantization, which doesn't require any data. There can be some accuracy loss, but we get a two to three times speedup. Because floating point calculation is still needed for the activations, it's only meant to run on CPU.
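In code, the dynamic range path is just the TensorFlow Lite converter with the default optimization flag; a sketch might look like the following, where saved_model_dir is a placeholder path:

    import tensorflow as tf

    # Dynamic range quantization: no calibration data, only the flag.
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()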
If we want an extra speedup on CPU or want to run the model on hardware accelerators, we can use integer quantization. It runs a small set of unlabeled calibration data through the model to collect the min-max ranges of the activations. This removes the floating point calculations from the compute graph, so there is a speedup on CPU. But more importantly, it allows the model to run on hardware accelerators such as DSPs and TPUs, which are faster and more energy efficient than CPUs.
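A rough sketch of that setup with the TensorFlow Lite converter is shown below; model and representative_examples are placeholders for a trained Keras model and a handful of unlabeled input samples:

    import tensorflow as tf

    def representative_dataset():
        # A few unlabeled samples, used only to calibrate activation ranges.
        for sample in representative_examples:
            yield [sample]

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    converter.representative_dataset = representative_dataset
    # Restrict the model to integer-only ops so it can run on
    # integer-only accelerators such as DSPs.
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    tflite_model = converter.convert()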
And if accuracy is a concern, we can use quantization aware training to fine-tune the weights. It has all the benefits of integer quantization, but it requires training.

Now let's have an operator-level breakdown of post-training quantization.
Dynamic range quantization is fully supported, and integer quantization is supported for most of the operators. The missing piece is recurrent neural network support, and that blocks use cases such as speech and language where context is needed. To unblock those use cases, we have recently added recurrent neural network quantization and built a turnkey solution through the post-training API. An RNN model built with Keras 2.0 can be converted and quantized with the post-training API.
This slide shows the end-to-end workflow in the post-training setup. We create the TensorFlow Lite converter and load the saved RNN model. We then set the post-training optimization flags and provide calibration data. After that, we are able to call the convert method to convert and quantize the model. This is the exact same API and workflow as for models without RNNs, so there is no API change for end users.
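As a sketch, the workflow just described might look like this, with rnn_saved_model_dir and calibration_samples as placeholders:

    import tensorflow as tf

    # Create the converter and load the saved RNN model.
    converter = tf.lite.TFLiteConverter.from_saved_model(rnn_saved_model_dir)

    # Set the post-training optimization flag and provide calibration data.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]

    def calibration_dataset():
        for sample in calibration_samples:
            yield [sample]

    converter.representative_dataset = calibration_dataset

    # Convert and quantize the model.
    quantized_tflite_model = converter.convert()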
Let's take a look at the challenges of RNN quantization. Quantization is a lossy transformation. An RNN cell has a memory state that persists across multiple timesteps, so quantization errors can accumulate in both the layer direction and the time direction. An RNN cell also contains many calculations, and determining the number of bits and the scales is a global optimization problem. Finally, quantized operations are restricted by hardware capabilities: some operations are not allowed on certain hardware platforms. We solved these challenges and created a quantization spec for RNNs.
The full spec is quite complicated, and this slide shows the spec by zooming into one of the LSTM gates. As I mentioned, there are many calculations in one cell. To balance performance and accuracy, we keep 8-bit calculations as much as possible and only go to higher bits when required for accuracy. As you can see from the diagram, the matrix-related operations are in 8 bits, and the element-wise operations are a mixture of 8 bits and 16 bits. And please note, the use of higher bits is only internal to the cell; the input and output activations of the RNN cell are all 8 bits.
Now that we have seen the details of RNN quantization, let's look at the accuracy and the performance. This table shows some published accuracy numbers on a few data sets for a speech recognition model that consists of 10 layers of quantized LSTM. As you can see, the integer quantized model has the same accuracy as the dynamic range quantized model, and the accuracy loss is negligible compared with the float baseline. Also, this is a pruned model, so RNN quantization works with pruning as well. As expected, there is a four-times model size reduction, because the static weights are quantized to 8 bits. Performance-wise, there is a two to four times speedup on CPU and a more than 10 times speedup on DSP and TPU. Those numbers are consistent with the numbers from other operators.
So here are the main takeaways. TensorFlow now supports RNN/LSTM quantization. It is a turnkey solution through the post-training API. It enables smaller, faster, and more energy efficient execution that can run on DSP and TPU. There are already production models that use this quantization, and please check the link for more details on the use cases. Looking forward, our next step will be to expand quantization to other recurrent neural networks, such as GRU and SRU. We also plan to add quantization aware training for RNNs. Now I'll hand it over to my colleague Pulkit. Thank you.
PULKIT BHUWALKA: Thanks, Jian. Hi, my name is Pulkit. I work on the Model Optimization Toolkit. And let's talk about-- the clicker doesn't seem to be working. Sorry, can we go back a slide? Yes. Quantization aware training.
So quantization aware training is a training-time technique for improving the accuracy of quantized models. The way it works is that we introduce some of the errors that actually happen during quantized inference into the training process, and that helps the trainer learn around these errors and produce a more accurate model.
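With the Keras API, this amounts to something like the sketch below; model is a placeholder for an existing Keras model, and the training settings are illustrative:

    import tensorflow_model_optimization as tfmot

    # Wrap the model so fake quantization is emulated in the forward pass.
    q_aware_model = tfmot.quantization.keras.quantize_model(model)

    # Fine-tune as usual; the trainer now sees quantization errors and
    # learns around them.
    q_aware_model.compile(optimizer="adam",
                          loss="sparse_categorical_crossentropy",
                          metrics=["accuracy"])
    # q_aware_model.fit(train_data, train_labels, epochs=1)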
Now let's try to get a sense of why this is needed in the first place. We know that quantized models run in lower precision, and because of that, quantization is a lossy process, which leads to an accuracy drop. And while quantized models are super fast and we want them, nobody wants an inaccurate model. So the goal is to get the best of both worlds, and that's why we have this system.
To get a sense of why these losses get introduced: one source is that, once we have quantized models, the parameters are in lower precision. So, in a sense, you have coarser information, fewer buckets of information, and that's where you have representation loss. The other problem is computation loss: when you're actually doing the computations, you are adding two coarse values instead of finer buckets of values. Typically, during matrix-multiplication-type operations, even if you're doing them at int8, you accumulate these values into int32 and then rescale them back to int8, so you have that rescaling loss.
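As a small numerical illustration of that accumulate-then-rescale step (assuming symmetric quantization with zero points of 0, and made-up values and scales):

    import numpy as np

    a = np.array([[12, -45, 7]], dtype=np.int8)      # quantized activations
    w = np.array([[3], [25], [-90]], dtype=np.int8)  # quantized weights
    a_scale, w_scale, out_scale = 0.05, 0.02, 0.1

    # Accumulate the int8 products in int32 to avoid overflow.
    acc = a.astype(np.int32) @ w.astype(np.int32)

    # Rescale the int32 accumulator back to the output's int8 scale;
    # this rounding is one source of quantization loss.
    out = np.clip(np.round(acc * (a_scale * w_scale / out_scale)),
                  -128, 127).astype(np.int8)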
The other thing is that, generally, when we run these quantized models during inference, there are various inference optimizations that get applied to the graph. Because of that, the training graph and the inference graph can be subtly different, which can also potentially introduce some of these errors.

So how do we recover the lost accuracy? Well, for starters, we try to make the training graph as similar as possible to the inference graph to remove those subtle differences. The other is that we actually introduce the errors that happen during inference, so the trainer learns around them and machine learning does its magic.
So for example, when it comes to mimicking errors, as you can see in the graph here, you go from weights to lower precision. Let's say your weights are in floating point: you go down to int8, and then you go back up to floating point. In that sense, you've mimicked what happens during inference when you're executing at lower precision. Then you actually do your computation, and because both your inputs and your weights are at int8 and the losses have been introduced, the computation happens correctly. But then, after the computation, you add another fake quant to drop back to lower precision.
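A minimal sketch of that round trip using TensorFlow's fake-quant op; the weight values and the min/max range are illustrative:

    import tensorflow as tf

    w = tf.constant([[-1.73, 0.42], [0.91, -0.08]], dtype=tf.float32)

    # Round to the 8-bit grid and map back to float, so the forward pass
    # sees the same error that quantized inference would introduce.
    w_fq = tf.quantization.fake_quant_with_min_max_args(
        w, min=-2.0, max=2.0, num_bits=8)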
The other thing is that we model the inference path. For example, if you noticed in the previous slide, the fake quant operation came after the ReLU activation. This is one of the optimizations that happen during inference: the ReLU gets folded into the preceding op. And what we do, when we're actually constructing your graph, is make sure that these sorts of optimizations get added in.
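As a sketch of that placement (the input, weights, bias, and activation range below are all illustrative):

    import tensorflow as tf

    x = tf.random.normal([1, 4])   # placeholder input
    w = tf.random.normal([4, 4])   # placeholder weights
    b = tf.zeros([4])

    # The fake quant goes after the activation, not after the matmul,
    # mirroring how the activation is folded into the preceding op
    # at inference time.
    y = tf.nn.relu(tf.matmul(x, w) + b)
    y = tf.quantization.fake_quant_with_min_max_args(
        y, min=0.0, max=6.0, num_bits=8)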
Now let's look at the numbers. The numbers are pretty good: if you look at the slide, we're almost as close as the float baseline on the various vision models that we've tried. So this is really powerful. You can actually execute a model that is quantized and still gives you nearly as good accuracy.