  • [MUSIC PLAYING]

  • JIAN LI: Hello, everyone.

  • My name's Jian.

  • I'm a software engineer on the TensorFlow team.

  • Today, my colleague Pulkit and I will

  • be talking about the TensorFlow model optimization toolkit.

  • Model optimization means transforming your machine

  • learning models to make them efficient to execute.

  • That means faster computation as well as lower memory,

  • storage, and battery usage.

  • And it is focused on inference instead of training.

  • And because of the above mentioned benefits,

  • optimization can unlock use cases

  • that are otherwise impossible.

  • Examples include speech recognition, face unlock,

  • object detection, music recognition, and many more.

  • The model optimization toolkit is a suite

  • of TensorFlow and TensorFlow Lite tools

  • that make it simple to optimize your model.

  • Optimization is an active research area

  • and there are many techniques.

  • Our goal is to prioritize the ones that

  • are general across model architectures

  • and across various hardware accelerators.

  • There are two major techniques in the toolkit, quantization

  • and pruning.

  • Quantization simulates float calculation in lower bits,

  • and pruning forces some connections to zero.

  • Today we are going to focus on quantization

  • and we'll briefly talk about pruning.

  • Now let's take a closer look at quantization.

  • Quantization is a general term describing technologies

  • that reduce the numerical precision of static parameters

  • and execute the operations in lower precision.

  • Precision reduction makes the model smaller,

  • and a lower precision execution makes the model faster.

  • Now let's dig a bit more into how we perform quantization.

  • As a concrete example, imagine we have

  • a tensor with float values.

  • In most cases, we are wasting most of the representation

  • space in the float number line.

  • If we can find a linear transformation that

  • maps the float value onto int8, we can reduce the model size

  • by a factor of four.

  • Then computations can be carried out between int8 values,

  • and that is where the speed up comes from.
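
The linear mapping from float values onto int8 described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the sample weights, and the choice of an asymmetric min/max range are hypothetical, not the TensorFlow Lite implementation. Storing each value in 8 bits instead of a 32-bit float is where the factor-of-four size reduction comes from.

```python
def choose_qparams(values, qmin=-128, qmax=127):
    """Pick a scale and zero point so the range of `values` maps onto [qmin, qmax]."""
    # The range is widened to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax] with a linear transformation."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [scale * (q - zero_point) for q in q_values]


# Hypothetical float weights, used only for illustration.
weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

Comparing `weights` with `restored` shows the rounding error the transformation introduces, which is at most about one quantization step (`scale`) per value.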

  • So there are two main approaches to do quantization, post

  • training and during training.

  • Post training operates on an already trained model

  • and is built on top of the TensorFlow Lite converter.

  • During-training quantization performs additional weight

  • fine-tuning, and since training is required,

  • it is built on top of the TensorFlow Keras API.

  • Different techniques offer a trade-off

  • between ease of use and model accuracy.

  • The easiest technique to use is dynamic range

  • quantization, which doesn't require any data.

  • There can be some accuracy loss but we get a two to three times

  • speed up.

  • Because floating point calculation

  • is still needed for the activation,

  • it's only meant to run on CPU.

  • If we want extra speed up on CPU or want

  • to run the model on hardware accelerators,

  • we can use integer quantization.

  • It runs a small set of unlabeled calibration data

  • to collect the min-max range on activation.
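
A minimal sketch of the calibration step described above: observe activations over a few unlabeled batches, track the min-max range, and derive int8 quantization parameters from it. The class name, the batch values, and the exact range formula are assumptions for illustration, not the converter's internal code.

```python
class MinMaxCalibrator:
    """Tracks the running min and max of one activation over calibration batches."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, activations):
        # Widen the recorded range to cover this batch.
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))

    def qparams(self, qmin=-128, qmax=127):
        # Include 0.0 in the range so zero is exactly representable.
        lo, hi = min(self.lo, 0.0), max(self.hi, 0.0)
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - lo / scale))
        return scale, zero_point


# Hypothetical activation values from three calibration batches.
calibrator = MinMaxCalibrator()
for batch in ([0.1, 2.0, 0.7], [0.0, 3.5, 1.2], [0.4, 5.1, 0.9]):
    calibrator.observe(batch)

scale, zero_point = calibrator.qparams()
```

Once the range is fixed ahead of time like this, the activation no longer needs a floating point representation at inference.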

  • This removes the floating point calculation

  • in the compute graph, so there is a speed up on CPU.

  • But more importantly, it allows the model

  • to run on hardware accelerators such as DSP and TPU,

  • which are faster and more energy efficient than CPU.

  • And if accuracy is a concern, we can

  • use Quantization Aware Training to fine-tune the weights.

  • It has all the benefits of integer quantization,

  • but it requires training.

  • Now let's have an operator-level breakdown of post-training

  • quantization.

  • Dynamic range quantization is fully supported

  • and integer quantization is supported

  • for most of the operators.

  • The missing piece is the recurrent neural network

  • support, and that blocks use cases

  • such as speech and language where a context is needed.

  • To unblock those use cases, we have recently

  • added a recurrent neural network quantization

  • and built a turnkey solution through the post training API.

  • RNN models built with Keras 2.0 can be converted and quantized

  • with the post training API.

  • This slide shows the end to end workflow

  • in the post training setup.

  • We create the TensorFlow Lite converter

  • and load the saved RNN model.

  • We then set the post training optimization flags

  • and provide calibration data.

  • After that, we are able to call the convert method to convert

  • and quantize the model.

  • This is the exact same API and workflow for models

  • without RNN, so there is no API change for the end users.
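
The workflow on the slide can be sketched roughly as follows, using the public `tf.lite.TFLiteConverter` API. The function names, the saved-model path, and the calibration samples are hypothetical. TensorFlow is imported inside the conversion function so the small dataset helper can be tried on its own.

```python
def representative_dataset(samples):
    """Wrap calibration samples in the generator shape the converter expects.

    Each yielded item is a list with one entry per model input. In practice the
    samples would be float32 NumPy arrays matching the model's input shape.
    """
    def generator():
        for sample in samples:
            yield [sample]
    return generator


def convert_and_quantize(saved_model_dir, calibration_samples):
    """Convert a SavedModel to TensorFlow Lite with post-training quantization."""
    import tensorflow as tf  # imported lazily; requires TensorFlow to be installed

    # Create the converter and load the saved model.
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Set the post-training optimization flag.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Provide calibration data for the activation ranges.
    converter.representative_dataset = representative_dataset(calibration_samples)
    # Convert and quantize; the result is the serialized model as bytes.
    return converter.convert()


# Hypothetical usage (path and samples are placeholders):
# tflite_model = convert_and_quantize("/tmp/saved_rnn_model", calibration_samples)
```

Whether a particular RNN model converts this way depends on the TensorFlow version and on how the model was built, so treat this as a starting point rather than a guaranteed recipe.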

  • Let's take a look at the challenges

  • of the RNN quantization.

  • Quantization is a lossy transformation.

  • An RNN cell has a memory state that persists

  • across multiple time steps, so quantization errors

  • can accumulate in both the layer direction and the time

  • direction.

  • An RNN cell contains many calculations,

  • and determining the number of bits and the scale

  • is a global optimization problem.

  • Also, quantized operations are restricted

  • by hardware capabilities.

  • Some operations are not allowed on certain hardware platforms.

  • We solved the challenge and created the quantization spec

  • for RNN.

  • The full spec is quite complicated,

  • and this slide shows this spec by zooming

  • into one of the LSTM gates.

  • As I mentioned, there are many calculations in one cell.

  • To balance performance and accuracy,

  • we keep eight bit calculations as much as possible

  • and only go to higher bits when required by accuracy.

  • As you can see from the diagram, matrix

  • related operations are in 8 bits, and vector related operations

  • are a mixture of 8 bits and 16 bits.

  • And please note, the use of higher bits

  • is only internal to the cell.

  • The input and output activation for RNN cell are all 8 bits.

  • Now we see the details of RNN quantization.

  • Let's look at the accuracy and the performance.

  • This table shows some published accuracy numbers

  • on a few data sets.

  • It's a speech recognition model that consists

  • of 10 layers of quantized LSTM.

  • As you can see, integer quantized model

  • has the same accuracy as the dynamic range quantized model,

  • and the accuracy loss is negligible

  • compared with the float case.

  • Also, this is a pruned model, so RNN quantization

  • works with pruning as well.

  • As expected, there is a four times model size reduction

  • because static weights are quantized to 8 bits.

  • Performance-wise, there is a two to four times

  • speed up on a CPU and a more than 10 times speed

  • up on DSP and TPU.

  • So those numbers are consistent with the numbers

  • from other operators.

  • So here are the main takeaways.

  • TensorFlow now supports RNN/LSTM quantization.

  • It is a turnkey solution through the post training API.

  • It enables smaller, faster, and more energy

  • efficient execution that can run on DSP and TPU.

  • There are already production models

  • that use the quantization.

  • And please check the link for more details on the use cases.

  • Looking forward, our next step will

  • be to expand quantization to other recurrent neural

  • networks, such as the GRU and SRU.

  • We also plan to add Quantization Aware Training for RNN.

  • Now I'll hand it over to my colleague Pulkit.

  • Thank you.

  • PULKIT BHUWALKA: Thanks.

  • Thanks Jian.

  • Hi, my name is Pulkit.

  • I work on the model optimization toolkit.

  • And let's talk about--

  • clicker doesn't seem to be working.

  • Sorry, can we go back a slide?

  • Yes.

  • Quantization Aware Training.

  • So Quantization Aware Training is a training time technique

  • for improving the accuracy of quantized models.

  • The way it works is that we introduce

  • some of the errors which actually happen

  • during quantized inference into the training process,

  • and that actually helps the trainer learn around

  • these errors and get a more accurate model.

  • Now let's just try to get a sense of why is

  • this needed in the first place.

  • So we know that quantized models,

  • they run in lower precision, and because of that,

  • it's a lossy process, and that leads to an accuracy drop.

  • And while quantized models are super fast and we want them,

  • nobody wants an inaccurate model.

  • So the goal is to kind of get the best of both worlds,

  • and that's why we have this system.

  • To get a sense of why these losses get introduced,

  • one is that we actually have a--

  • once we have quantized models, these parameters

  • are in lower precision.

  • So, in a sense, you have more coarse information, fewer

  • buckets of information.

  • So that's where you have information representation

  • loss.

  • The other problem is that, when you're actually

  • doing these computations, then you have computation loss

  • when you're actually adding two coarse values instead

  • of finer buckets of values.

  • Typically, during matrix multiplication type

  • of operations, even if you're doing it at int8,

  • you accumulate these values to int32,

  • and then you rescale them back to int8,

  • so you have that rescaling loss.
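
The accumulate-then-rescale step can be illustrated with a single dot product. This sketch assumes symmetric quantization (zero point 0) and uses a floating point multiplier for readability. Real integer kernels typically use a fixed-point multiplier instead, and all names and values here are hypothetical.

```python
def quantized_dot(q_a, q_b, scale_a, scale_b, scale_out, qmin=-128, qmax=127):
    """Dot product of two int8 vectors with wide accumulation and int8 output."""
    acc = 0
    for a, b in zip(q_a, q_b):
        acc += a * b  # int8 * int8 products summed in a wide accumulator
    assert -2**31 <= acc < 2**31, "accumulator would overflow int32"

    # The accumulator represents acc * scale_a * scale_b in real units.
    # Rescale it to the output's quantization step and round back to int8.
    multiplier = scale_a * scale_b / scale_out
    q_out = int(round(acc * multiplier))
    return acc, max(qmin, min(qmax, q_out))


# Hypothetical int8 inputs and scales.
q_a = [100, -50, 25, 127]
q_b = [64, 64, -128, 10]
scale_a, scale_b, scale_out = 0.01, 0.02, 0.05

acc, q_out = quantized_dot(q_a, q_b, scale_a, scale_b, scale_out)
exact = acc * scale_a * scale_b   # value carried by the int32 accumulator
approx = q_out * scale_out        # value after rescaling to int8
rescaling_loss = abs(exact - approx)
```

The gap between `exact` and `approx` is the rescaling loss: the wide accumulator holds more information than the int8 output can keep.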

  • The other thing is that, generally,

  • when we run these quantized models during inference,

  • there are various inference optimizations that

  • get applied to the graph, and because of that,

  • the training graph and the inference graph

  • can be subtly different, which also can potentially

  • introduce some of these errors.

  • And how do we recover lost accuracy?

  • Well, for starters, we try to make the training graph as

  • similar as possible to the inference graph

  • to remove these subtle differences.

  • And the other is that we actually

  • introduce these errors which actually happen

  • during inference, so the trainer learns around it

  • and machine learning does its magic.

  • So for example, when it comes to mimicking errors,

  • as you can see in the graph here,

  • you go from weights to lower precision.

  • So let's say if your weights are in floating point,

  • you go down to int8, and then you go back up

  • to floating point.

  • So in that sense, you've actually

  • mimicked what happens during inference when you're

  • executing at lower precision.

  • Then you actually do your computation,

  • and because both your inputs and your weights are at int8

  • and the losses have been introduced,

  • the computation happens correctly.

  • But then after the computation, you

  • add another fake quant to kind of drop

  • back to lower precision.
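
A fake quant operation, as described above, drops a float value to a lower-precision bucket and immediately converts it back to float, so the rest of the graph sees the rounding error. The sketch below assumes a fixed `[lo, hi]` range and 8 bits. The function name and values are hypothetical, and the gradient handling needed during real training is not shown.

```python
def fake_quant(values, lo, hi, num_bits=8):
    """Quantize to 2**num_bits levels in [lo, hi], then dequantize right away."""
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels
    out = []
    for v in values:
        clamped = max(lo, min(hi, v))          # values outside the range saturate
        bucket = round((clamped - lo) / scale)  # integer bucket index, 0..levels
        out.append(lo + bucket * scale)         # back to float, carrying the error
    return out


# Hypothetical float weights; 1.7 lies outside the range and saturates at hi.
weights = [-0.731, 0.002, 0.4999, 1.7]
simulated = fake_quant(weights, lo=-1.0, hi=1.0)
errors = [abs(w - s) for w, s in zip(weights, simulated)]
```

The outputs are still floats, but they can only take 256 distinct values, which mimics what int8 execution does to the same tensor.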

  • The other thing is we model the inference part.

  • So for example, if you noticed in the previous slide,

  • the fake quant operation came after the ReLU activation.

  • So this is one of the optimizations

  • that happen during inference, that the ReLU gets folded in.

  • And what we do is that when we're actually constructing

  • your graph, we make sure that these sorts of optimizations

  • get added in.

  • And let's look at the numbers.

  • So the numbers are pretty good.

  • So if you look at the slide, we're

  • almost as close as the float baseline on various vision

  • models that we've tried.

  • So this is really powerful.

  • You can actually execute a model which

  • gives you nearly as good accuracy and is quantized.