[MUSIC PLAYING]

JIAN LI: Hello, everyone. My name's Jian. I'm a software engineer on the TensorFlow team. Today, my colleague Pulkit and I will be talking about the TensorFlow Model Optimization Toolkit. Model optimization means transforming your machine learning models to make them efficient to execute: faster computation as well as lower memory, storage, and battery usage. It is focused on inference rather than training. Because of these benefits, optimization can unlock use cases that are otherwise impossible. Examples include speech recognition, face unlock, object detection, music recognition, and many more.

The Model Optimization Toolkit is a suite of TensorFlow and TensorFlow Lite tools that make it simple to optimize your model. Optimization is an active research area, and there are many techniques. Our goal is to prioritize the ones that are general across model architectures and across various hardware accelerators. There are two major techniques in the toolkit: quantization and pruning. Quantization simulates float calculation in lower bits, and pruning forces zeros into the connections. Today we are going to focus on quantization, and we'll briefly talk about pruning.

Now let's take a closer look at quantization. Quantization is a general term describing techniques that reduce the numerical precision of static parameters and execute the operations in lower precision. Precision reduction makes the model smaller, and lower-precision execution makes the model faster. Now let's dig a bit more into how we perform quantization. As a concrete example, imagine we have a tensor with float values. In most cases, we are wasting most of the representation space on the float number line. If we can find a linear transformation that maps the float values onto int8, we can reduce the model size by a factor of four. Then computations can be carried out between int8 values, and that is where the speedup comes from.
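The linear transformation just described can be sketched in plain Python. This is an illustrative sketch, not the TensorFlow implementation; the helper names (`choose_params`, `quantize`, `dequantize`) are hypothetical.

```python
# Sketch of affine quantization: map float values onto int8 with a scale and
# zero point, then map back. Illustrative only, not a TensorFlow API.

def choose_params(values, qmin=-128, qmax=127):
    """Pick a scale and zero point covering the observed float range."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # range must include zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Float -> int8: divide by scale, shift by zero point, clamp."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q, scale, zero_point):
    """Int8 -> float: undo the shift and rescale."""
    return [(qi - zero_point) * scale for qi in q]

weights = [0.0, 0.5, -1.2, 3.7, 2.1]
scale, zp = choose_params(weights)
q = quantize(weights, scale, zp)
restored = dequantize(q, scale, zp)
```

Each int8 value takes 1 byte instead of the 4 bytes of a float32, which is where the factor-of-four size reduction comes from; the round trip loses at most half a quantization step per value.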
So there are two main approaches to quantization: post-training and during training. Post-training quantization operates on an already trained model and is built on top of the TensorFlow Lite converter. During-training quantization performs additional weight fine-tuning, and since training is required, it is built on top of the TensorFlow Keras API. The different techniques offer a trade-off between ease of use and model accuracy.

The easiest technique to use is dynamic range quantization, which doesn't require any data. There can be some accuracy loss, but we get a two to three times speedup. Because floating-point calculation is still needed for the activations, it's only meant to run on CPU. If we want an extra speedup on CPU or want to run the model on hardware accelerators, we can use integer quantization. It runs a small set of unlabeled calibration data through the model to collect the min-max ranges of the activations. This removes the floating-point calculation from the compute graph, so there is a speedup on CPU. But more importantly, it allows the model to run on hardware accelerators such as DSPs and TPUs, which are faster and more energy efficient than CPUs. And if accuracy is a concern, we can use Quantization Aware Training to fine-tune the weights. It has all the benefits of integer quantization, but it requires training.

Now let's have an operator-level breakdown of post-training quantization. Dynamic range quantization is fully supported, and integer quantization is supported for most of the operators. The missing piece was recurrent neural network support, and that blocked use cases such as speech and language, where context is needed. To unblock those use cases, we have recently added recurrent neural network quantization and built a turnkey solution through the post-training API. An RNN model built with Keras 2.0 can be converted and quantized with the post-training API. This slide shows the end-to-end workflow in the post-training setup.
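The calibration step for integer quantization can be sketched in plain Python. In reality this happens inside the TensorFlow Lite converter; here `run_model` is a hypothetical stand-in for one activation of a real model.

```python
# Sketch of integer-quantization calibration: feed a small set of unlabeled
# examples through the model, record each activation's observed min/max range,
# and turn that range into a fixed int8 scale and zero point.

def run_model(x):
    """Stand-in for one activation of a real model (illustrative only)."""
    return [2.0 * v + 1.0 for v in x]

def calibrate(batches, qmin=-128, qmax=127):
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        activations = run_model(batch)
        lo = min(lo, min(activations))
        hi = max(hi, max(activations))
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # range must cover zero
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    return scale, zero_point

# A few unlabeled calibration batches; no labels are ever needed.
calibration_data = [[-1.0, 0.2], [0.7, 1.5], [-0.3, 0.9]]
scale, zero_point = calibrate(calibration_data)
```

Once the scale and zero point are fixed, the activation can be computed entirely in integers at inference time, which is what removes the floating-point calculation from the compute graph.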
We create the TensorFlow Lite converter and load the saved RNN model. We then set the post-training optimization flags and provide calibration data. After that, we are able to call the convert method to convert and quantize the model. This is the exact same API and workflow as for models without RNNs, so there is no API change for end users.

Let's take a look at the challenges of RNN quantization. Quantization is a lossy transformation. An RNN cell has a memory state that persists across multiple timesteps, so quantization errors can accumulate in both the layer direction and the time direction. An RNN cell contains many calculations, and determining the number of bits and the scales is a global optimization problem. Also, quantized operations are restricted by hardware capabilities: some operations are not allowed on certain hardware platforms. We solved these challenges and created the quantization spec for RNNs. The full spec is quite complicated, and this slide shows the spec by zooming into one of the LSTM gates. As I mentioned, there are many calculations in one cell. To balance performance and accuracy, we keep 8-bit calculations as much as possible and only go to higher bits when required for accuracy. As you can see from the diagram, matrix-related operations are in 8 bits, and element-wise operations are a mixture of 8 bits and 16 bits. And please note, the use of higher bits is only internal to the cell: the input and output activations of the RNN cell are all 8 bits.

Now that we have seen the details of RNN quantization, let's look at the accuracy and the performance. This table shows some published accuracy numbers on a few data sets. It's a speech recognition model that consists of 10 layers of quantized LSTM. As you can see, the integer quantized model has the same accuracy as the dynamic range quantized model, and the accuracy loss is negligible compared with the float case. Also, this is a pruned model, so RNN quantization works with pruning as well.
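The time-direction error accumulation mentioned above can be demonstrated with a toy recurrence (purely illustrative numbers, not an LSTM): quantizing the state once at the end costs at most half a quantization step, but quantizing it at every timestep lets the rounding errors compound through the state.

```python
# Toy demonstration of quantization error accumulating across timesteps.
# The recurrence and the quantization scale are illustrative, not from the spec.

def fake_quant(v, scale=2.0 / 255):
    """Round v onto the nearest int8 grid point (symmetric, illustrative)."""
    q = max(-128, min(127, round(v / scale)))
    return q * scale

def recurrence(steps, quantized):
    """A simple contracting state update; optionally quantize every step."""
    state = 0.0
    for _ in range(steps):
        state = 0.9 * state + 0.05
        if quantized:
            state = fake_quant(state)  # error injected at every timestep
    return state

float_state = recurrence(50, quantized=False)
quant_state = recurrence(50, quantized=True)

error = abs(float_state - quant_state)            # accumulated over 50 steps
one_shot_error = abs(fake_quant(float_state) - float_state)  # quantize once
```

In this toy, the accumulated error ends up several times larger than the one-shot rounding error, which is why RNN quantization needs a carefully chosen per-operation spec rather than naive rounding everywhere.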
As expected, there is a four-times model size reduction, because the static weights are quantized to 8 bits. Performance-wise, there is a two to four times speedup on CPU and a more than 10 times speedup on DSPs and TPUs. Those numbers are consistent with the numbers from other operators.

So here are the main takeaways. TensorFlow now supports RNN/LSTM quantization. It is a turnkey solution through the post-training API. It enables smaller, faster, and more energy-efficient execution that can run on DSPs and TPUs. There are already production models that use this quantization; please check the link for more details on the use cases. Looking forward, our next step will be to expand quantization to other recurrent neural networks, such as GRU and SRU. We also plan to add Quantization Aware Training for RNNs. Now I'll hand it over to my colleague Pulkit. Thank you.

PULKIT BHUWALKA: Thanks, Jian. Hi, my name is Pulkit. I work on the Model Optimization Toolkit. And let's talk about-- clicker doesn't seem to be working. Sorry, can we go back a slide? Yes. Quantization Aware Training. So Quantization Aware Training is a training-time technique for improving the accuracy of quantized models. The way it works is that we introduce some of the errors which actually happen during quantized inference into the training process, and that helps the trainer learn around these errors and produce a more accurate model. Now let's try to get a sense of why this is needed in the first place. We know that quantized models run in lower precision. Because of that, quantization is a lossy process, and that leads to an accuracy drop. And while quantized models are super fast and we want them, nobody wants an inaccurate model. So the goal is to get the best of both worlds, and that's why we have this system.
To get a sense of why these losses get introduced: once we have quantized models, the parameters are in lower precision. So, in a sense, you have coarser information, fewer buckets of information. That is the information representation loss. The other problem is computation loss: when you're actually doing the computations, you're adding two coarse values instead of finer-grained values. Typically, during matrix-multiplication-type operations, even if you're doing them in int8, you accumulate the values in int32 and then rescale them back to int8, so you have that rescaling loss. The other thing is that, generally, when we run these quantized models during inference, there are various inference optimizations that get applied to the graph, and because of that, the training graph and the inference graph can be subtly different, which can also introduce some of these errors.

And how do we recover the lost accuracy? Well, for starters, we try to make the training graph as similar as possible to the inference graph, to remove these subtle differences. And the other is that we actually introduce the errors which happen during inference, so the trainer learns around them and machine learning does its magic. For example, when it comes to mimicking errors, as you can see in the graph here, you go from weights to lower precision. Let's say your weights are in floating point: you go down to int8, and then you go back up to floating point. In that sense, you've mimicked what happens during inference when you're executing at lower precision. Then you do your computation, and because both your inputs and your weights carry int8-level losses, the computation happens as it would during inference. But then, after the computation, you add another fake quant to drop back to lower precision.
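The "down to int8 and back up to float" step can be sketched in a few lines. This is an illustrative sketch of the idea, not the TensorFlow Model Optimization implementation; `fake_quant_weights` is a hypothetical helper.

```python
# Sketch of a "fake quant" on weights: the values stay in floating point, but
# are rounded through the int8 grid and back, so training sees the same
# rounding error that real int8 inference would introduce.

def fake_quant_weights(weights, qmin=-128, qmax=127):
    """Quantize-dequantize: float -> int8 grid -> float (symmetric range)."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / qmax
    quantized = [max(qmin, min(qmax, round(w / scale))) for w in weights]
    return [q * scale for q in quantized]  # back to float for the forward pass

weights = [0.31, -0.74, 0.05, 0.99]
fq = fake_quant_weights(weights)
# fq values are still floats, but each one now sits exactly on an int8 grid
# point, so the training forward pass experiences inference-like rounding.
```

During backpropagation, real implementations treat this rounding as (approximately) the identity so gradients can flow through it; that detail is outside this sketch.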
The other thing is that we model the inference graph. For example, if you noticed in the previous slide, the fake quant operation came after the ReLU activation. This is one of the optimizations that happen during inference: the ReLU gets folded in. And what we do is that, when we're constructing your graph, we make sure that these sorts of optimizations get added in.

And let's look at the numbers. The numbers are pretty good. If you look at the slide, we're almost as close as the float baseline on the various vision models that we've tried. So this is really powerful: you can actually execute a model which gives you nearly as good accuracy and is quantized.
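One way to see why placing the fake quant after the ReLU matters is resolution: after the ReLU the range is non-negative, so all quantization levels cover useful values. This toy sketch (my own illustration, with made-up ranges, not from the talk) compares the two placements.

```python
# Illustrative toy: quantizing after the ReLU (matching the folded inference
# graph) can spend the full grid on the non-negative range, while quantizing
# before the ReLU wastes levels on negative values that the ReLU discards.

def fake_quant(v, lo, hi, levels=256):
    """Quantize-dequantize v onto a uniform grid over [lo, hi]."""
    scale = (hi - lo) / (levels - 1)
    q = round((min(max(v, lo), hi) - lo) / scale)
    return lo + q * scale

def relu(v):
    return max(0.0, v)

x = 2.34567
# Fake quant before the ReLU: the grid must also cover [-4, 0].
before = relu(fake_quant(x, lo=-4.0, hi=4.0))
# Fake quant after the ReLU: the whole grid covers [0, 4], halving the step.
after = fake_quant(relu(x), lo=0.0, hi=4.0)
```

With the post-activation placement, the training graph both matches the folded inference graph and models the finer quantization grid that inference will actually use for this tensor.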