TIM DAVIS: Dance Like enables you to learn
how to dance on a mobile phone.
CHRIS MCCLANAHAN: TensorFlow Lite
can take our smartphone camera and turn it
into a powerful tool for analyzing body poses.
ANDREW SELLE: We had a team at Google
that had developed an advanced model
for doing pose segmentation.
So we were able to take their implementation
and convert it into TensorFlow Lite.
Once we had it there, we could use it directly.
SHIKHAR AGARWAL: Running all the AI and machine learning models
to detect body parts is a very computationally expensive
process, where we need to use the on-device GPU.
The TensorFlow library made it possible for us to
leverage all these resources-- the compute on the device--
and give a great user experience.
ANDREW SELLE: Teaching people to dance
is just the tip of the iceberg.
Anything that involves movement would be a great candidate.
TIM DAVIS: So that means people who have skills
can teach other people those skills.
And AI is just this layer that really just interfaces
between the two things.
When you empower people to teach people,
I think that's really when you have something
that is game changing.
NUPUR GARG: When Tim originally did this,
he did this in slow motion.
We use these models that are running on device in order
to speed up his dance performance to match
the professional dancer.
We also snapshotted a few motions
in order to understand what motions he was doing well
and what he needed to improve on. Applications like this
can be used on device for educational purposes,
not only for dance but for other applications as well.
New cutting edge models are also pushing the boundaries
of what's available on device.
BERT is a method of pre-training language representations,
which obtains state-of-the-art results on a wide array
of natural language processing tasks.
Today, we're launching MobileBERT.
BERT has been completely re-architected to not only
be smaller, but also faster, without losing any accuracy.
Running MobileBERT with TensorFlow Lite
is 4.4 times faster on the CPU than BERT,
and 77% smaller while maintaining the same accuracy.
Let's take a look at a demo application.
So this is a question and answer demo application that
takes snippets from Wikipedia.
It has a user ask questions on a particular topic,
or it suggests a few preselected questions to ask.
It then searches the text corpus for the answers
to the questions, all on device.
We encourage you to take a look at both of these demo
applications at our booth.
So we've worked hard to bring these features of Dance Like
and MobileBERT to your applications
by making it easy to run machine learning models on device.
In order to deploy on device, you first
need to get a TensorFlow Lite model.
Once you have the model, then you
can load it into your application,
transform the data in the way that the model requires it,
run the model, and use the resulting output.
In order to get the model, we've created a rich model
repository.
We've added many new models that can
be utilized in your applications in production right now.
These models include the basic models, such as MobileNet
and Inception.
They also include MobileBERT, Style Transfer, and DeepLab v3.
Once you have your models, you can use our TensorFlow Lite
Support Library that we're also launching this week.
It's a new library for processing and transforming
data.
Right now, it's available for Android for image models.
But we're working on adding support for iOS
as well as additional types of models.
The support library simplifies the pre-processing and
post-processing logic on Android.
This includes functions such as rotating the image 90 degrees
or cropping the image.
We're working on providing auto-generation APIs that
target your specific model and provide simple APIs
tailored to your model.
With our initial launch, as I mentioned,
it'll be focused on image use cases.
However, we're working on expanding the use cases
to a broader range of models.
So let's take a look at how this looks in code.
So before the support library, in order
to add TensorFlow Lite to your application,
you needed to write all of this code,
mostly for data pre-processing and post-processing.
However, with the use of the auto-generation support
libraries, all of this code is simplified
into five lines of code.
The first two lines are loading the model.
Then, you can load your image bitmap data into the model.
And it'll transform the image as required.
Next, you can run your model and it'll
output a map of the string labels
with the float probabilities.
This is how the code will look with auto-generation APIs
that we'll be launching later this year.
One of the biggest frustrations with using models
was not knowing the inputs and outputs of the models.
Now, model authors can include this metadata
with the model so that it's available from the start.
This is an example of a JSON file
that the model author can package into the model.
This will be launched with the auto-generate APIs.
And all of our models in the model garden
will be updated to have this metadata.
In order to make it easy to use all of the models in our model
garden and leverage the TF support library,
we have added example applications for Android
and iOS for all of the models, and made
the applications use the TF Support Library
wherever possible.
We're also continuing to build out our demo applications
on both the Raspberry Pi and the Edge TPU.
So now, what if your use case wasn't
covered, either by our model garden or the support library?
Revisiting all the use cases, there are a ton of use cases
that aren't covered by those specific models
that we listed.
So the first thing you need to do is either find a model
or generate a model yourself from TensorFlow APIs,
either the Keras APIs or the Estimator APIs.
Once you have a SavedModel, which is the unified file
format for 2.0, you can take the model,
pass it through the TensorFlow Lite Converter.
And then you'll get a TensorFlow Lite FlatBuffer model as output.
In code, it's actually very simple.
You can generate your model, save it with one line,
and use two lines of code to take in the same model
and convert it.
We also have APIs that directly convert Keras models.
All the details of those are available on our website.
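As a rough sketch, that flow looks something like this in Python (the model and paths here are just placeholders, not the exact code from the slides):

```python
import tensorflow as tf

# Build a model and save it as a SavedModel (one line to save).
model = tf.keras.Sequential([tf.keras.layers.Dense(10, input_shape=(4,))])
tf.saved_model.save(model, "/tmp/my_saved_model")

# Two lines to take the SavedModel and convert it.
converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
tflite_model = converter.convert()

# Alternatively, convert the Keras model directly.
keras_converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = keras_converter.convert()

# Write the resulting TensorFlow Lite FlatBuffer to disk.
with open("/tmp/model.tflite", "wb") as f:
    f.write(tflite_model)
```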
Over the last few months, we've worked really hard
on improving our converter.
We've added a new converter, which has better debuggability,
including source file location identification.
This means you can know exactly which part of your code cannot
be converted to TF Lite.
We've also added support for Control Flow
v2, which is the default control flow in 2.0.
In addition, we're adding new operations, as well as
support for new models, including
Mask R-CNN, Faster R-CNN, MobileBERT, and Deep Speech v2.
In order to enable this new feature, all you have to do
is set the experimental_new_converter flag to true.
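In Python, that is a single extra line on the converter (a minimal sketch; the SavedModel path is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
# Opt in to the new converter while it is still behind an experimental flag.
converter.experimental_new_converter = True
tflite_model = converter.convert()
```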
We encourage everyone to participate
in the testing process.
We plan to make this new converter the default backend
at some point in the future.
So let's look at the debuggability of this new converter.
So when running this model, it gives an error
that TF reciprocal op is neither a custom op nor a flex op.
Then it provides a stack trace, allowing
you to understand where in the code this operation is called.
And that way, you know exactly what line to address.
Once you have your TF Lite model,
it can be integrated into your application
the same way as before.
You have to load the model, pre-process it, run it,
and use the resulting output.
Let's take a look at a pared-down version of this code in Kotlin.
So the first two lines, you have to load the model.
And then you have to run it through our interpreter.
Once you have loaded the model, then you
need to initialize the input array and the output array.
The input should be a byte buffer.
And the output array needs to contain
all of the probabilities.
So it's a general float array.
Then you can run it through the interpreter
and do any post-processing as needed.
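For reference, the same load, pre-process, run, and post-process flow looks roughly like this with the Python tf.lite.Interpreter (a minimal sketch, not the Kotlin code on the slide; the model path is a placeholder):

```python
import numpy as np
import tensorflow as tf

# Load the TF Lite model and allocate its tensors.
interpreter = tf.lite.Interpreter(model_path="/tmp/model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Pre-process: build an input with the shape and dtype the model expects.
input_data = np.zeros(input_details[0]["shape"],
                      dtype=input_details[0]["dtype"])

# Run the model.
interpreter.set_tensor(input_details[0]["index"], input_data)
interpreter.invoke()

# Post-process the resulting output (e.g. an array of probabilities).
output_data = interpreter.get_tensor(output_details[0]["index"])
```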
To summarize these concepts, you have the converter
to generate your model and the interpreter to run your model.
The interpreter calls into op kernels and delegates,
which I'll talk about in detail in a bit.
And guess what?
You can do all of this in a variety of language bindings.
We've released a number of new, first-class language bindings,
including Swift and Objective-C for iOS,
C# for Unity developers, and C for native developers on any
platform.
We've also seen the creation of a number
of community-owned language bindings
for Rust, Go, and Dart.
Now that we've discussed how TensorFlow Lite works
at a high level, let's take a closer look under the hood.
One of the first hurdles developers
face when deploying models on device is performance.
We've worked very hard, and we're
continuing to work hard on making
this easy out of the box.
We've worked on improvements on the CPU, the GPU, and many types
of custom hardware, as well as adding tooling
to make it easy to improve your performance.
So this slide shows TF Lite's performance
at Google I/O in May.
Since then, we've had a significant performance
improvement across the board, from float models on the CPU
to models on the GPU.
Just to reemphasize how fast this is,
a float model for MobileNet v1 takes 37 milliseconds
to run on the CPU.
If you quantize that model, it takes only 13 milliseconds
on the CPU.
On the GPU, a float model takes six milliseconds.
And on the Edge TPU, in quantized fixed point,
it takes two milliseconds.
Now let's discuss some common techniques to improve the model
performance.
There are five main approaches to do this--
use quantization, use pruning, leverage hardware accelerators,
use mobile-optimized architectures,
and do per-op profiling.
The first way to improve performance
is to use quantization.
Quantization is a technique used to reduce
the precision of static parameters,
such as weights, and dynamic values, such as activations.
For most models, training and inference use float 32.
However, in many use cases, using int 8 or float 16
instead of float 32 improves latency
without a significant decrease to accuracy.
Using quantization enables the use of many hardware accelerators
that only support 8-bit computations.
In addition, it allows additional acceleration
on the GPU, which is able to do two float 16 computations
for one float 32 computation.
We provide a variety of techniques
for performing quantization as part of the model optimization
toolkit.
Many of these techniques can be performed
after training for ease of use.
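For example, post-training quantization is just a couple of extra lines on the converter (a minimal sketch; the float 16 variant for the GPU is shown as a commented-out alternative, and the SavedModel path is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")

# Post-training quantization: the default optimization quantizes the
# static parameters (weights) after training.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# For float 16 quantization (useful for the GPU), you would also set:
# converter.target_spec.supported_types = [tf.float16]

quantized_tflite_model = converter.convert()
```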
The second technique for improving model performance
is pruning.
During model pruning, we set unnecessary weight values
to zero.
By doing this, we're able to remove
what we believe are unnecessary connections between layers
of a neural network.
This is done during the training process
in order to allow the neural network
to adapt to the changes.
The resulting weight tensors will have a lot more zeros,
and therefore the sparsity of the model will increase.
With the addition of sparse tensor representations,
the memory bandwidth of the kernels can be reduced,
and faster kernels can be implemented for the CPU
and custom hardware.
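As a rough sketch, magnitude-based pruning with the Keras pruning API from the Model Optimization Toolkit might look like this (the model, data, and sparsity schedule here are purely illustrative):

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# A small illustrative model and some random stand-in training data.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(8,)),
    tf.keras.layers.Dense(2, activation="softmax"),
])
x = np.random.rand(128, 8).astype("float32")
y = np.random.randint(0, 2, size=128)

# Wrap the model so that unnecessary weights are gradually driven to zero
# during training, ramping up to 50% sparsity over 100 steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.5, begin_step=0, end_step=100)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    model, pruning_schedule=schedule)

pruned_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
pruned_model.fit(x, y, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Strip the pruning wrappers before exporting or converting the sparse model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```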
For those who are interested, Raziel
will be talking about pruning and quantization in-depth
after lunch in the Great American Ballroom.
Revisiting the architecture diagram more closely,
the interpreter calls into op kernels and delegates.
The op kernels are highly optimized for the ARM Neon
instruction set.
And the delegates allow you to access
accelerators, such as the GPU, DSP, and Edge TPU.
So let's see how that works.
Delegates allow parts of the graph, or the entire graph,
to execute on specialized hardware instead
of the CPU.
In some cases, some operations may not
be supported by the accelerator.
So portions of that graph that can be offloaded
for acceleration are delegated.
And remaining portions of the graph are run on the CPU.
However, it's important to note that when the graph is
split across too many delegated partitions,
it can slow down the graph execution in some cases.
The first delegate we'll discuss is the GPU delegate,
which enables faster execution for float models.
It's up to seven times faster than the floating point
CPU implementations.
Currently, the GPU delegate uses OpenCL when possible,
and otherwise OpenGL, on Android,
and it uses Metal on iOS.
One trade-off with delegates is the increase
to the binary size.
The GPU delegate adds about 250 kilobytes to the binary size.
The next delegate is the Qualcomm Hexagon DSP delegate.
In order to support a greater range of devices,
especially mid- to low-tier devices,
we have worked with Qualcomm to develop a delegate
for the Hexagon chipset.
We recommend using the Hexagon delegate on devices running
Android O and below, and the NN API delegate,
which I'll talk about next, on devices running Android P and above.
This delegate accepts integer models
and increases the binary size by about two megabytes.
And it'll be launching soon.
Finally, we have the NN API delegate, or the Neural Network
API.
The NN API delegate supports over 30 ops on Android P,
and over 90 ops on Android Q. This delegate
accepts both float and integer models.
And it's built into Android devices
and therefore has no binary size increase.
The code for all the delegates is very similar.
All you have to do is create the delegate
and add it to the TF Lite options for the interpreter
when using it.
Here's an example with a GPU delegate.
And here's an example with an NN API delegate.
The next way to improve performance
is to choose a model with a suitable
architecture.
For many image classification tasks,
people generally use Inception.
However, when running on device, MobileNet
is 15 times faster and nine times smaller.
And therefore, it's important to investigate
the trade-off between the accuracy and the model
performance and size.
This applies to other applications as well.
Finally, you want to ensure that you're
benchmarking and validating all of your models.
We offer simple tools to enable this,
including per-op profiling, which helps
determine which ops are taking the most computation time.
This slide shows a way to execute the per-op profiling
tool through the command line.
This is what our tool will output when you're doing
per-op profiling for a model.
And it enables you to narrow down your graph execution
and go back and tune performance bottlenecks.
Beyond performance, we have a variety of techniques
relating to op coverage.
The first allows you to utilize TensorFlow ops that are not
natively supported in TF Lite.
And the second allows you to reduce your binary size
if you only want to include a subset of ops.
So one of the main issues that users
face when converting a model from TensorFlow to TensorFlow
Lite is unsupported ops.
TF Lite has native implementations
for a subset of the TensorFlow ops
that are optimized for mobile.
In order to increase op coverage,
we have added a feature called TensorFlow Lite Select, which
adds support for many of the TensorFlow ops.
The one trade-off is that it can increase binary size
by six megabytes, because we're pulling in the full TensorFlow
runtime.
This is a code snippet showing how you
can use TensorFlow Lite Select.
You have to set the target_spec.supported_ops
to include both built-in and select ops.
So built-in ops will be used when possible in order
to utilize optimized kernels.
And select ops will be used in all other cases.
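In Python, that setting looks roughly like this (a minimal sketch; the SavedModel path is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("/tmp/my_saved_model")
# Prefer the optimized built-in kernels, and fall back to select
# TensorFlow ops for anything TF Lite does not support natively.
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS,
    tf.lite.OpsSet.SELECT_TF_OPS,
]
tflite_model = converter.convert()
```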
On the other hand, for TF Lite developers
who deeply care about their binary footprint,
we've added a technique that we call selective registration,
which only includes the ops that are required by the model.
Let's take a look at how this works in code.
You create a custom op resolver that you
use in place of the built-in TF Lite op resolver.
And then in your build file, you specify
your model and the custom op resolver that you created.
And TF Lite will scan over your model
and create a registry of ops contained within your model.
When you build the interpreter, it'll
only include the ops that are required by your model,
therefore reducing your overall binary size.
This technique is similar to the technique that's
used to provide support for custom operations, which
are user-provided implementations for ops
that we do not support as built-in ops.
And next, we have Pete talking about microcontrollers.
PETE WARDEN: As you've seen, TensorFlow
has had a lot of success in mobile devices,
like Android and iOS.
We're in over three billion devices in production.
Oh, I might actually have to switch back to--
let's see-- yes, there we go.
So what is really interesting, though,
is that there are actually over 250 billion
microcontrollers out there in the world already.
And you might not be familiar with them
because they tend to hide in plain sight.
But these are things that you get
in your cars and your washing machines,
in almost any piece of electronics these days.
They are extremely small.
They only have maybe tens of kilobytes of RAM and Flash
to actually work with.
They often don't have a proper operating system.
They definitely don't have anything like Linux.
And they are incredibly resource-constrained.
And you might think, OK, I've only
got tens of kilobytes of space.
What am I going to be able to do with this?
A classic example of using microcontrollers is actually--
and you'll have to forgive me if anybody's phone goes off--
but, OK Google.
That's driven by a microcontroller--
an always-on DSP.
And the reason that it's running on a DSP,
even though you have this very powerful ARM CPU sitting there
is that a DSP only uses tiny amounts of battery.
And if you want your battery to last
for more than an hour or so, you don't want the CPU
on all the time.
You need something that's going to be able to sit there and sip
almost no power.
So the setup that we tend to use for that is you
have a small, comparatively low accuracy model
that's always running on this very low energy DSP that's
listening out for something that might
sound a bit like OK Google.
And then if it thinks it's heard that,
it actually wakes up the main CPU,
which is much more battery hungry,
to run an even more elaborate model to just double check
that.
So you're actually able to get this cascade of deep learning
models to try and detect things that you're interested in.
And this is a really, really common pattern.
Even though you might not be able to do an incredibly
accurate model on a microcontroller or a DSP,
if you actually have this kind of architecture,
it's very possible to do really interesting and useful
applications and still preserve your battery life.
So we needed a framework that would actually
fit into this tens of kilobytes of memory.
But we didn't want to lose all of the advantages we
get from being part of this TensorFlow Lite
ecosystem and this whole TensorFlow ecosystem.
So what we've actually ended up doing
is writing an interpreter that fits
within just a few kilobytes of memory,
but still uses the same APIs, the same kernels, and the same
FlatBuffer file format that you use with regular TensorFlow Lite
for mobile.
So you get all of these advantages, all
of these wonderful tooling things
that Nupur was just talking about that are coming out.
But you actually get to deploy on these really tiny devices.
[VIDEO PLAYBACK]
- Animation.
OK, so now it's ready.
And so it even gives you instructions.
So instead of listening constantly,
which we thought some people don't like because of the privacy
side effects, you have to press the button A here.
And then you speak into this microphone
that I've just plugged into this [INAUDIBLE] port here.
It's just a standard microphone.
And it will display a video and animation and audio.
So let's try it out.
I'm going to press A and speak into this mic.
Yes.
Moo.
Bam.
- You did it.
- Live demo.
- So that's what we wanted to show.
And it has some feedback on the screen.
It shows the version, it shows what we're using.
And this is all hardware that we have now-- battery power,
[INAUDIBLE] power.
- Yes, yes, yes, yes, yes.
This is all battery powered.
[END PLAYBACK]
PETE WARDEN: So what's actually happening
though is it plays an animation when she says the word yes,
because it's recognized that.
This is actually an example of using
TensorFlow Lite for Microcontrollers,
which is able to recognize simple words like yes or no.
And it's really a tutorial on how you can create
something that's very similar to the OK Google model
that we've run on DSPs and phones to recognize short words.
And if you want to recognize other things--
breaking glass, or any other audio noises--
there's a complete tutorial that you can actually grab
and then deploy on these kinds of microcontrollers.
And if you're lucky and you stop by the TensorFlow Lite booth,
we might even have a few of these microcontrollers left
to give away from Adafruit.
So I know some of you out there in the audience already
have that box, but thanks to the generosity of ARM,
we've actually been able to hand some of those out.
So come by and check that out.
So let's see if I can actually--
yes.
And the other good thing about this
is that you can use this on a whole variety
of different microcontrollers.
We have an official Arduino library.
So if you're using the Arduino IDE,
you can actually grab it immediately.
Again, AV-- much harder than AI.
Let's see.
So we'll have the slides available.
So you can grab them.
But we actually have a library that you can grab directly
through the Arduino IDE.
And you just choose it like you would any other library,
if you're familiar with that.
But we also have it available through systems like Mbed,
if you're used to that on the ARM devices.
And through places like SparkFun and Adafruit,
you can actually get boards.
And what this does--
you'll have to trust me because you
won't be able to see the LED.
But if I do a W gesture, it lights up the red LED.
If I do an O, it lights up the blue LED.
Some of you in the front may be able to vouch for me.
And then if I do an L--
see if I get this right--
it lights up the yellow LED.
As you can tell, I'm not an expert wizard.
We might need to click on the play focus.
Let's see this.
Fingers crossed.
Yay.
I'm going to skip past--
oh my god, we have audio.
This is amazing.
It's a Halloween miracle.
Awesome.
So, yes, you can see here--
Arduino.
A very nice video from them.
And they have some great examples out there too.
You can just pick up their board and get
running in a few minutes.
It's pretty cool.
And as I was mentioning with the magic wand,
here we're doing an accelerometer gesture
recognition.
You can imagine there's all sorts of applications for this.
And the key thing here is that this is running
on a coin battery, and can run on a coin battery
for days or even weeks or months,
if we get the power optimization right.
So this is really the key to this ubiquitous ambient
computing that you might be hearing a lot about.
And what other things can you do with these kind of MCUs?
They are really resource limited.
But you can do some great things like simple speech recognition,
like we've shown.
We have a demo at the booth of doing person detection using
a 250 kilobyte MobileNet model that just detects
whether or not there's a person in front of the camera, which
is obviously super useful for all sorts of applications.
We also have predictive maintenance,
which is a really powerful application.
If you think about machines in factories,
even if you think about something like your own car,
you can tell when it's making a funny noise.
And you might need to take it to the mechanics.
Now if you imagine using machine learning models
on all of the billions of machines that are running
in factories and industry all around the world,
you can see how powerful that can actually be.
So as we mentioned, we've got these examples
out there now as part of TensorFlow Lite
that you can run on Arduino, SparkFun, Adafruit,
all these kinds of boards, recognizing yes/no
with the ability to retrain using TensorFlow
for your own words you care about.
Doing person detection is really interesting
because we've trained it for people,
but it will actually also work for a whole bunch
of other objects in the COCO data set.
So if you want to detect cars instead of people,
it's very, very easy to just re-target it for that.
And gesture recognition.
We've been able to train it to recognize
these kinds of gestures.
Obviously, if you have your own things
that you want to recognize through accelerometers,
that's totally possible to do as well.
So one of the things that's really
helped us do this has been our partnership with ARM,
who designed all the devices that we've actually
been showing today.
So maybe the ARM people up front,
if you can just give a wave so people can find you.
And thank you.
They've actually been contributing a lot of code.
And this has been a fantastic partnership for us.
And stay tuned for lots more where that came from.
So that's it for the microcontrollers.
Just to finish up, I want to cover a little bit
about where TensorFlow Lite is going in the future.
So what we hear more than anything is people
want to bring more models to mobile and embedded devices.
So more ops, more supported models.
They want their models to run faster.
So we're continuing to push on performance improvements.
They want to see more integration with TensorFlow
and things like TensorFlow Hub.
And easier usage of all these, which
means better documentation, better examples, better
tutorials.
On-device training and personalization
is a really, really interesting area
where things are progressing.
And we also really care about trying to figure out
where your performance is going, and actually trying
to automate the process of profiling and optimization,
and helping you do a better job with your models.
And to help with all of that, we also
have a brand new course that's launched on Udacity
aimed at TensorFlow Lite.
So please check that out.
So that's it from us.
Thank you for your patience through all
of the technical hiccups.
I'm happy to answer any of your questions.
I think we're going to be heading over to the booth
after this.
So we will be there.
And you can email us at tflite@tensorflow.org
if you have anything that you want to ask us about.
So thank you so much.
I'll look forward to chatting.
[APPLAUSE]