Applied AI at The Coca-Cola Company (TensorFlow Dev Summit 2018)

  • ♪ (music) ♪

  • Alright.

  • So yes, I'm Patrick. I'm a Solutions Strategist from Coca-Cola.

  • Today I'm going to share with you how we're using TensorFlow

  • to support some of our largest, most popular

  • digital marketing programs in North America.

  • So, we're going to take the TensorFlow Dev Summit

  • off to a marketing tangent for a minute,

  • before we come back.

  • Alright, so as a background:

  • What is proof of purchase and what is its relationship to marketing?

  • As an example, back in the day,

  • folks could clip the barcodes off their cereal boxes

  • and then mail these barcodes back into the cereal company

  • to receive a reward--

  • some kind of coupon or prize back through the mail.

  • And this is basic loyalty marketing.

  • Brands--in this case, the cereal company--

  • rewarding consumers who purchase,

  • and at the same time

  • opening up a line of communication between the brand and the consumer.

  • Over the last 15-odd years of marketing digitization,

  • this concept has evolved into digital engagement marketing--

  • rewarding consumers in the moment, in real time,

  • through web and mobile channels.

  • But often, proof of purchase is still an important component of that experience.

  • We have a very active digital engagement marketing program at Coca-Cola.

  • Through proof of purchase

  • our consumers can earn a magazine subscription,

  • or the chance to win a cruise,

  • or a vintage vending machine.

  • And this is what proof of purchase looks like at Coke.

  • Underneath our bottle caps and inside those cardboard fridge packs

  • that you can buy at the grocery store

  • we've printed these 14-character product pincodes.

  • These are unique to every product

  • and these are what our consumers enter into our promotions.

  • You can enter these in by hand,

  • but on your mobile device, you can scan them.

  • This had been the holy grail of marketing IT at Coke for a long time.

  • We looked at both commercial

  • and open source optical character recognition software, OCR,

  • but it could never read these codes very well.

  • The problem has to do with the fidelity of the printed code.

  • So, these are 4 x 7 dot matrix printed.

  • The printer head is about an inch off the surface

  • of the cap and fridge pack,

  • and these things are flying underneath that printer head at a very rapid rate.

  • So this creates a lot of visual artifacts--

  • things like character skew and pixel drift--

  • things that normal OCR can't handle very well.

  • We knew that if we wanted to unlock this experience for our consumers,

  • we were going to have to build something from scratch.

  • When I look at these codes,

  • a couple of characteristics jump out at me.

  • We're using a small alphabet-- let's say, ten characters--

  • and there's a decent amount of variability in the presentation of these characters.

  • This reminds me of MNIST--

  • the online database of 60,000 handwritten digit images.

  • And Convolutional Neural Networks, or ConvNets,

  • are particularly good at extracting the text from these images.

  • I'll probably tell you all something you already know,

  • but here we go.

  • ConvNets work by taking an image and initially breaking it down

  • into many smaller pieces,

  • and then detecting very granular features within these pieces--

  • things like edges and textures and colors.

  • And these very granular feature activations are pooled up

  • into a more general feature layer,

  • and that's filtered, and those feature activations are pooled up, and so on,

  • until the output of the neural net

  • is run through a softmax function,

  • which creates a probability distribution

  • of a likelihood that a set of objects exist within the image.

  • But ConvNets have a really nice property

  • in that they handle the translation-invariant nature

  • of the images very well.

  • That means, from our perspective,

  • they can handle the tilt and twist of a bottle cap held in someone's hand.

  • It's perfect.

  • So, this is what we're going to use, we're going to move forward.
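
The talk shows no code, but a minimal sketch of this kind of multi-headed ConvNet in TensorFlow/Keras might look like the following; the input size, layer widths, and one-softmax-head-per-position layout are illustrative assumptions, not the production architecture:

```python
# A minimal sketch (not the production model) of a ConvNet that reads a
# fixed-length 14-character code: shared convolutional features feed one
# softmax head per character position, each a probability distribution
# over the alphabet.
import tensorflow as tf
from tensorflow.keras import layers

NUM_POSITIONS = 14  # characters per pincode
ALPHABET_SIZE = 10  # "a small alphabet," per the talk

inputs = tf.keras.Input(shape=(64, 256, 1))          # normalized grayscale crop
x = layers.Conv2D(32, 3, activation="relu")(inputs)  # granular features: edges, textures
x = layers.MaxPooling2D()(x)                         # pool activations up
x = layers.Conv2D(64, 3, activation="relu")(x)       # more general features
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
outputs = [
    layers.Dense(ALPHABET_SIZE, activation="softmax", name=f"char_{i}")(x)
    for i in range(NUM_POSITIONS)
]
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```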

  • So, now we need to build our platform, and that begins with training--

  • the beating heart of any applied AI solution.

  • And we knew that we needed high-quality images

  • with accurate labels of the codes within those images,

  • and we likely needed a lot of them.

  • We started by generating a synthetic data set

  • of randomly generated strings

  • that were superimposed over blank bottle cap images,

  • which were then, in turn, superimposed over random backgrounds.

  • This gave us a base for transfer learning in the future,

  • once we created our real-world data set.
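
As a rough illustration of that synthetic-generation step (not Coca-Cola's actual tooling), one could superimpose a random string on a blank cap image and paste it onto a random background with Pillow; the file paths and the dot-matrix font here are hypothetical:

```python
# An illustrative sketch of the synthetic-data idea: a random 14-character
# string drawn over a blank cap image, which is then pasted onto a random
# background. File paths and the font are hypothetical stand-ins.
import random
import string
from PIL import Image, ImageDraw, ImageFont

ALPHABET = string.digits  # stand-in for the real pincode alphabet

def make_synthetic_example(cap_path="blank_cap.png", bg_path="background.jpg"):
    code = "".join(random.choices(ALPHABET, k=14))    # random pincode string
    cap = Image.open(cap_path).convert("RGBA")
    font = ImageFont.truetype("dotmatrix.ttf", 24)    # hypothetical font file
    ImageDraw.Draw(cap).text((20, 40), code, font=font, fill="red")
    background = Image.open(bg_path).convert("RGBA")
    offset = (random.randint(0, 100), random.randint(0, 100))
    background.paste(cap, offset, mask=cap)           # alpha-composite the cap
    return background.convert("RGB"), code            # image plus its label
```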

  • We created that real-world data set by doing a production run of caps and fridge packs

  • out of printing facilities,

  • and then distributing those to multiple third-party suppliers,

  • along with some custom tools that we created

  • to allow them to scan a cap and then label it with a pincode.

  • But a really important component to this process

  • was an existing pincode validation service

  • that we've had in production for a long time to support our programs.

  • So, any time a trainer labeled an image,

  • we'd send that label through our validation service,

  • and if it was a valid pin code,

  • we knew we had an accurate label.
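
Sketched in Python, that gating step might look like this; the endpoint URL and response shape are hypothetical stand-ins for the real validation service:

```python
# Sketch of the labeling gate: a trainer's label only enters the training
# set if the existing pincode validation service accepts it. The endpoint
# and response shape here are hypothetical.
import requests

VALIDATION_URL = "https://example.com/pincode/validate"  # hypothetical

def label_is_valid(pincode: str) -> bool:
    resp = requests.post(VALIDATION_URL, json={"pincode": pincode}, timeout=5)
    return resp.ok and resp.json().get("valid", False)

def accept_example(image_path: str, label: str, dataset: list) -> None:
    # A valid pin implies an accurate label, so only then keep the pair.
    if label_is_valid(label):
        dataset.append((image_path, label))
```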

  • So, this gets our model trained, and now we need to release it to the wild.

  • We had some pretty aggressive performance requirements.

  • We wanted one second average processing time,

  • we wanted 95% accuracy at launch,

  • but we also wanted to host the model remotely for the Web,

  • and embed it natively on mobile devices to support mobile apps.

  • So, this means that our model has to be small--

  • small enough to support over-the-air updates

  • as the model improves over time.

  • And to help us improve that model over time

  • we created an active learning UI (user interface)

  • that allows our consumers to train the model

  • once it's in production.

  • And that's what this looks like.

  • So, if I, as a consumer, scan a cap,

  • and the model cannot infer a valid pincode,

  • it sends down to the UI a per-character confidence

  • of every character at every position.

  • And this can be used to render a screen

  • much like what you see here.

  • So I, as a user,

  • am only directed to address those particularly low confidence characters.

  • I see a couple of red characters there-- I tap them, it brings up a keyboard,

  • I correct them, then I'm entered into my promotion.

  • It's a good user experience for me.

  • I scan a code and I'm only a few taps away from being entered into a promotion,

  • but on the back end, we now have extremely valuable data for training,

  • because we have the image that created the invalid inference to begin with,

  • as well as the user-corrected label

  • that they needed to correct to get into the promotion.

  • So, we can throw this into the hopper

  • for future rounds of training to improve the model.
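
A hedged sketch of how per-character confidences could drive that screen, assuming the model emits a (14, alphabet-size) probability matrix; the 0.5 highlight threshold is an assumption:

```python
# Sketch of the active-learning handoff: from a per-position probability
# matrix, take the argmax string and flag the low-confidence positions for
# the user to tap and correct. The threshold is hypothetical.
import numpy as np

LOW_CONFIDENCE = 0.5  # hypothetical cutoff for rendering a character in red

def to_prediction(prob_matrix: np.ndarray, alphabet: str):
    indices = prob_matrix.argmax(axis=1)     # most likely char per position
    confidences = prob_matrix.max(axis=1)    # its probability
    code = "".join(alphabet[i] for i in indices)
    flagged = [p for p, c in enumerate(confidences) if c < LOW_CONFIDENCE]
    return code, flagged                     # UI highlights `flagged` positions
```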

  • When you put it all together this is what it looks like.

  • The user takes a picture of a cap,

  • the region of interest is found, the image is normalized.

  • It's then sent into our ConvNet model,

  • the output of which is a character probability matrix.

  • This is the per-character confidence of every character at every position.

  • That is then further analyzed to create a top-ten prediction.

  • Each one of those predictions is vetted against our pincode validation service.

  • The first one that is valid-- which is often the first one on the list--

  • is entered into the promotion,

  • and if none of them are valid,

  • our user sees the active learning experience.
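
Because the per-position probabilities are independent, the ten most likely code strings can be recovered exactly with a small beam search; this sketch, including the `validate` callback standing in for the service, is illustrative rather than the production implementation:

```python
# Illustrative sketch of the candidate step: a beam of width k over the
# per-position distributions yields the exact top-k strings, which are
# vetted in order until one validates.
import heapq
import numpy as np

def top_k_codes(prob_matrix: np.ndarray, alphabet: str, k: int = 10):
    beam = [("", 0.0)]                       # (prefix, log-probability)
    for row in prob_matrix:                  # one row per character position
        logs = np.log(row + 1e-12)
        beam = heapq.nlargest(
            k,
            ((prefix + alphabet[i], lp + logs[i])
             for prefix, lp in beam for i in range(len(alphabet))),
            key=lambda item: item[1],
        )
    return [code for code, _ in beam]

def first_valid(codes, validate):
    # Returns the first candidate the validation service accepts, else None,
    # in which case the user sees the active-learning screen.
    return next((c for c in codes if validate(c)), None)
```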

  • So, our model development effort went through three iterations.

  • Initially, in an effort to keep the model size small upfront,

  • our data science team used binarization to normalize the image.

  • But this was too lossy.

  • It didn't produce enough data to create an accurate model.

  • So, they switched to best channel conversion,

  • which got the accuracy up,

  • but then the model size grew too large to support over-the-air updates.
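
For illustration only (the production pipeline is not public, and "best channel" is interpreted here as keeping the highest-contrast color plane), the two normalization strategies might be sketched with OpenCV like this:

```python
# Binarization thresholds to pure black/white, which is compact but lossy;
# "best channel" keeps a single full-range color plane instead.
import cv2
import numpy as np

def binarize(image_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    return binary                            # tiny input, but detail is lost

def best_channel(image_bgr: np.ndarray) -> np.ndarray:
    channels = cv2.split(image_bgr)          # B, G, R planes
    contrast = [c.std() for c in channels]
    return channels[int(np.argmax(contrast))]  # keep the richest plane
```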

  • So, at this point, our team starts over. (chuckles)

  • They just completely re-architect the ConvNet using SqueezeNet,

  • which is designed to reduce model size

  • by reducing the number of learnable parameters within the model.
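
A minimal Keras sketch of a SqueezeNet-style "fire" module shows the idea: a 1x1 "squeeze" convolution cuts the channel count before the 1x1/3x3 "expand" stage, shrinking the number of learnable parameters; the filter counts are illustrative:

```python
# SqueezeNet-style fire module (illustrative filter counts, not the
# production model): squeeze with 1x1 convs, then expand with 1x1 and 3x3.
from tensorflow.keras import layers

def fire_module(x, squeeze_filters=16, expand_filters=64):
    s = layers.Conv2D(squeeze_filters, 1, activation="relu")(x)   # squeeze
    e1 = layers.Conv2D(expand_filters, 1, activation="relu")(s)   # expand 1x1
    e3 = layers.Conv2D(expand_filters, 3, padding="same",
                       activation="relu")(s)                      # expand 3x3
    return layers.Concatenate()([e1, e3])
```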

  • But, after making this move, we had a problem.

  • We started to experience internal covariate shift,

  • which is the result of reducing the number of learnable parameters.

  • And that means that very small changes to upstream parameter values

  • cascade into huge gyrations in downstream parameter values.

  • So, this slowed our training process considerably,

  • because we had to grind through this covariate shift

  • in order to get the model to converge,

  • if it would converge at all.

  • So, to solve this problem,

  • our team introduced batch normalization,

  • which sped up training, it got the model to converge,

  • and now we're exactly where we want to be.
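
The fix can be sketched as a convolution block with batch normalization inserted before the activation, keeping each layer's inputs in a stable range; again this is illustrative, not the production code:

```python
# Conv + batch norm + ReLU block: normalizing activations per batch damps
# the upstream-to-downstream cascades described above.
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=3):
    x = layers.Conv2D(filters, kernel_size, padding="same",
                      use_bias=False)(x)     # bias is redundant before BN
    x = layers.BatchNormalization()(x)       # normalize activations per batch
    return layers.ReLU()(x)
```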

  • We have a 5MB model,

  • it's a 25-fold decrease from where we started,

  • with accuracy greater than 95%.

  • And the results are impressive.

  • These are some screen grabs from a test site that I built,

  • and you can see across the top row

  • how the model handles different types of occlusion.

  • It also handles translation-- tilting the cap,

  • rotation-- twisting the cap,

  • and camera focus issues.

  • So, you can try this out for yourself.

  • I'm going to pitch the newly-launched Coca-Cola USA app.

  • It hit Android and iPhone app stores a couple of days ago.

  • It does many things, but you can use it to scan a code.

  • You can also go online with your mobile browser

  • to coke.com/rewards,

  • and take a picture of a code to be entered into a promotion.

  • Alright, so some quick shout-outs-- I can't not mention these folks.

  • Quantiphi is the data science team that built our image processing pipeline

  • and the pincode recognition model.

  • Digital Platforms & Innovation at Coke, led by Ellen Duncan.

  • She spearheaded this from the marketing side.

  • And then, my people in IT.

  • My colleague, Andy Donaldson, shepherded this into production.

  • So, thank you.

  • It's been a privilege to speak with you.

  • I covered a lot of ground in ten short minutes.

  • There's a lot of stuff I didn't talk about.

  • So, if you have any questions or any follow-up,

  • please feel free to reach out to me on Twitter @patrickbrandt.

  • You can also hit me up on LinkedIn.

  • That shortcode URL will get you to my profile page: wpb.is/linkedin.

  • You can also read an article I published last year on this solution.

  • It's on the Google Developers blog,

  • and you can get there at wpb.is/tensorflow.

  • Alright. So, thank you.

  • Next up is Alex.

  • And Alex is going to talk to us about Applied ML in Robotics.

  • (applause)

  • ♪ (music) ♪
