[MUSIC PLAYING]
MARK OMERNICK: Good morning.
My name is Mark Omernick, and I'm a software engineer
with Google AI.
Today I'll be talking about two projects.
The first is enhanced Unicode support across the TensorFlow
code base.
And the second is a new tensor type
called RaggedTensors, intended to efficiently represent
sequence data.
First, I'll take a quick look at how we've improved Unicode support
in TensorFlow.
Unicode is a way of encoding characters
from nearly every written language using
sequences of bytes.
Here, these four characters can be represented
as four triplets of bytes.
A string containing these four characters
would be 12 bytes long in total.
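To make that concrete, here is a tiny plain-Python sketch; the four Chinese characters are illustrative stand-ins for the ones on the slide, each encoding to three bytes in UTF-8:

```python
text = "语言处理"                  # four illustrative characters
print(len(text))                  # 4 characters
print(len(text.encode("utf-8")))  # 12 bytes: three bytes per character
```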
Previously, TensorFlow assumed that strings
were indexed by individual bytes, ASCII style.
That led to issues like this, where string split would break
Unicode characters below the character boundary,
and substr would index by bytes instead of characters.
However, now that we've added Unicode support to TensorFlow,
we can correctly handle multi-byte characters.
unicode_split now splits into proper triplets, and substr,
with the UTF8_CHAR unit, indexes by UTF-8 characters.
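A minimal sketch of those two ops, assuming TensorFlow 2.x and the same illustrative four-character string:

```python
import tensorflow as tf

s = tf.constant("语言处理")  # illustrative four-character string

# Split into characters rather than raw bytes.
chars = tf.strings.unicode_split(s, input_encoding="UTF-8")
# -> [b'\xe8\xaf\xad', b'\xe8\xa8\x80', b'\xe5\xa4\x84', b'\xe7\x90\x86']

# Index by UTF-8 characters instead of bytes.
sub = tf.strings.substr(s, pos=0, len=2, unit="UTF8_CHAR")
# -> the first two characters, not the first two bytes
```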
In addition to string splitting, TensorFlow now
supports many other Unicode-aware string
operations, from Unicode encoding and decoding
to string length analysis.
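For example, continuing the sketch above:

```python
import tensorflow as tf

s = tf.constant("语言处理")  # illustrative four-character string

# Decode to Unicode code points, one integer per character.
codepoints = tf.strings.unicode_decode(s, input_encoding="UTF-8")
# -> [35821, 35328, 22788, 29702]

# Length in characters versus bytes.
n_chars = tf.strings.length(s, unit="UTF8_CHAR")  # -> 4
n_bytes = tf.strings.length(s)                    # -> 12 (default unit is BYTE)

# Encode code points back into a UTF-8 string.
round_trip = tf.strings.unicode_encode(codepoints, output_encoding="UTF-8")
```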
For the second part of this presentation,
I'd like to introduce a new tensor type, RaggedTensors,
that we designed to handle text and other variable-length
sequences.
RaggedTensors are a native representation
for sequences of varying shape.
Here you can see a RaggedTensor containing three batch items.
The first is a tensor with two strings,
the second a tensor with four strings, and the third
a tensor with one string, all without any additional padding
or user-facing logic.
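In code, such a RaggedTensor can be written down directly; the strings here are illustrative:

```python
import tensorflow as tf

# Three batch items with two, four, and one string, stored with no padding.
rt = tf.ragged.constant([["all", "the"],
                         ["words", "in", "each", "sentence"],
                         ["here"]])
print(rt.shape)  # (3, None): the second dimension is ragged
```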
RaggedTensors are different from SparseTensors in one key way.
SparseTensors assume that the underlying dense tensor
is regularly shaped and that any unmentioned values are missing.
RaggedTensors, on the other hand, make no such assumption.
Here, for instance, the SparseTensor
interprets the first batch element as John, null, null,
while the RaggedTensor interprets it as simply John.
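A sketch of that contrast; only "John" comes from the slide, and the remaining tokens are made up:

```python
import tensorflow as tf

# The SparseTensor view: a regular 2x3 grid with missing entries.
sp = tf.sparse.SparseTensor(
    indices=[[0, 0], [1, 0], [1, 1], [1, 2]],
    values=["John", "who", "is", "here"],
    dense_shape=[2, 3])
# Row 0 reads as ["John", <missing>, <missing>]

# The RaggedTensor view: rows simply have different lengths.
rt = tf.ragged.constant([["John"], ["who", "is", "here"]])
# Row 0 reads as simply ["John"]
```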
A RaggedTensor can contain any number of irregular dimensions.
Here, for instance, we have a three-dimensional RaggedTensor
that represents every character in every token
in a batch of three sequences.
There are variable numbers of tokens
per sequence and variable numbers of characters
per token.
But with RaggedTensors, you don't
need to worry about maximum sizes, padding,
or anything else.
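A sketch of such a tensor, with made-up text, where both inner dimensions are ragged:

```python
import tensorflow as tf

# Characters per token, tokens per sequence, three sequences in the batch.
rt = tf.ragged.constant([[["h", "i"], ["t", "h", "e", "r", "e"]],
                         [["b", "y", "e"]],
                         [["s", "o"], ["l", "o", "n", "g"]]])
print(rt.shape)  # (3, None, None): two ragged dimensions
```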
RaggedTensors are a native TensorFlow representation
for any varying-length sequence of data,
from words to images and beyond.
You could imagine using RaggedTensors
to contain the set of still frames
in a batch of videos, where each video is a different length.
So how do you use RaggedTensors?
Let's start with building them.
To create a RaggedTensor, you'll need a flat tensor
of values and some specification of how to split
those values into batch items.
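For example, with illustrative values:

```python
import tensorflow as tf

# A flat tensor of values plus row_splits marking batch-item boundaries.
values = tf.constant(["all", "the", "words", "in", "each", "sentence", "here"])
rt = tf.RaggedTensor.from_row_splits(values, row_splits=[0, 2, 6, 7])
# -> [["all", "the"], ["words", "in", "each", "sentence"], ["here"]]

# Equivalent constructors take row lengths or per-value row ids instead.
rt2 = tf.RaggedTensor.from_row_lengths(values, row_lengths=[2, 4, 1])
rt3 = tf.RaggedTensor.from_value_rowids(values, value_rowids=[0, 0, 1, 1, 1, 1, 2])
```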
Once you have a RaggedTensor, you
can perform standard tensor operations
on it, like concatenation and slicing,
even within irregular dimensions.
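For instance, a small sketch with illustrative values:

```python
import tensorflow as tf

a = tf.ragged.constant([["a", "b"], ["c"]])
b = tf.ragged.constant([["d"], ["e", "f", "g"]])

# Concatenate along the ragged dimension.
cat = tf.concat([a, b], axis=1)
# -> [["a", "b", "d"], ["c", "e", "f", "g"]]

# Slice within the ragged dimension.
print(cat[:, 1:])  # drops the first value of every row
```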
RaggedTensors are natively supported
by over 100 TensorFlow core ops ranging from math ops
through string handling to reductions.
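A couple of examples of that dispatching, with made-up numbers:

```python
import tensorflow as tf

rt = tf.ragged.constant([[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]])

# Elementwise math ops apply to ragged tensors directly...
print(tf.add(rt, 1.0))             # [[2, 3], [4, 5, 6], [7]]

# ...and so do reductions along the ragged dimension.
print(tf.reduce_mean(rt, axis=1))  # per-row means: [1.5, 4.0, 6.0]
```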
And if you need to operate on each value in a RaggedTensor,
we provide a native map function.
You can use this to apply ops or even entire subgraphs
to every value in a RaggedTensor.
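For example:

```python
import tensorflow as tf

rt = tf.ragged.constant([[1, 2], [3, 4, 5], [6]])

# Apply a function to every value while keeping the ragged structure.
doubled = tf.ragged.map_flat_values(lambda x: x * 2, rt)
# -> [[2, 4], [6, 8, 10], [12]]
```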
To illustrate how to use RaggedTensors in a model,
let's consider using a bag of character-level embeddings
to create a token-level embedding.
We start by taking a RaggedTensor of tokens
separated by batch and applying unicode_decode, a new op that
outputs a RaggedTensor of Unicode code points separated
by batch and token.
We can then use map_flat_values to get an embedding
for each of these code points.
Now, char_embedding is a four-dimensional RaggedTensor
with batch, token, character, and embedding dimensions.
We can convert it into a standard four-dimensional
tensor, reshape it so that it is token-major,
run a convolution over each character in each token,
then reshape it back into a dense 4D tensor with batch,
token, character, and embedding dimensions.
That 4D dense tensor can be converted back
into a 4D RaggedTensor, which removes any padding.
This RaggedTensor can be reduced, via reduce_mean,
to create per-token embeddings.
At the end, we have a tensor of embeddings, one for each token,
built from characters without any extraneous padding.
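Here is one way that pipeline might look in code. This is a hedged sketch rather than the exact model from the talk: the tokens, the code-point vocabulary size of 128, the embedding width of 8, and the Conv1D layer are all made-up stand-ins, and the padding is stripped at the token-major stage rather than after the second reshape:

```python
import tensorflow as tf

tokens = tf.ragged.constant([["hi", "there"], ["bye"]])  # illustrative batch

# RaggedTensor of Unicode code points, ragged by batch and token.
char_ids = tf.strings.unicode_decode(tokens, input_encoding="UTF-8")

# Embed every code point; char_embedding is a 4D RaggedTensor with
# batch, token, character, and embedding dimensions.
table = tf.random.normal([128, 8])  # assumed vocabulary size and width
char_embedding = tf.ragged.map_flat_values(
    tf.nn.embedding_lookup, table, char_ids)

# Pad to a dense tensor and reshape so that it is token-major.
dense = char_embedding.to_tensor()          # [batch, max_token, max_char, 8]
shape = tf.shape(dense)
token_major = tf.reshape(dense, [-1, shape[2], 8])

# Run a convolution over the characters of each token.
conv = tf.keras.layers.Conv1D(filters=8, kernel_size=3, padding="same")
convolved = conv(token_major)               # [batch * max_token, max_char, 8]

# Convert back to ragged using the true character counts, which strips
# the character padding; padded token slots get length 0.
char_lengths = char_ids.row_lengths(axis=2).to_tensor()  # [batch, max_token]
ragged_chars = tf.RaggedTensor.from_tensor(
    convolved, lengths=tf.reshape(char_lengths, [-1]))

# reduce_mean over characters gives one embedding per token; zero-length
# padded slots yield unusable rows, which the final step discards.
token_vecs = tf.reduce_mean(ragged_chars, axis=1)   # [batch * max_token, 8]
token_vecs = tf.reshape(token_vecs, [shape[0], shape[1], 8])
token_embedding = tf.RaggedTensor.from_tensor(
    token_vecs, lengths=tokens.row_lengths())       # one embedding per token
```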
For more information, you can take a look at the tutorials
available here.
Please try them out and give your feedback on GitHub.
Thank you.
[MUSIC PLAYING]