[MUSIC PLAYING]
LAURENCE MORONEY: Hi, and welcome back
to this series on Zero to Hero with TensorFlow,
where we're looking at Natural Language Processing.
In the last couple of episodes, you
saw how to tokenize text into numeric values,
and how to use tools in TensorFlow
to regularize and pad that text.
Now that we've gotten the preprocessing out of the way,
we can next look at how to build a classifier
to recognize sentiment in text.
We'll start by using a dataset of headlines,
where the headline has been categorized
as sarcastic or not.
We'll train a classifier on this,
and it can then tell us afterwards
if a new piece of text looks like it might be sarcastic.
We'll use Rishabh Misra's dataset from Kaggle,
and you can find details on it here.
The data is nice and simple.
The is_sarcastic field is 1 if the headline is sarcastic, and 0 otherwise.
There is a headline field containing the text we'll train on,
and then there's a URL to the article
if you're interested in reading it.
But we're not going to use this, just the headline text.
The data is stored in JSON format
like this, pretty straightforward.
We'll have to convert it to Python format for training,
so it will look like this.
Every JSON element becomes a Python list element,
and it's all encapsulated in square brackets.
Python's built-in json library can achieve this.
And here's the complete code.
We'll go through it step by step.
First of all, we'll import the JSON library.
Then, we can load in the sarcasm JSON
file using the JSON library.
We can then create lists for the labels, headlines, and article
URLs.
And as we iterate through the JSON,
we load the requisite values into our Python lists.
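The steps just described can be sketched like this. The field names match the dataset; the inline JSON string is a tiny hypothetical stand-in for the real sarcasm file, which you would load with json.load instead.

```python
import json

# Hypothetical stand-in for the sarcasm dataset: the real file holds one
# record per headline with exactly these three fields.
raw = '''[
  {"is_sarcastic": 1, "headline": "scientist unveils doomsday clock of hair loss", "article_link": "https://example.com/1"},
  {"is_sarcastic": 0, "headline": "dem rep. totally nails why congress is falling short", "article_link": "https://example.com/2"}
]'''

# With the real file: datastore = json.load(open("sarcasm.json"))
datastore = json.loads(raw)

# One list per field, filled by iterating over the parsed JSON.
sentences, labels, urls = [], [], []
for item in datastore:
    sentences.append(item["headline"])
    labels.append(item["is_sarcastic"])
    urls.append(item["article_link"])
```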
Now that we have three lists, one with our labels,
one with the text, and one with the URLs,
we can start doing some familiar preprocessing on the text.
Here's the code.
By calling tokenizer.fit_on_texts with the headlines,
we'll create tokens for every word in the corpus.
And then, we'll see them in the word index.
You can see an example of some of the words here.
So "underwood" has been tokenized as 24127,
and "skillingsbolle" (what is that, anyway?) as 23055.
So now, we can turn our sentences
into sequences of tokens, and pad them
to the same length with this code.
If we want to inspect them, we can simply print them out.
Here you can see one tokenized sentence and the shape
of the entire corpus.
That's 26,709 sequences, each with 40 tokens.
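Putting the tokenizing and padding steps together, a minimal sketch looks like this. It uses the legacy tf.keras preprocessing API shown in the video (available with Keras 2; removed in Keras 3), and the two headlines are hypothetical stand-ins for the full corpus.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical sample headlines standing in for the 26,709-sentence corpus.
sentences = [
    "scientist unveils doomsday clock of hair loss",
    "mom starting to fear son's web series closest thing she will have to grandchild",
]

tokenizer = Tokenizer(oov_token="<OOV>")  # reserve a token for unseen words
tokenizer.fit_on_texts(sentences)         # build the word index from the corpus
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)  # words -> token IDs
padded = pad_sequences(sequences, padding="post")    # pad to a common length

print(padded.shape)  # (number of sequences, length of the longest sequence)
```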
Now, there's a problem here.
We don't have a split in the data for training and testing.
We just have a list of 26,709 sequences.
Fortunately, Python makes it super easy
for us to slice this up.
Let's take a look at that next.
So we have a bunch of sentences in a list
and a bunch of labels in a list.
To slice them into training and test sets
is actually pretty easy.
If we pick a training size, say 20,000,
we can cut it up with code like this.
So the training sentences will be the first 20,000 sliced
by this syntax, and the testing sentences
will be the remaining slice, like this.
And we can do the same for the labels
to get a training and a test set.
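The slicing described above can be sketched as follows; the lists and the training size are small hypothetical stand-ins for the 26,709 headlines and the 20,000 cut used in the video.

```python
# Illustrative stand-ins: the video uses training_size = 20000
# against 26,709 headlines in total.
sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
labels = [0, 1, 0, 1, 1, 0]
training_size = 4

# Everything before the cut is training data, everything after is test data.
training_sentences = sentences[0:training_size]
testing_sentences = sentences[training_size:]
training_labels = labels[0:training_size]
testing_labels = labels[training_size:]
```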
But there's a bit of a problem.
Remember earlier we used the tokenizer
to create a word index of every word in the set?
That was all very good.
But if we really want to test its effectiveness,
we have to ensure that the neural net only
sees the training data, and that it never sees the test data.
So we have to rewrite our code to ensure that the tokenizer is
just fit to the training data.
Let's take a look at how to do that now.
Here's the new code to create our training and test sets.
Let's look at it line by line.
We'll first instantiate a tokenizer like before,
but now, we'll fit the tokenizer on just the training sentences
that we split out earlier, instead of the entire corpus.
And now, instead of one overall set of sequences,
we can now create a set of training sequences,
and pad them, and then do exactly the same thing
for the test sequences.
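A sketch of that revised flow, again with the legacy tf.keras preprocessing API and hypothetical sentences and settings (the vocab size and max length are illustrative, not the video's exact values):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical pre-split lists; settings are illustrative.
training_sentences = [
    "scientist unveils doomsday clock of hair loss",
    "dem rep. totally nails why congress is falling short",
]
testing_sentences = ["boehner just wants wife to listen"]
vocab_size, max_length = 10000, 40

tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(training_sentences)  # fit on the TRAINING data only

training_sequences = tokenizer.texts_to_sequences(training_sentences)
training_padded = pad_sequences(training_sequences, maxlen=max_length, padding="post")

# The test set reuses the same word index; words never seen in training map to <OOV>.
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length, padding="post")
```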
It's really that easy.
But you might be wondering at this point,
we've turned our sentences into numbers,
with the numbers being tokens representing words.
But how do we get meaning from that?
How do we determine if something is sarcastic just
from the numbers?
Well, here's where the concept of embeddings comes in.
Let's consider the most basic of sentiments.
Something is good or something is bad.
We often see these as being opposites,
so we can plot them as having opposite directions like this.
So then what happens with a word like "meh"?
It's not particularly good, and it's not particularly bad.
Probably a little more bad than good.
So you might plot it a bit like this.
Or take the phrase "not bad," which usually
describes something as having
a little bit of goodness, but not necessarily very good.
So it might look like this.
Now, if we plot these on an x- and y-axis,
we can represent the good or bad sentiment
as coordinates in x and y.
Good is (1, 0).
Meh is (-0.4, 0.7), et cetera.
By looking at the direction of the vector,
we can start to determine the meaning of the word.
So what if you extend that into multiple dimensions instead
of just two?
What if words that are labeled with sentiments,
like sarcastic and not sarcastic,
are plotted in these multiple dimensions?
And then, as we train, we try to learn
what the direction in these multi-dimensional spaces
should look like.
Words that only appear in the sarcastic sentences
will have a strong component in the sarcastic direction,
and others will have one in the not-sarcastic direction.
As we load more and more sentences
into the network for training, these directions can change.
And when we have a fully trained network
and give it a set of words, it could look up
the vectors for these words, sum them up, and thus, give us
an idea for the sentiment.
This concept is known as embedding.
So going back to this diagram, consider
what would have happened if I said something
was "not bad, a bit meh."
If we were to sum up the vectors,
we'd have something that's 0.7 on y and 0.1 on x.
So its sentiment could be considered slightly
on the good side of neutral.
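That arithmetic can be checked in a few lines. The 2-D coordinates here are hypothetical, chosen only so the sum matches the (0.1, 0.7) figure quoted above; a real embedding learns its coordinates during training.

```python
# Hypothetical 2-D "embedding" coordinates on the good/bad plot.
meh = (-0.4, 0.7)
not_bad = (0.5, 0.0)

# Summing the vectors combines the sentiment of the two words.
combined = (round(meh[0] + not_bad[0], 1), round(meh[1] + not_bad[1], 1))
print(combined)  # (0.1, 0.7): slightly on the good side of neutral
```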
So now, let's take a look at coding this.
Here's my neural network code.
The top layer is an embedding, where
the direction of each word will be learned epoch by epoch.
After that, we pool with a global average pooling,
namely adding up the vectors, as I demonstrated earlier.
This is then fed into a common or garden deep neural network.
Training is now as simple as model.fit,
using the training data and labels,
and specifying the padded test sequences and labels
for the validation data.
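A minimal sketch of that network, assuming the hyperparameters below (vocab size, embedding dimension, and layer widths are illustrative, not necessarily the video's exact values):

```python
import tensorflow as tf

# Illustrative hyperparameters.
vocab_size, embedding_dim = 10000, 16

model = tf.keras.Sequential([
    # Learns a direction (vector) for every word, epoch by epoch.
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    # Averages the word vectors of a sentence into a single vector,
    # in the spirit of the summing described above.
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of sarcasm
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Training, with your prepared data in place of these names:
# model.fit(training_padded, training_labels, epochs=30,
#           validation_data=(testing_padded, testing_labels))
```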
At this URL, you can try it out for yourself.
And here, you can see the results
that I got training it for just 30 epochs.
While it was able to fit the training data to 99% accuracy,
more importantly, with the test data, that is words
that the network has never seen, it still
got 81% to 82% accuracy, which is pretty good.
So how do we use this to establish sentiment
for new sentences?
Here's the code.
Let's create a couple of sentences
that we want to classify.
The first one looks a little bit sarcastic,
and the second one's quite plain and boring.
We'll use the tokenizer that we created earlier
to convert them into sequences.
This way, the words will have the same tokens as the training
set.
We'll then pad those sequences to be
the same dimensions as those in the training set
and use the same padding type.
And we can then predict on the padded set.
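Sketching those prediction steps end to end (in practice you would reuse the tokenizer and the trained model from earlier; the small untrained model and one-sentence fit here are hypothetical stand-ins so the snippet is self-contained):

```python
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Stand-ins for the tokenizer and model built earlier in the video.
max_length = 40
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(["granny starting to fear spiders in the garden might strike"])
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(100, 16),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

sentences = [
    "granny starting to fear spiders in the garden might strike again",
    "the weather today is bright and sunny",
]

# Same tokens and same padding settings as the training set.
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")

probs = model.predict(padded, verbose=0)  # one sarcasm probability per sentence
```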
The results are like this.
The first sentence gives me 0.91, which is very close to 1,
indicating that there's a very high probability of sarcasm.
The second is 5 times 10 to the minus 6,
indicating an extremely low chance of sarcasm.
It does seem to be working.
All of this code is runnable in a Colab at this URL.
So give it a try for yourself.
You've now built your first text-classification model
to understand sentiment in text.
Try it out, and let
us know what kind of classifiers you build.
I hope you've enjoyed this short series,
and there's more on the way.
So don't forget to hit that Subscribe button
and get the latest and greatest in AI videos
right here on the TensorFlow Channel.
[MUSIC PLAYING]