Flower, dog, anxious, senior, car, Item, president, worried, avocado, Zendaya,
licorice, Nerdfighter, toothbrush, zany, expedient,
This isn't really a vlogbrothers video.
It's just a random string of words.
There aren't any coherent sentences.
It looks like John-Green-bot could use some help speaking a bit more like human John Green
- sounds like an excellent task for Natural Language Processing.
INTRO
Hey, I'm Jabril and welcome to Crash Course AI!
Today, we're going to tackle another hands-on lab.
Our goal today is to get John-Green-bot to produce language that sounds like human John
Green… and have some fun while doing it.
We'll be writing all of our code using a language called Python in a tool called Google
Colaboratory, and as you watch this video, you can follow along with the code in your
browser from the link we put in the description.
In these Colaboratory files, there's some regular text explaining what we're trying
to do, and pieces of code that you can run by pushing the play button.
Now, these pieces of code build on each other, so keep in mind that we have to run them in
order from top to bottom, otherwise we might get an error.
To actually run the code and experiment with changing it you'll have to either click
“open in playground” at the top of the page or open the File menu and click “Save
a Copy to Drive”.
And just an fyi: you'll need a Google account for this.
Now, we're going to build an AI model that plays a clever game of fill-in-the-blank.
We'll be able to give John-Green-bot any word prompt like “good morning,” and he'll
be able to finish the sentence.
Like any AI, John-Green-bot won't really understand anything, but AI generally does
a really good job of finding and copying patterns.
When we teach any AI system to understand and produce language, we're really asking
it to find and copy patterns in some behavior.
So to build a natural language processing AI, we need to do four things:
First, gather and clean the data.
Second, set up the model.
Third, train the model.
And fourth, make predictions.
So let's start with the first step: gather and clean the data.
In this case, the data are lots of examples of human John Green talking, and thankfully,
he's talked a lot online.
We need some way to process his speech.
And how can we do that?
Subtitles.
And conveniently there's a whole database of subtitle files on the nerdfighteria wiki
that I pulled from.
I went ahead and collected a bunch and put them into one big file that's hosted on
Crash Course AI's GitHub.
This first bit of code in 1.1 loads it.
So if you wanted to try to make your AI sound like someone else, like Michael from Vsauce,
or me, this is where you'd load all that text instead.
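If it helps to picture it, a loading step like that can be as simple as reading one big text file into a string. The filename below is just a placeholder, not the lab's actual file.

```python
# Minimal sketch of step 1.1: read the collected subtitle text into one string.
# "vlogbrothers_subtitles.txt" is a placeholder name, not the lab's actual file.
with open("vlogbrothers_subtitles.txt", encoding="utf-8") as f:
    raw_text = f.read()

print(len(raw_text), "characters loaded")
```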
Data gathering is often the hardest and slowest part of any machine learning project, but
in this instance it's pretty straightforward.
Regardless, we still aren't done yet; now we need to clean and prep our data for our
model.
This is called preprocessing.
Remember, a computer can only process data as numbers, so we need to split our sentences
into words, and then convert our words into numbers.
When we're building a natural language processing program, the term “word” may not capture
everything we need to know.
How many instances of a word there are can also be useful.
So instead, we'll use the terms lexical type and lexical token.
Now a lexical type is a word, and a lexical token is a specific instance of a word, including
any repeats.
So, for example, in the sentence:
The goal of machine learning is to make a learning machine.
We have eleven lexical tokens but only nine lexical types, because “learning” and
“machine” both occur twice.
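Here's a tiny sketch of that count in Python, using a naive split on spaces (punctuation handling comes next):

```python
# Rough illustration of lexical tokens vs. lexical types with a naive whitespace split.
sentence = "The goal of machine learning is to make a learning machine"
tokens = sentence.lower().split()   # every instance of a word
types = set(tokens)                 # each distinct word, counted once

print(len(tokens))  # 11 lexical tokens
print(len(types))   # 9 lexical types ("machine" and "learning" repeat)
```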
In natural language processing, tokenization is the process of splitting a sentence into
a list of lexical tokens.
In English, we put spaces between words, so let's start by slicing up the sentence at
the spaces.
“Good morning Hank, it's Tuesday.”
would turn into a list like this.
And we would have five tokens.
However there are a few problems.
Something tells me we don't really want a lexical type for Hank-comma and Tuesday-period,
so let's add some extra rules for punctuation.
Thankfully, there are prewritten libraries for this.
Using one of those, the list would look something like this.
In this case we would have eight tokens instead of five, and tokenization even helped split
up our contraction “it's” into “it” and “apostrophe-s.”
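One prewritten library that handles this is NLTK; the lab may use a different tokenizer, but the idea looks something like this:

```python
# One possible prewritten tokenizer: NLTK's word_tokenize.
import nltk
nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once
from nltk.tokenize import word_tokenize

print(word_tokenize("Good morning Hank, it's Tuesday."))
# ['Good', 'morning', 'Hank', ',', 'it', "'s", 'Tuesday', '.']  -> 8 tokens
```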
Looking back at our code, before tokenization, we had over 30,000 lexical types.
This code also splits our data into a training dataset and a validation dataset.
We want to make sure the model learns from the training data, but we can test it on new
data it's never seen before.
That's what the validation dataset is for.
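A split like that can be as simple as slicing the token list; the 90/10 ratio here is just an example, not necessarily what the lab uses:

```python
# Sketch of a train/validation split over the tokenized corpus.
all_tokens = word_tokenize(raw_text)        # full corpus from the earlier sketches
split_point = int(len(all_tokens) * 0.9)    # e.g. 90% for training
train_tokens = all_tokens[:split_point]
valid_tokens = all_tokens[split_point:]
```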
We can count up our lexical types and lexical tokens with this bit of code in box 1.3.
And it looks like we actually have about 23,000 unique lexical types.
But remember, the number of instances of a word can also be useful.
This code block here at step 1.4 lets us see how many lexical types occur only once, twice,
and so on.
It looks like we've got a lot of rare words -- almost 10,000 words occur only once!
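A frequency check like the one in 1.4 is easy to sketch with Python's Counter, though the lab's actual code may look different:

```python
# Count how often each lexical type appears, then see how many appear only once.
from collections import Counter

freqs = Counter(train_tokens)
singletons = [word for word, count in freqs.items() if count == 1]
print(len(freqs), "lexical types,", len(singletons), "of which occur only once")
```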
Having rare words is really tricky for AI systems, because they're trying to find
and copy patterns, so they need lots of examples of how to use each word.
Oh, human John Green.
You master of prose.
Let's see what weird words you use.
Pisgah?
What even is a lilliputian?
Some of these are pretty tricky and are going to be too hard for John-Green-bot's AI to
learn with just this dataset.
But others seem doable if we take advantage of morphology.
Morphology is the way a word gets shape-shifted to match a tense, like you'd add an “ED”
to make something past tense, or when you shorten or combine words to make them totes-amazeballs.
Dear viewers, I did not write that in the script.
In English, we can remove a lot of extra word endings, like ED, ING, or LY, through a process
called stemming.
And so, with a few simple rules, we can clean up our data even more.
I'm also going to simplify the data by replacing numbers with the hashtag or pound sign, whatever you want to call it.
This should take care of a lot of rare words.
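Here's a rough sketch of those cleanup rules, using NLTK's PorterStemmer as one readily available stemmer and a regular expression for the numbers; the lab's exact rules may differ:

```python
# Stem common endings and replace runs of digits with a '#' placeholder.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_token(token):
    token = re.sub(r"\d+", "#", token)   # any run of digits becomes '#'
    return stemmer.stem(token.lower())   # strips endings like -ed, -ing, -ly

print([clean_token(t) for t in ["jumped", "running", "really", "2019"]])
# e.g. ['jump', 'run', 'realli', '#']
```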
Now we have 3,000 fewer lexical types, and only about 8,000 words occur just once.
We really need multiple examples of each word for our AI to learn patterns reliably, so
we'll simplify even more by replacing each of those 8,000 or so rare lexical tokens with
the word 'unk' or unknown.
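That replacement might look something like this sketch, assuming we keep only words that appear at least twice:

```python
# Replace rare lexical types with 'unk' so the model never has to learn from a single example.
min_count = 2   # assumption: keep words that appear at least twice
vocab = {word for word, count in freqs.items() if count >= min_count}
train_tokens = [t if t in vocab else "unk" for t in train_tokens]
```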
Basically, we don't want John-Green-bot to get embarrassed if he sees a word he doesn't
know.
So by hiding some words, we can teach John-Green-bot how to keep writing when he bumps into one-time
made-up words like zombicorns.
And just to satisfy my curiosity…
Yeah, John-Green-bot doesn't need words like “whippersnappers” or “zombification”.
John, what's up with the fixation on zombies?
Anyway, we'll be fine without them.
Now that we finally have our data all cleaned and put together, we're done with preprocessing
and can move on to Step 2: setting up the model for John-Green-bot.
There are a couple key things that we need to do.
First, we need to convert the sentences into lists of numbers.
We want one number for every lexical type, so we'll build a dictionary that assigns every
word in our vocabulary a number.
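Building on the earlier sketches, that dictionary might look like this:

```python
# One index per lexical type; 'unk' gets an index too.
word_to_index = {word: i for i, word in enumerate(sorted(vocab | {"unk"}))}
train_ids = [word_to_index[t] for t in train_tokens]
print(train_ids[:10])   # the first few words of the corpus, as numbers
```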
Second, unlike us, the model can read a bunch of words at the same time, and we want to
take advantage of that to help John-Green-bot learn quickly.
So we're going to split our data into pieces called batches.
Here, we're telling the model to read 20 sequences (which have 35 words each) at the
same time!
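Here's one way that batching could look, assuming PyTorch-style tensors; the lab's exact code may differ:

```python
# Reshape the long stream of word IDs into 20 parallel sequences,
# then read 35-word chunks of each at a time.
import torch

batch_size, seq_len = 20, 35
ids = torch.tensor(train_ids)
n_cols = len(ids) // batch_size
ids = ids[: n_cols * batch_size].view(batch_size, n_cols)   # one row per sequence stream
first_batch = ids[:, :seq_len]                               # a 20 x 35 block the model reads at once
```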
Alright!
Now, it's time to finally build our AI.
We're going to program John-Green-bot with a simple language model that takes in a few
words and tries to complete the rest of the sentence.
So we'll need two key parts, an embedding matrix and a recurrent neural network or RNN.
Just like we discussed in the Natural Language Processing video last week, this is an “Encoder-Decoder”
framework.
So let's take it apart.
An embedding matrix is a big list of vectors, which is basically a big table of numbers,
where each row corresponds to a different word.
These vector-rows capture how related two words are.
So if two words are used in similar ways, then the numbers in their vectors should be
similar.
But to start, we don't know anything about the words, so we just assign every word a
vector with random numbers.
Remember we replaced all the words with numbers in our training data, so now when the system
reads in a number, it just looks up that row in the table and uses the corresponding vector
as an input.
Part 1 is done: Words become indices, which become vectors, and our embedding matrix is
ready to use.
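As a sketch, assuming a PyTorch-style embedding layer and a made-up vector size of 100:

```python
# A random table with one row (vector) per word in the vocabulary.
import torch
import torch.nn as nn

embedding_dim = 100   # assumed vector size
embedding = nn.Embedding(len(word_to_index), embedding_dim)   # starts out random
vector = embedding(torch.tensor(word_to_index["unk"]))        # look up one word's row
```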
Now, we need a model that can use those vectors intelligently.
This is where the RNN comes in.
We talked about the structure of a recurrent neural network in our last video too, but
it's basically a model that slowly builds a hidden representation by incorporating one
new word at a time.
Depending on the task, the RNN will combine new knowledge in different ways.
With John-Green-bot, we're training our RNN with sequences of words from Vlogbrothers
scripts.
Ultimately, our AI is trying to build a good summary to make sure a sentence has some overall
meaning, and it's keeping track of the last word to produce a sentence that sounds like
English.
The RNN's output after reading the most recent word in a sentence is what we'll
use to predict the next word.
And this is what we'll use to train John-Green-bot's AI after we build it. All of this is wrapped up in code block 2.3.
So Part 2 is done. We've got our embedding matrix and our RNN.
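Putting those two parts together, a model like the one in block 2.3 might be sketched like this; the LSTM choice and the layer sizes here are assumptions, not necessarily what the lab uses:

```python
# An embedding layer feeding an RNN, plus a linear layer that scores
# every word in the vocabulary as a candidate next word.
import torch
import torch.nn as nn

class JohnGreenBotModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=100, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)    # word index -> vector
        self.rnn = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)            # hidden state -> next-word scores

    def forward(self, word_ids, hidden=None):
        vectors = self.embedding(word_ids)           # (batch, seq_len, embedding_dim)
        output, hidden = self.rnn(vectors, hidden)   # hidden representation built word by word
        return self.decoder(output), hidden          # scores for every possible next word

model = JohnGreenBotModel(vocab_size=len(word_to_index))
```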
Now, we're ready for Step 3: train our model.