
  • Flower, dog, anxious, senior, car, item, president, worried, avocado, Zendaya,

  • licorice, Nerdfighter, toothbrush, zany, expedient,

  • This isn't really a vlogbrothers video.

  • It's just a random string of words.

  • There aren't any coherent sentences.

  • It looks like John-Green-bot could use some help speaking a bit more like human John Green

  • - sounds like an excellent task for Natural Language Processing.

  • INTRO

  • Hey, I'm Jabril and welcome to Crash Course AI!

  • Today, we're going to tackle another hands-on lab.

  • Our goal today is to get John-Green-bot to produce language that sounds like human John

  • Green, and have some fun while doing it.

  • We'll be writing all of our code using a language called Python in a tool called Google

  • Colaboratory, and as you watch this video, you can follow along with the code in your

  • browser from the link we put in the description.

  • In these Colaboratory files, there's some regular text explaining what we're trying

  • to do, and pieces of code that you can run by pushing the play button.

  • Now, these pieces of code build on each other, so keep in mind that we have to run them in

  • order from top to bottom, otherwise we might get an error.

  • To actually run the code and experiment with changing it, you'll have to either click

  • “Open in Playground” at the top of the page or open the File menu and click “Save

  • a Copy to Drive”.

  • And just an FYI: you'll need a Google account for this.

  • Now, we're going to build an AI model that plays a clever game of fill-in-the-blank.

  • We'll be able to give John-Green-bot any word prompt like “good morning,” and he'll

  • be able to finish the sentence.

  • Like any AI, John-Green-bot won't really understand anything, but AI generally does

  • a really good job of finding and copying patterns.

  • When we teach any AI system to understand and produce language, we're really asking

  • it to find and copy patterns in some behavior.

  • So to build a natural language processing AI, we need to do four things:

  • First, gather and clean the data.

  • Second, set up the model.

  • Third, train the model.

  • And fourth, make predictions.

  • So let's start with the first step: gather and clean the data.

  • In this case, the data are lots of examples of human John Green talking, and thankfully,

  • he's talked a lot online.

  • We need some way to process his speech.

  • And how can we do that?

  • Subtitles.

  • And conveniently there's a whole database of subtitle files on the Nerdfighteria Wiki

  • that I pulled from.

  • I went ahead and collected a bunch and put them into one big file that's hosted on

  • Crash Course AI's GitHub.

  • This first bit of code in 1.1 loads it.

  • So if you wanted to try to make your AI sound like someone else, like Michael from Vsauce,

  • or me, this is where you'd load all that text instead.

  • Data gathering is often the hardest and slowest part of any machine learning project, but

  • in this instance it's pretty straightforward.

  • Regardless, we still aren't done yet; now we need to clean and prep our data for our

  • model.

  • This is called preprocessing.

  • Remember, a computer can only process data as numbers, so we need to split our sentences

  • into words, and then convert our words into numbers.

  • When we're building a natural language processing program, the term “word” may not capture

  • everything we need to know.

  • How many instances there are of a word can also be useful.

  • So instead, we'll use the terms lexical type and lexical token.

  • Now a lexical type is a word, and a lexical token is a specific instance of a word, including

  • any repeats.

  • So, for example, in the sentence:

  • The goal of machine learning is to make a learning machine.

  • We have eleven lexical tokens but only nine lexical types, because “learning” and

  • “machine” both occur twice.
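As a quick illustration of that distinction, here is a minimal Python sketch using plain string splitting (no real tokenizer yet, and punctuation left off to keep it simple):

```python
# Counting lexical tokens vs. lexical types for the example sentence.
sentence = "the goal of machine learning is to make a learning machine"

tokens = sentence.split()   # every word occurrence, repeats included
types = set(tokens)         # unique words only

print(len(tokens))          # 11 lexical tokens
print(len(types))           # 9 lexical types ("machine" and "learning" repeat)
```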

  • In natural language processing, tokenization is the process of splitting a sentence into

  • a list of lexical tokens.

  • In English, we put spaces between words, so let's start by slicing up the sentence at

  • the spaces.

  • “Good morning Hank, it's Tuesday.”

  • would turn into a list like this.

  • And we would have five tokens.

  • However there are a few problems.

  • Something tells me we don't really want a lexical type for Hank-comma and Tuesday-period,

  • so let's add some extra rules for punctuation.

  • Thankfully, there are prewritten libraries for this.

  • Using one of those, the list would look something like this.

  • In this case we would have eight tokens instead of five, and tokenization even helped split

  • up our contraction “it's” into “it” and “apostrophe-s.”
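Here's a hedged sketch of that kind of library tokenization, using NLTK's word_tokenize as a stand-in; the actual Colab notebook may use a different tokenizer, but the behavior is similar:

```python
# A sketch using NLTK's tokenizer to split punctuation and contractions.
import nltk
nltk.download("punkt")   # tokenizer models (resource name may vary by NLTK version)
from nltk.tokenize import word_tokenize

tokens = word_tokenize("Good morning Hank, it's Tuesday.")
print(tokens)
# ['Good', 'morning', 'Hank', ',', 'it', "'s", 'Tuesday', '.']  -> 8 tokens
```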

  • Looking back at our code, before tokenization, we had over 30,000 lexical types.

  • This code also splits our data into a training dataset and a validation dataset.

  • We want to make sure the model learns from the training data, but we can test it on new

  • data it's never seen before.

  • That's what the validation dataset is for.
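A minimal sketch of that kind of split, assuming the subtitle lines are already sitting in a Python list called lines (a placeholder name; the notebook's split in block 1.2 may shuffle or use different proportions):

```python
# A rough 90/10 split into training and validation data.
lines = ["good morning hank , it 's tuesday .", "so here 's the thing ."]  # placeholder data

split_point = int(0.9 * len(lines))
train_data = lines[:split_point]    # the model learns patterns from these
valid_data = lines[split_point:]    # held back to test on text the model has never seen
```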

  • We can count up our lexical types and lexical tokens with this bit of code in box 1.3.

  • And it looks like we actually have about 23,000 unique lexical types.

  • But remember, how many instances there are of a word can also be useful.

  • This code block here at step 1.4 allows us to separate how many lexical types occur more

  • than once, twice, and so on.

  • It looks like we've got a lot of rare words -- almost 10,000 words occur only once!
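Counting those frequencies boils down to something like this sketch, where tokens stands in for the full token list (placeholder data, using Python's collections.Counter):

```python
# Count how often each lexical type occurs, then pick out the one-off words.
from collections import Counter

tokens = ["good", "morning", "hank", "good", "morning", "pisgah"]  # placeholder data
counts = Counter(tokens)

rare_types = [word for word, n in counts.items() if n == 1]
print(len(rare_types), "lexical types occur only once")  # here: 2 ("hank" and "pisgah")
```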

  • Having rare words is really tricky for AI systems, because they're trying to find

  • and copy patterns, so they need lots of examples of how to use each word.

  • Oh, human John Green.

  • You master of prose.

  • Let's see what weird words you use.

  • Pisgah?

  • What even is a lilliputian?

  • Some of these are pretty tricky and are going to be too hard for John-Green-bot's AI to

  • learn with just this dataset.

  • But others seem doable if we take advantage of morphology.

  • Morphology is the way a word gets shape-shifted to match a tense, like you'd add an “ED”

  • to make something past tense, or when you shorten or combine words to make them totes-amazeballs.

  • Dear viewers, I did not write that in the script.

  • In English, we can remove a lot of extra word endings, like ED, ING, or LY, through a process

  • called stemming.

  • And so, with a few simple rules, we can clean up our data even more.

  • I'm also going to simplify the data by replacing numbers with the hashtag or pound sign, whatever you want to call it.

  • This should take care of a lot of rare words.
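Here's a rough sketch of that kind of cleanup, using NLTK's PorterStemmer and a simple regular expression for numbers; the notebook's exact rules may differ:

```python
# Replace numbers with '#' and strip common endings with a stemmer.
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def clean_token(token):
    if re.fullmatch(r"\d+([.,]\d+)*", token):   # e.g. "2019" or "3.5"
        return "#"
    return stemmer.stem(token)                   # "learning" -> "learn", "walked" -> "walk"

print([clean_token(t) for t in ["learning", "walked", "2019"]])
# ['learn', 'walk', '#']
```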

  • Now we have 3,000 fewer lexical types, and only about 8,000 words occur just once.

  • We really need multiple examples of each word for our AI to learn patterns reliably, so

  • we'll simplify even more by replacing each of those 8,000 or so rare lexical tokens with

  • the word 'unk' or unknown.
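A minimal sketch of that 'unk' replacement, assuming we already have a frequency count from the step above (placeholder data):

```python
# Replace every one-off lexical type with 'unk'.
from collections import Counter

tokens = ["zombicorns", "are", "not", "real", "but", "zombies", "are"]  # placeholder data
counts = Counter(tokens)

tokens = [t if counts[t] > 1 else "unk" for t in tokens]
print(tokens)   # ['unk', 'are', 'unk', 'unk', 'unk', 'unk', 'are']
```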

  • Basically, we don't want John-Green-bot to get embarrassed if he sees a word he doesn't

  • know.

  • So by hiding some words, we can teach John-Green-bot how to keep writing when he bumps into a one-time

  • made-up word like zombicorns.

  • And just to satisfy my curiosity…

  • Yeah, John-Green-bot doesn't need words like “whippersnappers” or “zombification.”

  • John, what's up with the fixation with zombies?

  • Anyway, we'll be fine without them.

  • Now that we finally have our data all cleaned and put together, we're done with preprocessing

  • and can move on to Step 2: setting up the model for John-Green-bot.

  • There are a couple key things that we need to do.

  • First, we need to convert the sentences into lists of numbers.

  • We want one number for every lexical type, so we'll build a dictionary that assigns every

  • word in our vocabulary a number.
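Building that dictionary can be as simple as this sketch (the variable names here are illustrative, not the notebook's):

```python
# Assign every lexical type a number, then convert the token list to numbers.
tokens = ["good", "morning", "hank", "good", "morning", "unk"]  # placeholder data

word2index = {}
for token in tokens:
    if token not in word2index:
        word2index[token] = len(word2index)   # next unused number

numbers = [word2index[t] for t in tokens]
print(word2index)   # {'good': 0, 'morning': 1, 'hank': 2, 'unk': 3}
print(numbers)      # [0, 1, 2, 0, 1, 3]
```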

  • Second, unlike us, the model can read a bunch of words at the same time, and we want to

  • take advantage of that to help John-Green-bot learn quickly.

  • So we're going to split our data into pieces called batches.

  • Here, we're telling the model to read 20 sequences (which have 35 words each) at the

  • same time!
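Here's a hedged sketch of that batching step, assuming the word indices live in PyTorch tensors (an assumption about the notebook) and using the same numbers, 20 sequences of 35 words:

```python
# Reshape a long stream of word indices into 20 parallel sequences,
# then read it 35 words at a time.
import torch

batch_size, seq_len = 20, 35
data = torch.randint(0, 1000, (100_000,))   # placeholder stream of word indices

n_batches = data.size(0) // batch_size
data = data[: n_batches * batch_size]       # trim the leftover tail
data = data.view(batch_size, -1)            # one row per parallel sequence

first_chunk = data[:, 0:seq_len]            # a 20 x 35 block the model reads at once
print(first_chunk.shape)                    # torch.Size([20, 35])
```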

  • Alright!

  • Now, it's time to finally build our AI.

  • We're going to program John-Green-bot with a simple language model that takes in a few

  • words and tries to complete the rest of the sentence.

  • So we'll need two key parts, an embedding matrix and a recurrent neural network or RNN.

  • Just like we discussed in the Natural Language Processing video last week, this is an “Encoder-Decoder”

  • framework.

  • So let's take it apart.

  • An embedding matrix is a big list of vectors, which is basically a big table of numbers,

  • where each row corresponds to a different word.

  • These vector-rows capture how related two words are.

  • So if two words are used in similar ways, then the numbers in their vectors should be

  • similar.

  • But to start, we don't know anything about the words, so we just assign every word a

  • vector with random numbers.

  • Remember we replaced all the words with numbers in our training data, so now when the system

  • reads in a number, it just looks up that row in the table and uses the corresponding vector

  • as an input.

  • Part 1 is done: Words become indices, which become vectors, and our embedding matrix is

  • ready to use.
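As a sketch, an embedding matrix in PyTorch looks something like this; the sizes are illustrative, and the real notebook may configure it differently:

```python
# An embedding matrix: one (initially random) row of numbers per lexical type,
# looked up by word index.
import torch
import torch.nn as nn

vocab_size, embed_dim = 20_000, 128
embedding = nn.Embedding(vocab_size, embed_dim)    # starts out random

word_indices = torch.tensor([0, 1, 2, 0])           # e.g. "good morning hank good"
vectors = embedding(word_indices)
print(vectors.shape)                                # torch.Size([4, 128]); one vector per token
```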

  • Now, we need a model that can use those vectors intelligently.

  • This is where the RNN comes in.

  • We talked about the structure of a recurrent neural network in our last video too, but

  • it's basically a model that slowly builds a hidden representation by incorporating one

  • new word at a time.

  • Depending on the task, the RNN will combine new knowledge in different ways.

  • With John-Green-bot, we're training our RNN with sequences of words from Vlogbrothers

  • scripts.

  • Ultimately, our AI is trying to build a good summary to make sure a sentence has some overall

  • meaning, and it's keeping track of the last word to produce a sentence that sounds like

  • English.

  • The RNN's output after reading the latest word in a sentence is what we'll

  • use to predict the next word.

  • And this is what we'll use to train John-Green-bot's AI after we build it. All of this is wrapped up in code block 2.3.
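For reference, a stripped-down version of that kind of model might look like the sketch below; treat it as an outline, since the notebook's block 2.3 may use different layer sizes or a different recurrent unit:

```python
# An embedding matrix feeding an RNN, with a linear layer that turns the RNN's
# output into a score for every word in the vocabulary (the next-word prediction).
import torch
import torch.nn as nn

class WordPredictor(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_indices, hidden=None):
        vectors = self.embedding(word_indices)        # look up each word's vector
        outputs, hidden = self.rnn(vectors, hidden)   # build up the hidden summary
        scores = self.decoder(outputs)                # one score per vocabulary word
        return scores, hidden

model = WordPredictor(vocab_size=20_000)
batch = torch.randint(0, 20_000, (20, 35))            # 20 sequences of 35 word indices
scores, hidden = model(batch)
print(scores.shape)                                    # torch.Size([20, 35, 20000])
```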

  • So Part 2 is done. We've got our embedding matrix and our RNN.

  • Now, we're ready for Step 3: train our model.