
  • So I wanted to make a video about

  • GPT-2

  • Because it's been in the news recently

  • this very powerful language model from OpenAI, and I thought it would make sense to start by just doing a video about

  • transformers and language models in general because

  • GPT-2 is a very large

  • Language model implemented as a transformer, but you have a previous video about generating YouTube comments, which is the same kind of task, right?

  • That's a language modeling task, yeah

  • (a sample of that generated output: "...to generate new samples for cooling of the most complex or magnetic consistent brackets like a computer to expect found in creating organizations")

  • I believe that video was made October 2017 and this paper came out December 2017, which has kind of

  • Revolutionized the way that people carry out that kind of task. That's not GPT-2, that's something before that, right?

  • That's the transformer, which is a relatively new

  • architecture

  • for neural networks that can actually do all kinds of tasks, but they're especially good at this kind of

  • language modeling task

  • a language model is a probability distribution over like sequences of

  • tokens or symbols or words or whatever in a language

  • So for any given like sequence of tokens, it can tell you how likely that is

  • So if you have a good language model of English

  • It can look at a sequence of you know words or characters or whatever and say how likely that is to occur in English

  • How likely that is to be an English phrase or sentence or whatever
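
A minimal sketch of what "how likely is this to be English" means in code, assuming a hypothetical next_word_probs(context) function that returns the model's conditional distribution over the next token:

```python
import math

def sequence_log_prob(tokens, next_word_probs):
    """Score a whole sequence with the chain rule:
    P(w1..wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1..wn-1).
    `next_word_probs(context)` is a hypothetical function returning a
    dict {token: P(token | context)}. Log-probabilities are summed
    rather than multiplying raw probabilities, to avoid underflow."""
    log_p = 0.0
    for i, token in enumerate(tokens):
        dist = next_word_probs(tokens[:i])         # condition on what came before
        log_p += math.log(dist.get(token, 1e-12))  # tiny floor for unseen tokens
    return log_p  # closer to 0 means "more plausible English"

# A good English model gives "the cat sat on the mat" a much higher
# score than the same words shuffled randomly.
```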

  • And when you have that you can use that for a lot of different tasks. So

  • If you want to generate text, then you can just sort of sample from that distribution and keep giving it

  • its own output

  • so you sample a word and then you say

  • And to be clear sampling from a distribution means you're just taking

  • you're sort of rolling the dice on that probability distribution and taking whichever one comes out. So

  • so you can like sample a word and then

  • and then say, okay, conditioning on that: given that the first word of this sentence is "The",

  • What does the probability distribution look like for the second word?

  • And then you sample from that distribution and then it's you know

  • with "cat", and you say, given that it's "The cat", what's likely to come next, and so on, so you can build

  • up a

  • string of text by sampling from

  • your distribution. That's one of the things you could use it for
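
That sample-and-feed-back loop, as a rough sketch; next_word_probs is the same hypothetical function as above, and the "<end>" token is assumed for illustration:

```python
import random

def generate(next_word_probs, seed=("The",), max_len=20):
    """Generate text by repeatedly rolling the dice on the model's
    conditional distribution and feeding the result back in as context."""
    tokens = list(seed)
    for _ in range(max_len):
        dist = next_word_probs(tokens)                       # P(next | text so far)
        words, probs = zip(*dist.items())
        nxt = random.choices(words, weights=probs, k=1)[0]   # sample, don't argmax
        if nxt == "<end>":                                   # assumed stop token
            break
        tokens.append(nxt)
    return " ".join(tokens)
```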

  • most of us kind of have an example of this sort of thing in our pockets

  • Yeah, absolutely right, and that's the way that most people interact with a language model

  • I guess this is how I often start a sentence

  • apparently with "I": "I am not sure if you have any questions or concerns, please visit the

  • plugin settings so I can do it for the first time in the future." That's no good

  • Here's a different option. Let's just see where this one goes. Maybe the same

  • "I am in the morning

  • but I can't find it on the phone screen from the phone screen on the phone screen on the phone screen on the phone screen

  • on the phone screen." I don't actually know how this is implemented

  • it might be a neural network, but my guess is that it's some kind of

  • like Markov model Markov chain type setup where you just

  • for each word in your language, you look at your data set and you see how often each other word is

  • Following that word and then that's how you build your distribution

  • So like for the word "I" the most common word to follow that is "am" and there are a few others, you know
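
My guess at roughly what the phone is doing, sketched as an order-1 Markov chain (a bigram model); the toy corpus here is made up:

```python
import random
from collections import Counter, defaultdict

def train_bigram(words):
    """For each word, count how often every other word follows it."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def suggest(follows, prev):
    """Sample a next word in proportion to how often it followed `prev`."""
    options, weights = zip(*follows[prev].items())
    return random.choices(options, weights=weights, k=1)[0]

corpus = "I am here and I am happy and I am here".split()
follows = train_bigram(corpus)
print(suggest(follows, "I"))  # "am" -- the only word that ever follows "I" here
```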

  • so this is like a very simple model and

  • This sentence, "on the phone screen on the phone screen on the phone screen on the phone screen on the phone screen",

  • is actually very unlikely, right?

  • This is a super low probability sentence; why would somebody type this? And the thing is, it's, like, myopic

  • I'm not sure, but it's probably only looking at the previous word

  • It might be looking at, like, the previous two words, but the problem is, to look further back it becomes extremely expensive,

  • computationally expensive, right?

  • Like you've got, I don't know, 50,000 words that you might be looking at, and so then you're remembering

  • 50,000 probability distributions or

  • 50,000 top three words

  • but you know then if you want to do

  • two, that's 50,000 squared, right? And if you want to go back three words

  • you have to cube it. So you're raising it to the power of the number of words back you want to go.
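
The blow-up is just the vocabulary size raised to the number of context words; checking the numbers for a 50,000-word vocabulary:

```python
vocab = 50_000
for n in range(1, 4):
    print(f"{n} word(s) of context: {vocab ** n:,} possible contexts")
# 1 word(s) of context: 50,000 possible contexts
# 2 word(s) of context: 2,500,000,000 possible contexts
# 3 word(s) of context: 125,000,000,000,000 possible contexts
```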

  • Which means that this type of model

  • basically doesn't look back. By the time we're saying "on the", it's already forgotten the previous time

  • it said "on the". It doesn't realize that it's repeating itself, and there are slightly better things you can do in this general area

  • But like fundamentally if you don't remember you're not going to be able to make good sentences

  • If you can't remember the beginning of the sentence by the time you're at the end of it, right?

  • and

  • so

  • One of the big areas of progress in language models is handling long term dependencies

  • I mean handling dependencies of any kind but especially long term dependencies

  • You've got a sentence that's like "Shawn came to the hackspace to record a video and I talked to..."

  • blank, right? In that situation, if your model is good,

  • you're expecting, like, a pronoun probably, so it's "he", "she", "they",

  • you know, "them", whatever. But the relevant piece of information is the word "Shawn",

  • Which is like all the way at the beginning of the sentence

  • so your model needs to be able to say oh, okay, you know Shawn that's

  • Usually associated with male pronouns, so we'll put the male pronoun in there. And if your model doesn't have that ability to look back

  • Or to just remember what it's just said then

  • you end up with these sentences that

  • Like go nowhere

  • It's just, like, it might make a guess

  • just a random guess at a pronoun and might get it wrong or it might just

  • "...and I talked to" and then just be like

  • "Frank", you know, just, like, introduce a new name, because it's guessing at what's likely to come there and it's completely forgotten that Shawn was

  • ever, like, a thing. So yeah, these kinds of dependencies are a big issue with things that you would want a language model to do

  • But we've only so far talked about

  • Language models for generating text in this way, but you can also use them for all kinds of different things. So like

  • people use language models for translation

  • Obviously you have some input sequence that's like in English and you want to output a sequence in French or something like that

  • Having a good language model is really important so that you end up with something that makes sense

  • Summarization is a task that people often want

  • where you read in a long piece of text and then you generate a short piece of text that's like a summary of it

  • that's the kind of thing that you would use a language model for or

  • reading a piece of text and then answering questions about that text or

  • If you want to write, like, a chatbot that's going to converse with people, having a good language model... like, basically for almost all

  • natural language processing,

  • right, it's useful to have this. The other thing is

  • you can use it to enhance

  • a lot of other language-related tasks

  • So if you're doing, like, speech recognition, then having a good language model helps

  • Like there's a lot of things people can say that sound very similar and to get the right one

  • You need to be like, oh, well, this actually makes sense, you know

  • this word, that sounds very similar,

  • would be incoherent in this sentence; it's a very low probability.

  • It's much more likely that they said this thing which would flow in the language

  • And human beings do this all the time. Same thing

  • With recognizing text from images, you know

  • You've got two words that look similar, or there's some ambiguity or whatever, and to resolve that you need

  • an

  • understanding of what word would make sense there, what word would fit.
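
A sketch of that rescoring idea: an acoustic (or OCR) model proposes several candidates that look or sound alike, and the language model keeps the one that reads as plausible English. It reuses the hypothetical sequence_log_prob / next_word_probs helpers from the earlier sketch:

```python
def pick_candidate(candidates, next_word_probs):
    """Keep whichever candidate word sequence the language model scores
    as most plausible; the candidates themselves would come from the
    speech or image recognizer."""
    return max(candidates,
               key=lambda words: sequence_log_prob(words, next_word_probs))

# e.g. pick_candidate([["recognise", "speech"],
#                      ["wreck", "a", "nice", "beach"]], next_word_probs)
```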

  • If you're trying to use a neural network to do the kind of thing we were talking about before, of having a phone, you know, autocorrect based on the previous word or two:

  • Suppose you've got a sequence of two words going in you've got "so" and then "I" and you put

  • both of these into your network and it will then output, you know

  • like "said" for example as like a sensible next word and then what you do is you throw away or so and you then

  • Bring your set around and you make a new

  • Sequence which is I said and then put that into your network and it will put out

  • like I said - for example would make sense and so on and you keep going around, but the problem is
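
That fixed two-word window, sketched as a generation loop; model(w1, w2) stands in for a hypothetical trained network that returns its guess at the next word:

```python
def generate_with_window(model, first_two, steps=10):
    """Fixed-context generation: the network only ever sees the last two
    words, so the window just slides along as words are produced."""
    w1, w2 = first_two            # e.g. ("so", "I")
    out = [w1, w2]
    for _ in range(steps):
        nxt = model(w1, w2)       # e.g. "said"
        out.append(nxt)
        w1, w2 = w2, nxt          # drop the oldest word, slide the window
    return " ".join(out)
```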

  • But the problem is this length is really short. If you try to make this long enough to contain an entire

  • sentence, just an ordinary-length sentence, this problem starts to become really, really hard,

  • and networks have a hard time learning it, and you don't get very good performance.

  • And even then,

  • you still have this absolute hard limit on how long a thing it can look at; you have to just pick a number

  • that's, like, "how far back am I looking?" A better thing to do is a recurrent neural network, where you

  • give the thing... let's divide that up.

  • So in this case, then, you have a network and you give it this vector,

  • which is just, like, a bunch of numbers, which is going to be, like, the memory

  • for that network, is the idea. Like, the problem is it's forgotten the beginning of the sentence by the time it gets to the

  • end so we've got to give it some way of remembering and

  • rather than feeding it the entire sentence every time you give it this vector and

  • you give it just one word at a time of your input, and

  • This vector, which you initialize I guess with zeros. I want to be clear

  • This is not something that I've studied in a huge amount of detail

  • I'm just like giving the overall like structure of the thing. But the point is you give it this vector and the word and

  • it outputs its guess for the next word and also a

  • modified version of that vector, which you then give it for the next step:

  • the word that it spit out, or the sequence that it's spit out so far, and

  • its own modified version of the vector. Every cycle that goes around, it's modifying this memory
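
What's being described is roughly a vanilla recurrent cell: one word vector in, a guess at the next word out, plus a modified memory vector that gets passed back in. A minimal sketch, with all weight matrices assumed to come from training:

```python
import numpy as np

def rnn_step(x, h, Wxh, Whh, Who, bh, bo):
    """One step of a simple recurrent network.
    x: vector for the current word; h: the memory vector carried over.
    Returns (distribution over the next word, modified memory vector)."""
    h_new = np.tanh(Wxh @ x + Whh @ h + bh)   # mix the new input into the memory
    logits = Who @ h_new + bo                 # scores over the vocabulary
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax -> next-word distribution
    return probs, h_new

# Usage sketch: start the memory at zeros, feed one word at a time,
# and always pass the modified memory back in for the next word.
# h = np.zeros(hidden_size)
# for x in word_vectors:
#     probs, h = rnn_step(x, h, Wxh, Whh, Who, bh, bo)
```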

  • Once this system is like trained very well

  • If you give it the first word, "Shawn", then part of this vector is going to contain some

  • information that's like "the subject of this sentence is the word 'Shawn'", and

  • some other part will probably keep track of like

  • We expect to use a male pronoun for this sentence and that kind of thing

  • So you take this and give it to that and these are just two instances of the same network, and then it keeps going

  • every time

  • So it spits out, like, this is "I", so then the "I" also comes around to here; you might then put out "said", and so on

  • But it's got this continuous thread of

  • memory effectively going through, because it keeps passing the thing through. In principle, if it figures out something important at the beginning of

  • You know

  • The complete works of Shakespeare that it's generating. There's nothing

  • Strictly speaking stopping that from persisting from being passed through

  • From from iteration to iteration to iteration every time

  • In practice, it doesn't work that way because in practice

  • the whole thing is being messed with by the network on every step, and so in the training process it's going to learn

  • That it performs best when it leaves most of it alone and it doesn't just randomly change the whole thing

  • But by the time you're on the fiftieth word of your sentence

  • whatever the network decided to do on the first word of the sentence is a

  • photocopy of a photocopy of a photocopy of a photocopy and so