In the previous video we were talking about
transformers, this architecture that uses attention to give
unprecedentedly good performance on language modeling tasks and some other tasks as well.
But we were looking at language modeling, and that was in preparation to make a video about
GPT-2, which is this giant language model that was recently,
well, it was recently not released, actually, by OpenAI. The way that they generated the data set for this is pretty cool.
To get enough text they went to Reddit and
they pulled every website that is linked to from Reddit. Do we have any idea of how many that is? Lots.
Literally everything? Well, everything that had more than three karma,
I think, or maybe more than two karma, something like that.
Anything that somebody had thought to post to Reddit, and that at least two or three people had thought was good enough to upvote,
they scraped the text from that. It's pretty much just a transformer.
The architecture is not especially novel. They haven't made any amazing
new discovery, but
what they realized was, with
transformers it seems like
the more data you give them the better they do, and the bigger you make them the better they do.
Everything that we've built up until this point clearly
hasn't hit the limits of what this can do.
They thought, we're probably
bottlenecked on data and maybe network size.
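The link-filtering step described above can be sketched in a few lines. This is only an illustration, not OpenAI's actual pipeline: the `Post` record, the `select_urls` helper, the example URLs, and the exact threshold of 3 are all assumptions (the speaker himself is unsure whether the cutoff was two or three karma), and de-duplicating URLs is an extra step added here for tidiness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    """A hypothetical Reddit submission: an outbound link plus its score."""
    url: str
    karma: int


def select_urls(posts, min_karma=3):
    """Keep each outbound URL once, if some post linking to it meets the threshold.

    min_karma=3 is an assumed value, not a confirmed one.
    """
    seen = set()
    kept = []
    for post in posts:
        if post.karma >= min_karma and post.url not in seen:
            seen.add(post.url)
            kept.append(post.url)
    return kept


# Made-up example data.
posts = [
    Post("https://example.com/a", 5),
    Post("https://example.com/b", 1),   # too little karma, dropped
    Post("https://example.com/a", 7),   # duplicate link, kept only once
    Post("https://example.com/c", 3),
]

print(select_urls(posts))
```

The pages behind the surviving URLs would then be scraped for their text, which is the part this sketch leaves out.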
So what happens if we turn that up to 11? What happens if we just give this all
the data and make a really big one? It makes sense to talk about the acronym, right? So it's a Generative Pre-trained
Transformer. Generative, same as in generative adversarial network: it generates outputs, it generates samples.
Pre-trained is this thing I was talking about. Think of all of the different things
you can use a language model for, right? You can do translation. You can try and resolve ambiguities.
You can do summarization. You can answer questions. You can use the probabilities for augmenting other systems.
So yeah, there's a bunch of different benchmarks for these different tasks
that you might want your language model to do.
This is what we talked about in the grid worlds video: having these standardized problems with standardized metrics and standardized data sets,
so that if you're comparing two different methods, you know that you're actually comparing apples to apples.
And this is very important: it gives you numbers on these things. That's often quite difficult,
especially when you're generating samples of text and asking, how plausible is this text? How realistic does it look?
How do you put a number on that? It's kind of difficult. So there's all of these standardized metrics, and
the thing that
people came to realize, and I say that as though it's some amazing discovery, but
it's fairly obvious, is this: if you train your system in an unsupervised way on a large corpus of just general English text, and
then you take that and
train it with the data from this benchmark or the data from that benchmark,
you can fine-tune it. So you start with something which has a decent
understanding of how English works, more or less, and then you say, now I'm going to give you these
samples for, like, question answering, or, I'm going to build a system using that to go for this benchmark.
So it's pre-trained: you start with something that's a general-purpose language model, and then from that you
fine-tune it to whichever
actual benchmark or problem you're trying to solve.
And this
can give you better performance than starting from nothing and training on each of the benchmarks from scratch.
Make sense?
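To make the pre-train-then-fine-tune idea concrete, here is a deliberately tiny sketch. It is not how GPT-2 or any neural model is trained: the "model" is just a table of word counts with add-one smoothing, "training" means adding counts, and the three little texts are invented for the example. It only illustrates the shape of the argument: starting from a general-purpose model and continuing on a small task data set can beat training on the small task data set alone.

```python
import math
from collections import Counter


def train(counts, text):
    """Return a new count table: the existing counts plus the words in text."""
    updated = Counter(counts)
    updated.update(text.split())
    return updated


def avg_nll(counts, text, vocab):
    """Average negative log-likelihood per word, with add-one smoothing."""
    total = sum(counts.values())
    words = text.split()
    loss = 0.0
    for word in words:
        prob = (counts[word] + 1) / (total + len(vocab))
        loss -= math.log(prob)
    return loss / len(words)


# Invented toy data: a "general" corpus and a much smaller "benchmark".
general_text = "the cat sat on the mat the dog sat on the rug"
task_train = "the cat sat"
task_test = "the dog sat on the mat"
vocab = set((general_text + " " + task_train + " " + task_test).split())

pretrained = train(Counter(), general_text)    # general-purpose model
fine_tuned = train(pretrained, task_train)     # continue training on task data
from_scratch = train(Counter(), task_train)    # task data only

print("fine-tuned  :", round(avg_nll(fine_tuned, task_test, vocab), 3))
print("from scratch:", round(avg_nll(from_scratch, task_test, vocab), 3))
```

On this toy data the fine-tuned table gets the lower (better) loss on the held-out task text, because the from-scratch table has never seen most of the words.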
And so
the point of the GPT-2 paper, the thing that makes it cool, is they said, okay, if we make a really huge one,
what if we
don't
fine-tune it at all?
What if we just make a giant model and then just try and run it on the benchmarks without messing with it,
without showing it any of the specialized data for that benchmark? Just the raw
general-purpose language model: how does that perform? And it turns out,
surprisingly well. So this is a
very, very large data set for text.
It's about 40 gigabytes,
which
actually doesn't sound like very much, but for text, that's insane, right?
Somebody said that this was the size of
Google's entire index of the Internet in '98.
So, yeah, it's a lot of text.
And they trained it on that, and they ended up with a
1.5 billion parameter model. A previous state-of-the-art system was 345 million;
this is 1.5 billion.
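For a rough sense of scale, the two parameter counts just quoted can be compared directly. The storage figure below assumes 4 bytes per parameter (32-bit floats), which is an assumption made for this back-of-the-envelope estimate and not something stated in the video.

```python
new_params = 1.5e9   # GPT-2, as quoted above
old_params = 345e6   # the earlier system, as quoted above

ratio = new_params / old_params

bytes_per_param = 4  # assumption: 32-bit floating point weights
weights_gb = new_params * bytes_per_param / 1e9

print(f"{ratio:.1f}x as many parameters")
print(f"about {weights_gb:.1f} GB just to store the weights")
```

So the jump is a bit more than a factor of four in parameter count.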
So they've just made the thing much, much bigger, and it performs really well. Some of the samples that they published quite
captured the public imagination,
you could say. And now that we've talked a little about the problems that
neural networks, or any language model really,
have with long-term dependencies,
we can realise just how impressive these samples are. Because
if you look at them uninitiated, you're like, yeah, that's pretty realistic,
it seems to make sense, and it's cool. But when you look at it knowing how language models work, it's
very impressive: the coherence and the
consistency and the long-range dependencies. So we can look at this one that got everybody's attention, the unicorns one,
right?
So they prompted it with: "In a shocking finding, scientists discovered a herd of unicorns
living in a remote, previously unexplored valley in the Andes Mountains.
Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
And from there you go to your language model, GPT-2, and you say, given that we started with this,
what's the next word, and what's the word after that, and so on?
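That "what's the next word, and the word after that" loop can be sketched with a toy stand-in for the model. Everything here is an assumption for illustration: `train_bigram` and `generate` are made-up helpers, the training text is invented, and the "model" is a bigram table that only looks at the single previous word. GPT-2 conditions on the whole preceding text, which is exactly what makes its samples coherent; the loop structure is the part that carries over.

```python
import random
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for every word, which words follow it and how often."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def generate(counts, prompt, max_new_words=10, seed=0):
    """Extend the prompt one word at a time by sampling from the counts."""
    rng = random.Random(seed)
    words = prompt.split()
    for _ in range(max_new_words):
        options = counts.get(words[-1])
        if not options:
            break  # the last word was never seen with a successor
        candidates = list(options.keys())
        weights = list(options.values())
        # Sample the next word in proportion to how often it followed.
        words.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(words)


# Invented training text for the toy model.
corpus = (
    "the unicorns spoke perfect english and the scientists "
    "found the unicorns in the valley"
)
model = train_bigram(corpus)
print(generate(model, "the unicorns", max_new_words=8, seed=1))
```

Each generated word is appended to the text and becomes part of the input for the next step, which is the same feedback loop used when sampling from a large language model.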
So it goes on: "The scientist named the population, after their distinctive horn, Ovid's Unicorn.
These four-horned, silver-white unicorns were previously unknown to science."
We do have a clue here, as a human being: unicorns, four-horned, doesn't quite make sense.
But nonetheless, we're going, okay.
"Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved. Dr.
Jorge Pérez,"
that's J-O-R-G-E, "an evolutionary biologist from the University of La Paz".
This is impressive because we've mentioned the Andes Mountains in our prompt, and so now it's saying, okay,
this is clearly, you know, "in a shocking finding": this is a science press release news article.
It's seen enough of those, because it has every single one that was ever linked to from Reddit, right?
So it knows how these go. It knows, okay, third paragraph:
this is when we talk about the scientist, we interview the scientist, right? Okay.
First word of the scientist paragraph: "Dr.", obviously, right? Because now we're in the name of the scientist.
What name are we going to give?
It needs to be a name
conditioned on the fact that we have the Andes Mountains.
So we're in South America;
the name probably should be Spanish, or maybe Portuguese.
So we get Dr. Pérez here.
And then "evolutionary biologist" makes sense because we're talking about animals.
"From the University of La Paz": again,
this is the first sentence. When you have that first clause that introduces the scientist, you always say where they're from.
So we say "from the University of", and then university names tend to be the name of a city.
What's a city where we have the Andes Mountains? So we're going to Bolivia: La Paz. Perfect.
And the thing that's cool about this is it's remembered all of these things that were quite a long time ago, several sentences ago.
Well, it hasn't remembered them. It's paid attention to them across that distance, which is impressive.
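Here is a minimal sketch of why distance is not the obstacle for attention. It computes scaled dot-product attention weights for one query over a handful of keys. The vectors are hand-made for the example (in a real transformer they are learned, there are many attention heads, and position information is added separately), so treat the numbers as hypothetical; the point is only that the weight depends on how well the query matches a key, not on how far back that key sits.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale
        for key in keys
    ]
    return softmax(scores)


# Hypothetical key vectors for six earlier positions in the text.
keys = [
    [3.0, 0.0],  # position 0: imagine this is "Andes", far back in the prompt
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],
    [0.0, 1.0],  # position 5: the most recent word
]
# Hypothetical query: "which place were we talking about?"
query = [3.0, 0.0]

weights = attention_weights(query, keys)
for position, weight in enumerate(weights):
    print(position, round(weight, 3))
```

Almost all of the weight lands on position 0, the oldest entry, because it is the one that matches the query.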
But also this is encoding a bunch of information about the real world,
right? All that it was given, all it knows, is statistical relationships between words. But the way that it comes out to us
is that it knows
where the Andes Mountains are, what kind of names people in that area have, what their cities are, what the universities are, all of those
facts about the real world. Because in order to have a really good language model, it turns out you have to kind of implicitly encode
information about the world, because
we use language to talk about the world, and knowing what's likely to come next
requires actual real-world understanding. And that's something that we see in some of the other
things that they got it to do: you can see the real-world understanding coming through.
Let's keep going
"...University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez", see,
we're hanging on to him. Yep, we're referring to him again,
but now we've changed it to be just the surname, because that's the
format that people use in news articles. "Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of
rock and silver snow. Pérez and the others then ventured further into the valley." Around about here in our article
we should have a quote from the scientist, right? Quote:
"By the time we reached the top of one peak, the water looked blue, with some crystals on top." And we're talking about this fountain,
I guess, this natural fountain. We're referring back to the previous bit. It's like everything is
relying on and contingent on earlier parts of the text. There's a snipped paragraph here. "While examining these bizarre
creatures the scientists discovered that the creatures also spoke some fairly regular English." Now, when I read that, I'm like, okay,
this is now unusually good, because that's the second sentence of the lead, right? We're six paragraphs in, and
it knows, at about this point, I've covered the first sentence of this
initial paragraph,
now it's time to talk about this second sentence of the lead, "even more surprising to the researchers was the fact that they
spoke English". And
it completely ignored the speaking English part until it got to the part of the news article where that comes in.
You've gone six whole paragraphs.
The idea of
accurately remembering that the unicorns speak perfect
English, that's very impressive to me. And then it gets a little bit unhinged:
it starts talking about, "it's likely that the only way of knowing for sure if unicorns are indeed
the descendants of a lost alien race is through DNA". That's Reddit, really.
Well, it's not actually stuff on Reddit, it's stuff linked to from Reddit. But yeah, this is news articles, man.
"They seem to be able to communicate in English quite well,
which I believe is a sign of evolution, or at least a change in social organization," said the scientist.
That's his evolutionary biology there, right? Right, yeah, we know he's an evolutionary biologist. So the
coherence of this text is
really dependent on its ability to
condition what it's generating on
things that it generated a long time ago.
So yeah
So it can generate really nice news articles, and it can generate all kinds of text, anything that is
sufficiently well represented in the original data set. So that's GPT-2. It's a really
unusually powerful and versatile
language model that can do all of these different natural language processing
tasks without actually being trained specifically on those tasks.
And that's why it's impressive.
It's not that it's a brand new architecture or a brand new approach or whatever.
It's just that when you make these things really huge and give them tremendously large amounts of data,
the results are really impressive.
In the original data set. So it will write you Lord of the Rings fan fiction,
it will write you cake recipes. There's all kinds of examples of different samples. Here's a recipe for
some kind of peppermint chocolate cake, and it's got a bunch of different