
  • What is going on, everybody? Welcome to part eight of our chatbot with Python and TensorFlow tutorial series.

  • In this tutorial, what I'd like to do is talk about some of the more high-level concepts and parameters of our chatbot and the neural machine translation code that we're using.

  • And I hope to at least give you a better idea of what's actually going on here, because there's actually a lot going on here.

  • My initial intention was to start with the basic sequence-to-sequence model, the old English-to-French translation model that was done with TensorFlow.

  • But TensorFlow deprecated it, and you really can't run it without running TensorFlow 1.0, and at the time of my filming this we're already at 1.4.

  • So I just figured I'll start with the current NMT model, but it's going to be a lot of information that I'm about to throw at you.

  • So I apologize.

  • Here we go.

  • So, first of all, any time you've got a translation, whether it's neural machine translation from one language to another language, or, in our case, English to English, it's still a form of translation that's going on there.

  • It's just an input language to an output language.

  • Any time you have that, obviously words are not numbers.

  • So the first thing we need to do is tokenize the inputs, and an easy way to tokenize inputs is just to split by space and probably punctuation.

  • Okay, so "I am a student", tokenized: there's a token for "I", a token for a space, a token for "am", a token for a space, a token for "a", and so on.

  • So you tokenize that input, and generally that's what the encoder is going to do: it's going to be tokenizing your input.
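
  • Just as a rough sketch (this is illustrative, not the exact tokenizer the NMT code uses), a split-by-space-and-punctuation tokenizer might look like this:

```python
import re

def tokenize(text):
    # Keep words and punctuation as separate tokens; drop the whitespace.
    # A real preprocessing pipeline may lowercase, handle contractions, etc.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("I am a student."))  # ['I', 'am', 'a', 'student', '.']
```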

  • And then once it's tokenized, the other thing an encoder is likely to do is take those tokens and assign each one an ID.

  • One thing we could do is just assign an arbitrary ID, but that's not ideal.

  • We'd like those IDs to be somewhat meaningful, so we generally are going to create word vectors that take similar words and give them similar IDs.

  • And this is going to help both in actually translating, but also it's going to help us in evaluating how good our translations were, because a lot of times you might be really close.

  • Like, say you've got "airplane" in Japanese and you translate it to "car", versus "airplane" in Japanese and you translate it to, I don't know, "shoe" or "cat" or something like that, right?

  • At least "car" is closer to "plane" than "shoe" or "cat" is.

  • So as we get closer to the correct translation, we want to reward for that; we don't want everything to be equally wrong just because we missed the exact word.

  • So generally we're gonna have word vectors as well.
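
  • A minimal TensorFlow 1.x-style sketch of that embedding idea (the vocabulary size, embedding size, and variable names here are made up for illustration, not taken from the NMT code):

```python
import tensorflow as tf  # TensorFlow 1.x-style API, as used in this series

vocab_size = 10000     # hypothetical vocabulary size
embedding_dim = 128    # hypothetical embedding size

# One trainable vector per token ID; training nudges related words
# toward similar vectors.
embedding = tf.get_variable("embedding", [vocab_size, embedding_dim])

token_ids = tf.placeholder(tf.int32, shape=[None, None])  # [batch, time]
embedded_inputs = tf.nn.embedding_lookup(embedding, token_ids)
```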

  • Once you have those IDs, though, you're going to feed them through some sort of neural network, and with language information this is generally a recurrent neural network; that just makes sense, so we have that sort of non-static, temporal sense going forward.

  • And then within that recurrent neural network we're usually using long short-term memory cells, LSTMs.

  • Then, once you've done that, you feed it through a decoder, and then you've got your output.

  • And that's your basic sequence-to-sequence translation model on language data.

  • Now, let's start talking about some of the problems that we might have while doing that.

  • So, first of all, your input and your output don't even match in length, right?

  • "I am a student" is four tokens, not including spaces, and then the output is three tokens, again not including spaces.

  • Okay, so already there's no match.

  • But also, is your input always going to be those four tokens, or however many tokens you happen to have? No, right? Your input is going to vary.

  • So at least the way that was initially solved was with padding.

  • We might say, okay, what's the longest sentence we've ever had? Let's say it's 100 tokens.

  • So the longest sentence is 100 tokens, and let's say we just got a sentence that's five tokens.

  • Well, what we're going to say is the input layer is always going to be 100 tokens, or 100 nodes, and each node's starting value is just going to be the ID, hopefully a meaningful ID.

  • Then, since we only have five tokens, we put in those first five, and then we have a special token called the pad token, and we just use that pad token for every single node after.

  • So we just do a bunch of padding, and that's one way we can solve it.
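
  • A minimal sketch of that kind of fixed-length padding (the pad symbol and maximum length here are just illustrative):

```python
PAD = "<pad>"
MAX_LEN = 100  # hypothetical longest sentence we allow

def pad_tokens(tokens, max_len=MAX_LEN, pad=PAD):
    # Truncate anything too long, then fill the remaining slots with the pad token.
    tokens = tokens[:max_len]
    return tokens + [pad] * (max_len - len(tokens))

print(pad_tokens(["i", "am", "a", "student", "."])[:8])
# ['i', 'am', 'a', 'student', '.', '<pad>', '<pad>', '<pad>']
```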

  • But of course, that's most likely not a good idea; it doesn't train well and it doesn't perform well, because what ends up happening, since we use padding so much, is that the longer your sentence, the less impact those later words wind up having, because the neural network is going to learn: okay, these pads don't mean anything.

  • So it just adjusts the weights accordingly, and then when we do get a long sentence once in a while, those last few words just don't mean anything.

  • It just doesn't train well that way.

  • So then, again, still on that first sequence-to-sequence translation model: that's also what Charles runs on.

  • If you guys have been following either the Twitch stream, or you follow Charles the AI on Twitter, that's what he has been running.

  • So he's been on that kind of v1 initial sequence-to-sequence model from TensorFlow until very recently, basically until this series started.

  • So the other idea was that we could use bucketing: the idea here was, okay, what we could do is have input layers that are buckets, so we'll take our tokens and have, say, four buckets: a bucket for the stuff that's 5 to 10 tokens long, a bucket for the stuff that's 11 to 25 long, a bucket for 25 to 35, and then a bucket for 35 to 50, or something like that.

  • And then whichever bucket was the smallest one that could hold that string, that's the bucket we would use.

  • And we would train on that, and this did okay, you know, it did all right, but it's not ideal, because we're still going to be using padding within each bucket; see the sketch below.
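
  • A minimal sketch of that bucketing idea (the bucket sizes are just the ones mentioned above, and the helper function is hypothetical):

```python
BUCKETS = [10, 25, 35, 50]  # hypothetical bucket upper bounds

def pick_bucket(tokens, buckets=BUCKETS):
    # Use the smallest bucket that still fits the sentence.
    for size in buckets:
        if len(tokens) <= size:
            return size
    return buckets[-1]  # anything longer would be truncated to the largest bucket

sentence = ["i", "am", "a", "student", "."]
bucket = pick_bucket(sentence)
padded = sentence + ["<pad>"] * (bucket - len(sentence))
print(bucket, padded)  # bucket 10: five real tokens plus five pads
```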

  • So that brings us to today with TensorFlow: we have what are called dynamic recurrent neural networks, and we can have a dynamic-length input coming in.
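
  • In TensorFlow 1.x terms, a rough sketch of that (illustrative names, not the actual NMT code) is a dynamic RNN that takes the real sequence lengths, so the pads stop mattering:

```python
import tensorflow as tf  # TensorFlow 1.x-style API

embedded_inputs = tf.placeholder(tf.float32, [None, None, 128])  # [batch, time, embedding]
sequence_length = tf.placeholder(tf.int32, [None])               # true lengths, excluding pads

cell = tf.nn.rnn_cell.LSTMCell(num_units=256)
outputs, state = tf.nn.dynamic_rnn(
    cell,
    embedded_inputs,
    sequence_length=sequence_length,  # computation stops at each sequence's true end
    dtype=tf.float32)
```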

  • Now, moving back in time a little bit: what we've talked about up to this point, right before dynamic RNNs, was that first sequence-to-sequence stuff from TensorFlow.

  • And now what I'm going to be talking about is the NMT, the current-age NMT, that we're using.

  • So another problem that we have, besides the bucketing and padding, which is basically solved by dynamic recurrent neural networks, is the issue of just language in general.

  • So, for example, let's consider a translation task like English to French.

  • In general, if you want to translate English to French, or English to German, or English to Spanish, there might be slight variance in the syntax of, like, a noun phrase.

  • But in general there's an algorithmic solving of the translation of English to French, German, Spanish, whatever, that's pretty simple, I guess, and it goes in a linear order.

  • But then you have a language like Japanese: English and Japanese are very, very different languages, and they don't follow even remotely similar rules, and in Japanese, sometimes that last character changes the meaning of all the other characters.

  • Things are just totally different, right?

  • And then also with the chatbot the same thing is true, and in a lot of translations the same thing is true: with an LSTM, generally, an LSTM is really only pretty good at remembering about 10 to maybe 20 tokens back in a sequence.

  • Right?

  • So let's say tokens are words; in our case, we are tokenizing to words.

  • Another option for tokenizing is to tokenize by character, so each little character, a, b, c, d, on its own.

  • Or you can tokenize by, kind of like, almost... I'm trying to think of the word.

  • I can't think of the word. Syllable? Anyway, I'm totally blanking; my brain is done. I think it's syllable, just chunks of letters.

  • Basically chunks like "har-ris-son" or "har-riss-son", right?

  • You could tokenize by those kinds of little bits, and you'll have far fewer unique tokens; that's BPE tokenization.

  • We might be talking about that later on, but for now, we're tokenizing by word.

  • So anyway, think of it by words.

  • So think of how hard it would be for you if you could only remember, if you could only respond to, 10 to 20 tokens max at a time, and really more like 10.

  • Okay, so think about if you needed to build your response 10 tokens at a time, and build out your response as you slid a window of 10 tokens along.

  • So, "think about how hard it would be if you" — okay, start building a response; you don't even know what the rest of the sentence is yet, right?

  • Then the window keeps sliding, to "if you could only" or something like that, right? You keep sliding it, and you have to build a response; the neural network must generate a response.

  • So this can be very challenging if we're only looking back historically, and also only able to remember 10 to 20 tokens.

  • The other issue is with a sentence like "think about how hard it would be if you could only think in terms of 10 to 20 tokens at a time": the first part, "think about how hard it would be", is almost like a filler.

  • In order to understand that first phrase, you have to get that last bit too.

  • But you also couldn't take just the last bit: if all you could see was "if you could only think in terms of 10 to 20 tokens", right, now you don't know how to respond to that either.

  • "If you could only"... what do you mean? You have to go back to the "think about how hard it would be" part, right?

  • And so this is where we have two new concepts coming in here, the first of which is using future data as well, with bidirectional recurrent neural networks.

  • So in a bidirectional recurrent neural network, we're going to feed data both sequentially forward and also in reverse order backwards through that hidden layer in the model.

  • So that's one thing.

  • And then we're also going to make use of attention models, which are covered in the NMT tutorial; there's also a great paper on them.

  • So that explains attention models a little bit, and then I also took an image from the paper on attention models, which I think pretty much drives the entire point home.

  • So this is a graph of BLEU score against sentence length; think of sentence length as, like, your tokens.

  • This red line here is basically no use of an attention model, and a BLEU score, bilingual evaluation understudy, is basically just a score of how good a translation was, so the higher the better.
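
  • If you want to play with BLEU yourself, NLTK ships an implementation (this is just a rough illustration with made-up sentences; it's not how the NMT code computes its BLEU):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["i", "am", "a", "student"]]  # list of acceptable reference translations
candidate = ["i", "am", "a", "teacher"]    # model output to score

# Smoothing avoids zero scores on short sentences with missing n-grams.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(score)  # closer to 1.0 means closer to the reference
```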

  • And as you can see, the model is pretty good at 10; even just before 20 it does its best BLEU score, but then it very quickly falls off, especially as you get to the longer sentences, which is problematic.

  • We tend to speak in pretty long sentences, and sometimes you need multiple sentences to understand the meaning.

  • So then these other lines are the BLEU scores with some attention models applied, and the real thing to drive home here is that it helps both on the very short stuff, but also on the longer sequences.

  • It still basically flatlines after 40-ish; there's a slight decline there, but it pretty much holds up all the way out to 70, and probably further than that.

  • So the attention model is going to help us remember longer sequences at a time, which can help us kind of brute-force our way through this context problem where we need both historical and future information.

  • But then we use bidirectional recurrent neural networks to kind of mix it up.

  • Because language is not necessarily only in perfect sequence: sometimes we have to use context, and we have to hear the full sentence before we can go back and respond to words that we heard leading up to that point.
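
  • For a sense of what wiring attention into a decoder looks like in TensorFlow 1.x (a rough sketch with illustrative names; the NMT code handles this for you):

```python
import tensorflow as tf  # TensorFlow 1.x, using the contrib.seq2seq helpers

num_units = 256
encoder_outputs = tf.placeholder(tf.float32, [None, None, num_units])  # [batch, time, units]
source_lengths = tf.placeholder(tf.int32, [None])

# The attention mechanism scores every encoder step for each decoder step,
# so the decoder isn't limited to a single fixed-size summary of the input.
attention = tf.contrib.seq2seq.LuongAttention(
    num_units, encoder_outputs, memory_sequence_length=source_lengths)

decoder_cell = tf.nn.rnn_cell.LSTMCell(num_units)
decoder_cell = tf.contrib.seq2seq.AttentionWrapper(
    decoder_cell, attention, attention_layer_size=num_units)
```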

  • So real quick, what I'd like to do is just visualize a bidirectional recurrent neural network, so let's go do that really quick.

  • On a simple recurrent neural network, you have your input layer, your output layer, and then we'll have one hidden layer for simplicity's sake.

  • Then your connections go from the input layer to the hidden layer, where each node in the hidden layer also passes data down to the next hidden-layer node, which is how we get our temporal, not-so-static characteristics from recurrent neural networks, as the previous inputs are allowed to carry forward to some degree down through that hidden layer.

  • On a bidirectional recurrent neural network, basically, the hidden layer has data that goes both down and up, in both directions, through that hidden layer.

  • So you still have your input layer, you're still going to have your output layer, and you're still going to have the connections from the input to the hidden and from the hidden to the output.

  • But then also in that hidden layer, you've got data that goes, in this drawing, down and then up, or forward and then reverse, depending on which drawing you're looking at.

  • And then from here, in theory, this is actually fully connected, just because, again, it's all one hidden layer; all those nodes are part of the same hidden layer.

  • But then you might also see more fancy types of connections, just to get more than simply forward and reverse, to squeeze a little bit more complexity out of the network while we're there.

  • But anyways, that's the difference between a simple recurrent neural network and a bidirectional recurrent neural network.

  • Because in a lot of tasks, it's not just what happened leading up to a certain point; we actually care what happens after that point as well.

  • So it's still important for us to actually go both ways.
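
  • In TensorFlow 1.x terms, the bidirectional version is just a forward cell and a backward cell run over the same inputs (again, a rough sketch with illustrative names, not the actual NMT code):

```python
import tensorflow as tf  # TensorFlow 1.x-style API

embedded_inputs = tf.placeholder(tf.float32, [None, None, 128])  # [batch, time, embedding]
sequence_length = tf.placeholder(tf.int32, [None])

fw_cell = tf.nn.rnn_cell.LSTMCell(256)  # reads the sequence left to right
bw_cell = tf.nn.rnn_cell.LSTMCell(256)  # reads the same sequence right to left

(outputs_fw, outputs_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    fw_cell, bw_cell, embedded_inputs,
    sequence_length=sequence_length, dtype=tf.float32)

# Concatenate both directions so every timestep sees past and future context.
encoder_outputs = tf.concat([outputs_fw, outputs_bw], axis=-1)
```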

  • All right, so now what I want to do is cover some of the training metrics that you're going to come across and give you an idea of what, at least, I've found to be useful as far as training the model, tweaking it as time goes on, and knowing when it's done.

  • So the first thing I want to do is show you TensorBoard.

  • Hopefully you're familiar with TensorBoard already, but if not, here's what you can do to run it.

  • Basically, it's just a way you can visualize how training has been going, or how it's going right now; you can visualize a lot of scalars, and there are some other things we can do, and I'll show you one of them.

  • So first of all, to run TensorBoard, you can run this while your algorithm is training.

  • What we want to do is go into the model directory.

  • Don't worry if you don't have all these same directories; I've made lots of copies of lots of things as I'm training.

  • So anyway, head into model, and then train_log is where you have all your logging files.

  • You'll probably just have event files like this; you won't necessarily have these other files.

  • So from model, what I'm going to do is open up a command prompt, just by typing cmd in there, and then just type tensorboard --logdir=train_log, and then you would hit enter.

  • Now, actually, I'm going to hit Ctrl+C; I actually already have it up.

  • I wanted to bring it up beforehand because it can take a little bit to load, and I'm loading like 100,000 steps.

  • So anyway, it's just tensorboard --logdir=train_log; now, once you have it up, head over to it.

  • This is my TensorBoard for Charles v2, so he was trained with, I think, about three million pairs.

  • And right now I'm training a model with, like, 70 million pairs, so hopefully that one will be better than this one.

  • But anyway, this is the TensorBoard and the information that I had.

  • So the big thing you want to pay attention to is the BLEU score; BLEU is probably the best determining factor in how good a translation was.

  • The problem is, we're not really doing translations.

  • Basically, when you're translating English to French, for example, in general there's either just one proper translation, or maybe there are three or five.

  • But when you're translating English to English, when you're doing a chatbot, like an input comment to an output response, there are really infinite responses for any given comment.

  • There's no limit to what the response is, and unless it's not coherent, unless it doesn't make any sense at all, it's a valid response.

  • So for me, the BLEU score is relatively useful; we'd like to see it go up, but I don't think we're going to see a BLEU score of, like, 20.

  • Okay, we're probably going to see BLEU scores around 3 or 4 max.

  • Like, there's really no reason why we would see a super high BLEU score, unless maybe we overfit too much.

  • Like, if you trained, I don't know, 5,000 epochs or something like that, right?