Creating A Reinforcement Learning (RL) Environment - Reinforcement Learning p.4

  • What is going on, everybody? And welcome to part four of the reinforcement learning series. What we're gonna be doing in this video is building our own Q-learning environment.

  • So when I first learned Q-learning, it was useful to use the OpenAI Gym environments.

  • But the first thing I wanted to do was make my own environment, and I didn't really intend to make a tutorial out of it, but people were asking.

  • And then it's kind of like, Well, obviously that's what I wanted to do.

  • So of course, that's what other people probably want to do, too.

  • So anyway, here we are.

  • Let's make our own environment.

  • So I'm actually just gonna kind of run through the environment that I made and explain it as I go.

  • So we're just gonna go through the program linearly.

  • Ah, but if you have questions, whatever, as always, comment below, join the Discord at discord.gg/sentdex, um, and feel free to ask any questions. But it should be pretty self-explanatory; it's not a super complicated environment.

  • And just in case, I guess I'll explain it a little bit before we get in. So, first of all, I love blobs; I use blobs for, like, examples of everything.

  • So if you haven't noticed by now, um, I'm a blob-o-phile. Anyway, I always wanna make blob things.

  • And so the idea was to make blobs, and in this case, you've got a player blob and a food blob.

  • That's the objective we're trying to achieve.

  • And then just for, uh, you know, complexity, I added an enemy blob as well.

  • So the idea is that you have this player blob, and we'll start by having the enemy and the food not moving, just staying stationary; they just kind of initialize in random positions.

  • And then the player blob has to move to get to that position, and then later we can add in movement.

  • It's just that letting the things move is kind of irrelevant to the model learning how to actually move, except in one case: if the enemy is moving and you're moving, you two could move into each other.

  • If the enemy is not moving in training, you might learn that it's totally fine to get close to the enemy as long as you don't hit the enemy.

  • But if the enemy can also move, they could collide.

  • But anyway, not really worried about that.

  • But you guys should feel free to tinker with that.

  • Anyway, let's get into it.

  • So we're gonna be using OpenCV.

  • So the first thing that I want to do is open up a command prompt, and let's go ahead and pip install... python-opencv? Let's make sure we have that right; it might be opencv-python. Yes, opencv-python.

  • Okay, I feel like I always get that wrong anyway.

  • Okay: pip install opencv-python.

  • If for whatever reason python-opencv did install for you, you've probably just installed something really bad. So don't use it; get rid of it. Moving along.

  • So, uh, all right, so we are also going to use NumPy, but you should already have that.

  • And then we're going to use the Python Imaging Library, and I want to say that would be a pip install pillow.

  • This is a virtual machine, and it should be clean. So if that's not right, um, we'll find out soon enough.

  • But we'll grab Pillow, and then to use Pillow you import PIL, capital P-I-L. So that should be everything we need.

  • Let's go ahead and import numpy as np; we'll use NumPy for the array type of stuff. From PIL we're going to import Image, with a capital I. Import cv2. We're gonna import matplotlib.pyplot as plt.

  • We're going to import pickle, and that's to save and load our Q-table.

  • Then, from matplotlib, we'll import style, just to make our graph pretty, and we'll do style.use("ggplot").

  • Uh, and then finally, we'll import time.

  • We're going to use time purely to set dynamic Q-table file names, so they have, like, some order.

  • What's your deal? Oh... oh, it's a typo of style, huh?

  • Okay, great.
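
Put together, the imports narrated above come out to something like this (a sketch of the script so far, assuming opencv-python and pillow were installed as described):

```python
import numpy as np                # array math and random numbers
from PIL import Image             # to turn our grid array into an image
import cv2                        # OpenCV, to display and resize that image
import matplotlib.pyplot as plt   # to graph rewards at the end
import pickle                     # to save/load the Q-table
from matplotlib import style      # just to make the graph pretty
import time                       # for timestamped Q-table file names

style.use("ggplot")
```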

  • So the first thing I'm gonna do is just try to run this real quick, make sure it actually runs.

  • It does.

  • Okay, so all our imports worked.

  • So now we're gonna have some... oops.

  • Did I just delete that?

  • Totally did.

  • Okay, so now we're gonna start with some of our constants.

  • So we're going to say SIZE, and we'll say this is 10.

  • So we're going to skip the whole make-a-huge-environment-and-then-boil-it-down thing, with action or observation spaces... discrete observation spaces.

  • We'll just skip that step.

  • I'm just gonna make it a grid.

  • So in this case, it's going to be a 10 by 10 grid.

  • So the player, the food, and the enemy will be initialized at a random location on that 10 by 10 grid.

  • But we can change this.

  • As time goes on, I'll talk about how some of those changes will impact things.

  • But obviously, as you increase this size, especially depending on the size of your action space as well, that is going to just exponentially explode the number of, ah, you know, possible combinations in your Q-table.

  • So, anyway, we'll start with a 10 by 10.

  • That should be pretty simple.

  • How many episodes? HM_EPISODES, that's 25,000. We're also going to give a MOVE_PENALTY; that's gonna be a 1.

  • And an ENEMY_PENALTY.

  • So this is if we, if we hit the enemy; well, it's a, it's a 300, so we'll subtract that penalty.

  • Basically, we're going to say a FOOD_REWARD of 25.

  • I haven't really decided where I want this. I don't know if I really want it to be 10, like we had with the mountain car, or 25; haven't really decided.

  • I don't really know what's the best value to use, so I'm just throwing in 25.

  • Um, I haven't noticed anything, you know, huge telling me one way or the other.

  • Um, we're gonna have epsilon, in lower case, because it's gonna change over time. We'll start at 0.9.

  • Another thing that we could change over time: EPS_DECAY, so that's the epsilon decay. I went with 0.9998, and the way I came up with this was I just made a for loop and literally just decayed it.

  • So I just took epsilon, times... I'm sorry: for i in range, say, 25,000, multiplied by this, and I just kind of looked at it and was like, okay, that looks good.

  • So I literally just pulled that number out of nowhere.
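
Something like the following would reproduce that eyeballing (my sketch, using the numbers he settles on):

```python
# Quick sanity check of where epsilon ends up if we decay it every episode;
# this is the "eyeball it in a for loop" approach described above.
epsilon = 0.9
for i in range(25_000):
    epsilon *= 0.9998
print(epsilon)  # roughly 0.006, i.e. almost no random moves by the end
```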

  • Moving along.

  • Um, now we'll say SHOW_EVERY. This is just like before: how often do we want to show?

  • I'm gonna say every 3,000 episodes. If I recall right, this environment is actually a little quicker than, uh, the mountain car one.

  • Okay: SHOW_EVERY, 3000.

  • Okay, so now we'll have a variable here: start_q_table. And for now, we'll say None, but it could also be a file name.

  • So if you happen to have an existing Q-table that you want to, like, load in and train from that point, you would throw that right in there.

  • Now, you might want to do this for a variety of reasons: you want to continue training, or... at least, we haven't really coded in a good way to, like, decay epsilon to zero, train for a little bit, and then reintroduce epsilon.

  • We don't really have a good way to do that. So if you really wanted to do that, you could decay epsilon to zero, or train for a little bit with very low or no epsilon, and then load it back in, add epsilon, and continue.

  • And I've actually found that that works really well: to, like, decay epsilon, let it go for a little bit, and then set epsilon back to 0.9 or something and decay again, then let it train for a while, then decay again.

  • And keep repeating that; that actually seems to learn pretty well.

  • So anyway, for now it's None, because we don't have a Q-table.

  • But if we had one, we'd just pass the file name.

  • So now we'll throw in LEARNING_RATE; we're going to say 0.1. DISCOUNT we'll set, again, to 0.95. These two variables I really haven't played with much, so I couldn't really tell you how they're gonna impact things.

  • But now that you know how to do analysis from the previous video, feel free to tinker with them.
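
Collected together, the constants narrated so far look roughly like this (names follow the narration):

```python
SIZE = 10             # 10x10 grid
HM_EPISODES = 25_000  # how many episodes to train
MOVE_PENALTY = 1      # subtracted every step
ENEMY_PENALTY = 300   # subtracted if we hit the enemy
FOOD_REWARD = 25      # granted if we reach the food

epsilon = 0.9         # lower case: it changes over time
EPS_DECAY = 0.9998    # multiplied against epsilon every episode
SHOW_EVERY = 3000     # how often to render an episode

start_q_table = None  # or a filename, to resume from a saved Q-table

LEARNING_RATE = 0.1
DISCOUNT = 0.95
```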

  • Uh, now what I'm gonna do is give a PLAYER_N equals 1; we're gonna say FOOD_N is 2, and ENEMY_N, 3.

  • These are just numbers for labels, and keys, rather, in a dictionary. That's it.

  • Again, it's just definitions for what number these things represent in a dictionary.

  • So we're just gonna say d equals, and then we're gonna have 1 colon something; I'll just fill it in. So these will be the colors: 255, 175, 0.

  • Now, I'm actually defining these colors in BGR format, uh, even though I was really trying to use RGB.

  • So a side project is: if anybody can tell me why it's BGR... I don't know. Maybe I'll figure it out as we go, but that was a problem I hadn't solved yet. Anyway, 255, 175, 0: so the player is mostly blue, some green, so it's kind of like a light-ish blue.

  • Um, food is 0, 255, 0, so full green; the food is green. And then the enemy will be 0, 0, 255, so in BGR that's maximum R, so it'll be very red.

  • Cool.
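
In code, those ids and the color dictionary (in BGR order, as discovered the hard way later on) would be something like:

```python
PLAYER_N = 1  # player key in the dict
FOOD_N = 2    # food key in the dict
ENEMY_N = 3   # enemy key in the dict

# colors keyed by the ids above; these end up behaving as BGR when shown via cv2
d = {1: (255, 175, 0),   # player: light-ish blue
     2: (0, 255, 0),     # food: green
     3: (0, 0, 255)}     # enemy: red
```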

  • So, um, you don't like that, huh?

  • All right.

  • I'm surprised... actually, it doesn't surprise me that it's accepting my spacing of the dictionary.

  • I don't think that's PEP 8. Maybe it is.

  • It must be, if it's accepting it.

  • Uh, okay, so now what we need is a Blob class, because really, all of these blobs are gonna have a lot of the same attributes.

  • At least they're going to need to be able to move.

  • They're going to need a starting location like they need to be initialized randomly.

  • They're gonna need to be able to be moved.

  • And then later... basically, for our observation, I didn't really want to use absolute positions, because you would have, like, a huge observation space if you needed to pass the location of every blob.

  • I felt like the problem would be much more complex, I guess, if you passed the physical location of everything.

  • So my plan is, instead, that the observation space is actually going to be the relative position of the food, and then the relative position of the enemy, to the player. That's gonna be the observation.

  • So to get that, we need to be able to perform some sort of, ah, operation that will produce that value.

  • So anyway, we're gonna make a Blob class so that we can handle that.

  • I get asked all the time.

  • Hey, why don't you use object-oriented programming, when you're coding all the time? It's like, well, I just use it where it's actually useful.

  • And in this case, it seems to me that making a Blob class is actually useful.

  • So anyway: class Blob.

  • So, define our __init__ method.

  • And if you don't know object-oriented programming in Python, um, I would go to pythonprogramming.net, go to Python Fundamentals, Intermediate Python, and in that series we get into object-oriented programming. I would strongly recommend you go through it if this is confusing to you.

  • Uh, and then we're also going to do some operator overloading and stuff like that.

  • So, uh, you might want to check that out too, if that's confusing to you.

  • So, our __init__ method: this is just gonna run immediately.

  • We're gonna say self.x is going to be equal to np.random.randint between 0 and SIZE, all caps, and then we're gonna do the same thing for self.y.

  • So that will just initialize randomly.

  • Um, it's gonna cause a slight issue. In one case, the player could spawn right on the food; that would be lame.

  • The other one is, the player could spawn right on the enemy. That would also be lame.

  • And finally, the enemy could spawn on the food, or the food on the enemy, whichever way you want to look at it. Glass half full, half empty kind of thing.

  • Um, it kind of stinks that that's a possibility, but I'm not gonna handle that, because I don't really feel like writing code for it.

  • Uh, fine.

  • We're going to make a __str__ method. Um, I used this for debugging purposes, but I'm not even sure we need it now.

  • But anyway, I'm gonna return an f-string of, um, self.x, and then we'll just say comma, uh, self.y.

  • So if we want to print a blob, it'll just print the blob's location.

  • So now what we want to do is override... define the subtraction operator here. Um, operator overloading; someone will correct me, actually.

  • So this just allows us to subtract a blob from another blob. So, in fact, we need another blob; so I'm gonna say other.

  • So when we subtract, when you do a minus, then that other thing will get passed here.

  • And then in here, we're going to say return, and in this case, in parentheses, that's self.x minus other.x, comma, self.y minus other.y. Cool.

  • It looks like PEP 8 doesn't care either way if you want a space here or not; I'm gonna keep the space.

  • Okay, so that's our subtraction operator overloading. Operator overloading; gonna go with that.

  • Now I'm gonna define... basically, this blob needs to... the player is going to take a discrete action, right?

  • So we're gonna have an action method, but really, the player is the only one that's gonna use it.

  • But there might be a time where you wanna have many players. Um, so I'm just gonna go ahead and do this, even though in this Q-learning we're not gonna have many players; I can't really see a good reason why I would do that, but later, we might want to do that.

  • And then, um, so action will actually interact with another method; I'm gonna pass for now, and we're gonna call this the move method.

  • You know, it'll be self, and then probably x and y. Um, and in fact, I'm gonna set x and y to default to False before I forget.

  • Um, okay. So in action, we want to be able to pass a discrete value, so we'll say self, choice.

  • And then we're just going to say: if choice equals 0, we're gonna start taking, um, some action.

  • So we're gonna say self.move, and we're gonna say x equals 1, y equals 1. So it's gonna move one and one.

  • Um, now what I'm gonna do is just copy this, paste this, and this will be elif choice equals 1, and there we're going to say negative 1 and negative 1.

  • And then again, we're just gonna copy, paste, paste: set this one to 2 and this one to 3, and then negative 1 and 1, and then 1 and negative 1. And those are going to be all the possible choices.

  • I wanted to do this just to keep the action space relatively small, but this means the player can only move diagonally.

  • Now, if I had thought of this when I was coding it, I wouldn't have continued. I would have actually done each of... sorry, just punched the mic... I wouldn't have done that; I would have also added all of the x-only and y-only movements as well, so it could move up and down, because I would not have expected Q-learning to be able to handle this.

  • Like I said, I actually made this environment before I thought I was gonna show you guys this environment, so I already know it's gonna work.

  • It works by using the wall.

  • So if it gets to the wall and then tries to move diagonally, it moves up or down, right or left, depending on which wall it's at.

  • So it still actually works. And I'm still amazed that it learns to use the wall, because, as you'll see, it has no idea about the wall; it doesn't know there's any boundary to its environment.

  • So the fact that it learns to use the wall is fascinating.

  • And I don't know why, because it's not like deep learning, where if it knows it's on evens or odds, and the food is on the opposite... let's say the x is even for the agent, and then you've got... it's an exact combination that, like, a deep learning model could figure out and be like, oh, we need to flip this; let's go to the wall.

  • But for Q-learning, I really don't think it should be able to learn that; that's not how it works. But it actually does work, and this model can be trained to, like, almost 100%, because, in theory, half of the time when it gets initialized randomly, the food should be in a location inaccessible without using the wall.

  • But this model actually learns to succeed almost 100% of the time, so it's mind-boggling.

  • But anyway, if you want to add more choices, go for it.

  • Um, I'm gonna leave it this way, but just take note that it can only move diagonally, and it's amazing.

  • So now let's code the move method.

  • And basically, we want to be able to move randomly, if there's no value passed here, or move very specifically.

  • So we're going to say: if not x, then what we want to do is self.x plus-equals np.random.randint between negative 1 and 2.

  • So it's from negative 1 up to, but not including, 2. So that just means it will do either a negative 1, a 0, or a 1.

  • So keep in mind, it has the ability to do zero, so this random movement can move straight up or down. Fascinating.

  • If not x... else: self.x plus-equals x.

  • Then we want to do the exact same thing for the y. So I'm actually just gonna copy this, come down here, paste, and swap the x's for y's. Why, why, why... why, though?

  • Okay, so now... so that's it.

  • So that's handling for x and y. The problem here now becomes when it hits the wall.

  • So recall our SIZE is 10; we're trying to make a 10 by 10 grid.

  • So if the agent attempts to move beyond that, we need to... we need to thwart it.

  • Same thing with the food, if we allow the food to move. Spoiler alert: we will turn on movement; you'll get to see it. Um, or at least I'll show you video of it; it'll take probably too long to do it live, but you can train it on your own.

  • So, uh, okay.

  • So: if self.x is less than zero... because it could be at the zero position, it just can't be less than zero... we're gonna say self.x equals 0.

  • No colon required there.

  • Um, elif self.x is greater than the SIZE, uh, minus 1, rather. So again, the SIZE is 10, but the positions are 0 through 9, so there are 10 positions.

  • But that's what's... so if it's greater than 9, we're actually at the 11th position. So we say SIZE minus 1 there. Then what we're going to say is self.x equals SIZE minus 1.

  • We need to fix SIZE there.

  • Okay, now we do the exact same thing for y. So, again, swap in the y's.

  • Fabulous.

  • So now, ah, that should be everything we need for our Blob class, which I will now refer to as blobjects, which you've all heard if you've seen any of my tutorials with blobs.
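
Assembled from the pieces above, the whole Blob class looks roughly like this (a sketch following the narration; it assumes numpy is imported as np and SIZE is defined):

```python
class Blob:
    def __init__(self):
        # spawn at a random cell; overlaps with other blobs are not handled
        self.x = np.random.randint(0, SIZE)
        self.y = np.random.randint(0, SIZE)

    def __str__(self):
        # handy for debugging: print(blob) shows its location
        return f"{self.x}, {self.y}"

    def __sub__(self, other):
        # operator overloading: blob_a - blob_b gives the relative offset
        return (self.x - other.x, self.y - other.y)

    def action(self, choice):
        # four discrete actions, all diagonal moves
        if choice == 0:
            self.move(x=1, y=1)
        elif choice == 1:
            self.move(x=-1, y=-1)
        elif choice == 2:
            self.move(x=-1, y=1)
        elif choice == 3:
            self.move(x=1, y=-1)

    def move(self, x=False, y=False):
        # move randomly if no value is passed, otherwise move exactly
        if not x:
            self.x += np.random.randint(-1, 2)  # -1, 0, or 1
        else:
            self.x += x
        if not y:
            self.y += np.random.randint(-1, 2)
        else:
            self.y += y

        # clamp to the grid: positions run 0 through SIZE - 1
        if self.x < 0:
            self.x = 0
        elif self.x > SIZE - 1:
            self.x = SIZE - 1
        if self.y < 0:
            self.y = 0
        elif self.y > SIZE - 1:
            self.y = SIZE - 1
```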

  • So now what we want to do is either create the Q-table or load the Q-table.

  • So what we're gonna say here is: if start_q_table is None.

  • So if we don't have a starting Q-table, ah, we need to create one.

  • So we're gonna take q_table equals...

  • We're going to just make it a dictionary. We could do something similar to before, um, but we're gonna do it differently.

  • So what we're going to say here is: for i in range of negative SIZE plus 1 to SIZE. This needs to be all caps, SIZE.

  • So again, negative SIZE plus 1, because we need to shift, um, and then SIZE. And we're not doing minus 1 here, because ranges always go up to, but don't include, the end, so we're not actually including the 10.

  • So we're gonna do that, and now what we're going to do is do this three more times. So we've got 1, 2... okay.

  • Because we need every... so think of it this way: we have two coordinates, right?

  • So basically, the observation space is going to be two coordinates, where the first one is going to be the delta to the food.

  • So the relative difference, because we're going to subtract the blobs, right? We're going to subtract these two blobs from each other, using what we overrode up here.

  • So it's gonna be the subtraction of the current player to the food. So that's gonna return one tuple of x and y values, and then the other one is going to be the difference, or the subtraction, or the delta, of the player to the enemy.

  • So it'll be another tuple. So it'll be like (x1, y1), and then (x2, y2). Okay, it'll look like that. That's our observation space.

  • Now, to get every combination, we're gonna iterate through them.

  • So we've got i, i, i, i... okay. We could also get fancy: we could say x1, y1, x2, y2.

  • Okay, that might make a little more sense, since I just showed that other example.

  • So now we have every combination, and we wanna add every combination to our table here.

  • So, um, the way that we're gonna do that is... I'm gonna look at my code, because this is something I knew I would fudge up. q_table, and then for the key, we're actually gonna pass the tuple.

  • So it'll be a tuple, and I changed my variable names, so I already kind of screwed myself here. Um, like this, right?

  • So the key is gonna be a tuple of two tuples.

  • So we're gonna say ((x1, y1), (x2, y2)), and let's fix this. Right. Okay, cool.

  • So we'll save that... there you go, PEP 8.

  • Okay, so then now we just need to initialize with random values. So we'll come up, and we'll just initialize pretty much the same way as before.

  • So that's going to be equal to... and then, so basically, each observation needs, um, four random values, because our action space is four; we have four discrete actions.

  • So we say np.random.uniform, and I'm gonna say negative 5 to 0, and then we're gonna say for i in range(4).

  • Cool.

  • I think that's everything.

  • Looks good.

  • Let's make PEP 8 happy, except for the line limit. Petition to remove the line limit from PEP 8: there are just so many times when the line limit is stupid, and it would be dumb to not exceed it.

  • Okay, so we've got our Q-table values.

  • That's it; our Q-table is initialized.

  • Now we're gonna say else: with open, uh... basically, we're gonna do start_q_table, so that's that file name, "rb", as f, and we're going to say q_table equals pickle.load(f).

  • So if we have one pre-trained: awesome, load that. Or rather, if we don't have one pre-trained: fine, we'll make one. Otherwise: awesome, load it.
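
As a sketch, creating or loading the Q-table per the description above:

```python
if start_q_table is None:
    # every observation is ((x1, y1), (x2, y2)): the deltas to food and enemy
    q_table = {}
    for x1 in range(-SIZE + 1, SIZE):
        for y1 in range(-SIZE + 1, SIZE):
            for x2 in range(-SIZE + 1, SIZE):
                for y2 in range(-SIZE + 1, SIZE):
                    # four random starting Q-values, one per discrete action
                    q_table[((x1, y1), (x2, y2))] = [
                        np.random.uniform(-5, 0) for i in range(4)]
else:
    # resume from a previously pickled table
    with open(start_q_table, "rb") as f:
        q_table = pickle.load(f)
```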

  • So then we just have our Q code, or training, or whatever you wanna call it, to do.

  • So we're gonna say: for episode in range of however many episodes we want.

  • We're going to say the player equals a Blob(), the food equals a Blob(), and the enemy equals a Blob().

  • Easy.

  • Then we're gonna say: if episode modulo SHOW_EVERY is equal to zero, we're going to print an f-string with the episode number, just so we know kind of where we are.

  • And then we're going to print the epsilon, uh, also, so we kind of know where we are epsilon-wise.

  • Uh, then we'll print one more: we're going to say, uh... oops... the SHOW_EVERY-episode mean. We're just going to display the mean episode reward, so we're gonna say np.mean, and that's episode_rewards, negative SHOW_EVERY, colon.

  • I just don't recall making episode_rewards. I don't think we did. Oh, no... I thought I saw it show up initially. Anyway, episode_rewards: that's just going to be an empty list.

  • Okay, uh, and then the other thing we'll do is show equals True; else, we'll set show equals False. And this is just so we can actually display the thing.

  • So now we're gonna get to our actual... so that was per episode. Now, what we want to say is episode_reward equals 0, because we start at zero. And then, for i in range, we'll say 200; this will be how many steps we take.

  • I'm hard-coding that; you could later add it as a parameter, if you like. So the first thing is, we need an observation.

  • Um, so we're gonna say obs equals, and that's going to be the player minus the food, comma, the player minus the enemy.

  • So that's our operator overloading in action.

  • Then we're gonna have our random movement stuff.

  • So we're just going to say: if np.random.random() is greater than the epsilon... as long as that's the case, we're gonna do a regular action. So the action equals np.argmax of the q_table at the location of our observation.

  • Um, otherwise, if that's not the case, we're gonna say else: action equals np.random.randint between 0 and 4. Again, it's up to 4: 0, 1, 2, or 3.

  • So now that we've got the action, we're ready to take the action, so we say player.action(action).

  • So we take that action, which, based on the action passed, moves based on those things that we pass in. Um, and again, I just kind of wanted to separate these, because sometimes, like, you might want to move a thing, and maybe we want to move that thing more than one value, for whatever reason. Like, the movement, and then the action to move, I think should be separate. For this simple example, at the stage we're at, maybe it's unnecessary, but I think it's good moving forward anyway.

  • Um, no one was complaining about that. And then, what I'm gonna do is just say: maybe later, we might want to say something like this: enemy.move(), and then food.move(), right?

  • You might want to let them move. That would complicate things; I think, for training purposes, it's better to not let them move initially. It just, like, really complicates things, I think. Again, I haven't actually tested either way; I just initially thought there was no point.

  • And then later, we should be able to add movement, and the agent should be able to handle that pretty well, except for the one case that I was explaining: you know, in training, if the enemy can't move, it's okay to even, like, brush alongside the enemy. But if the enemy could move, you wouldn't want to take actions like that; it'd be too dangerous, because it could move into us.
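
Sketching the start of the training loop as narrated, including the commented-out movement lines (this consolidates the pieces just described):

```python
episode_rewards = []

for episode in range(HM_EPISODES):
    player = Blob()
    food = Blob()
    enemy = Blob()

    if episode % SHOW_EVERY == 0:
        print(f"on #{episode}, epsilon is {epsilon}")
        # nan plus a "mean of empty slice" warning on the very first
        # print, before any episodes finish, which is harmless
        print(f"{SHOW_EVERY} ep mean: {np.mean(episode_rewards[-SHOW_EVERY:])}")
        show = True
    else:
        show = False

    episode_reward = 0
    for i in range(200):  # hard-coded step limit per episode
        obs = (player - food, player - enemy)  # operator overloading in action
        if np.random.random() > epsilon:
            action = np.argmax(q_table[obs])   # exploit the table
        else:
            action = np.random.randint(0, 4)   # explore randomly
        player.action(action)

        # maybe later:
        # enemy.move()
        # food.move()
```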

  • So now, what we're going to do is code, um, the actual reward process: after we take our action, what was the result?

  • So what we're gonna say is: if player.x equals enemy.x, and player.y equals enemy.y... you done screwed up: reward equals negative ENEMY_PENALTY.

  • Remember the negative; otherwise, your agent is gonna love the enemy.

  • Okay. So if that's the case... now we're gonna say elif, because we're gonna assume, like, we definitely always want to avoid the enemy; so if the enemy is sitting on top of the food or something, we'd rather just avoid the enemy. So: elif player.x equals food.x, and player.y equals food.y, then we're going to say reward equals FOOD_REWARD. Yeah, FOOD_REWARD, all caps.

  • Else, so if those aren't the case: reward equals negative MOVE_PENALTY. So I think that's negative 1.

  • Great.
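
The reward logic, continuing inside the step loop sketched above:

```python
        # reward for the step we just took; the enemy check comes first,
        # since avoiding the enemy always wins out over reaching the food
        if player.x == enemy.x and player.y == enemy.y:
            reward = -ENEMY_PENALTY
        elif player.x == food.x and player.y == food.y:
            reward = FOOD_REWARD
        else:
            reward = -MOVE_PENALTY
```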

  • Now, to do our Q function, we need to... we need to make a new observation based on the movement.

  • So what we're gonna say here is new_observation equals... because we've made the move; when we took that action, we moved.

  • So now we're gonna come up here, ah, and say the new observation is equal to the player minus the food, the player minus the enemy.

  • And, um, yeah, and then after that move, we need to grab the max future Q. That's equal to np.max of the q_table for the new observation. So that gives us the maximum future Q.

  • And then the current Q is equal to the q_table at, um, the observation, and the action that we took. That won't necessarily always be the maximum, or the argmax, whatever.

  • So, okay: current Q, max, not argmax. Anyway.

  • Okay, so now we can actually calculate our Q.

  • So, um, what I would say is: if reward is equal to FOOD_REWARD, um, then the new... so if reward equals FOOD_REWARD, new_q is equal to FOOD_REWARD, because once we have achieved food, we're done.

  • Um, the same thing would be true... elif reward equals ENEMY_PENALTY... if reward... um, that should be equals negative ENEMY_PENALTY. So: elif reward equals negative ENEMY_PENALTY, then we're going to say new_q equals the negative. Don't forget the negative, huh? I really badly want to make that mistake.

  • Uh, else, new_q will be the Q formula: so that'll be equals one minus the LEARNING_RATE... one minus the learning rate, uh, times the current Q, plus the learning rate times, in parentheses, reward plus the DISCOUNT times the max future Q.

  • Let's fix our PEP 8. Cool. So that's our new Q function.

  • I'll zoom out; pause if you need to catch up.

  • Okay. So once we've got that, um, we want to update the Q-table, so we'll say q_table at obs, action, equals whatever that new Q is. Okay.

  • So at that point, we're actually done with the Q-learning.

  • Now, what we want to do is show the environment, if we want to show it, because we kind of want to see it over time; and then also, we want to track some of the metrics, and then maybe graph them or whatever as needed.

  • So: q_table[obs][action] is new_q.
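
The Q-update as narrated, continuing inside the step loop (this is the standard update, new_q = (1 - lr) * current_q + lr * (reward + discount * max_future_q), with the two terminal cases pinned):

```python
        new_obs = (player - food, player - enemy)  # observation after the move
        max_future_q = np.max(q_table[new_obs])    # best Q-value at the new state
        current_q = q_table[obs][action]           # Q-value of the action we took

        if reward == FOOD_REWARD:
            new_q = FOOD_REWARD          # terminal: we reached the food
        elif reward == -ENEMY_PENALTY:
            new_q = -ENEMY_PENALTY       # terminal: we hit the enemy
        else:
            new_q = (1 - LEARNING_RATE) * current_q + \
                LEARNING_RATE * (reward + DISCOUNT * max_future_q)

        q_table[obs][action] = new_q
```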

  • So the next thing is: if show, now we want to show it. Now, we haven't even created the environment visually, but it's just a grid, so it's a pretty simple environment.

  • So we're gonna say the base env is equal to np.zeros, and it's... actually, it would be a tuple here; uh, it's a SIZE by SIZE, so a 10 by 10, by 3, because it's RGB data. And then we're gonna say dtype equals np.uint8.

  • So this is just basically 0 to 255, by 0 to 255, by three channels. So let's fix that.

  • So now, what we want to do... so it's all zeros, so this would be an all-black environment, basically, and then we're just gonna mark things. So we're gonna say env, um, and we're gonna say food.x, food.y equals... and, just kind of thinking now: I think, actually, in an array, it's actually y by x, I think, um, rather than x, y as we would traditionally think of it, right? Because you've got your rows, then your columns, basically. For this environment, though, it's not really gonna matter; it's not gonna make any difference whether we flip that or not.

  • Because, one, it's a square.

  • And two, we're just trying to visualize it; it would just be, like, rotated, I think. Right? So I don't think it actually matters.

  • It would only matter if we were allowing user input, but I actually think it makes better sense to say y, x. I'll fix that in the tutorial, or the text-based version, anyway. Um, I think that's correct; I think it would be y, then x, in an array.

  • Anyway, so there it is: env[food.y][food.x] equals the dictionary at FOOD_N, and so that's just our way of getting the color.

  • Initially, I had a really great reason for that, but as long as we only have one food, it kind of doesn't matter. You could just hard-code this, or you could just say food color, like... so you could just say the food color is something.

  • Um, I had bigger plans for the dictionary in the beginning, and it changed. So anyway, that would probably make more sense. But the show must go on.

  • So I've got food; then we have the player. So we're gonna say env[player.y][player.x] is the d at PLAYER_N.

  • And then this will be the enemy... let me move this mouse... d at ENEMY_N. And then here, it will be enemy.y and then enemy.x.

  • Okay, so that will change that grid's color.

  • And by grid... really, it's really an image now. So it's, ah, it's an RGB image, but unfortunately, it's only a 10 by 10.

  • So the next thing we're gonna do is make this an actual image: we're gonna say img equals Image, capital I, dot fromarray, and then env. And then here's where I'm saying: hey, this is R-frickin'-GB. But for whatever reason, uh, it doesn't care; it's BGR.

  • And the way I found that out is... yeah, let the player move while the food and the enemy don't move, and it's so clear: it's BGR. 100% BGR. It's crazy. Anyway, so that gives us our image.

  • Now, what we're going to say is img equals img.resize, and then we're just gonna resize this to... you can pick anything you want; I'm gonna say 300 by 300, just so, like, I can see it. And then cv2.imshow, and the title is whatever you want it to be, and then the NumPy array of the image.

  • And then I'm gonna throw in some kind of shoddy code here.

  • But that's okay.

  • Everybody accepts it.

  • If the reward is equal to the FOOD_REWARD, or the reward is equal to the negative ENEMY_PENALTY, this means the episode ended: either we got the food, or we really screwed up.

  • So, um, not if we didn't get the food... but if we got the food... no, we didn't screw up, necessarily: if we got the food, yay; or we hit the enemy, really bad.

  • So if that's the case, I just want to pause it, just for a moment longer.

  • So I'm just gonna say: if cv2.waitKey, and in this case it's 500 milliseconds, and then this little batch of awesomeness, and 0xFF equals ord q... basically, if you hit the q key, it breaks.

  • I'm gonna throw that in. And this is just how you let that cv2 window refresh live, basically.

  • So if we either made it to the food, or we really screwed up by hitting the enemy, we want to run that.

  • Otherwise, else, um, we'll just say waitKey(1), so it's frame by frame.

  • And then at the very end, if we either get the food or... um, this should be a double equals; it should not be an assignment. Um, yeah.

  • Okay, so the frames will go really, really quick unless we hit the enemy or get the food, and then we'll just hang there for half a second, just so we can catch it, because this one is one millisecond.

  • Well, it's actually gonna be more than one millisecond; it's gonna take a little longer than that, but it's gonna be quick.
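
The rendering block as described, continuing inside the step loop:

```python
        if show:
            env = np.zeros((SIZE, SIZE, 3), dtype=np.uint8)  # all-black grid
            env[food.y][food.x] = d[FOOD_N]      # note the y-then-x array indexing
            env[player.y][player.x] = d[PLAYER_N]
            env[enemy.y][enemy.x] = d[ENEMY_N]

            img = Image.fromarray(env, "RGB")    # cv2 treats the channels as BGR
            img = img.resize((300, 300))         # blow the 10x10 up so we can see it
            cv2.imshow("image", np.array(img))
            if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
                # the episode is about to end, so hang on this frame a bit longer
                if cv2.waitKey(500) & 0xFF == ord("q"):
                    break
            else:
                if cv2.waitKey(1) & 0xFF == ord("q"):
                    break
```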

  • So, um, so that's our show.

  • So now we can see the environment.

  • Now, what we're gonna say is episode_reward plus-equals reward.

  • And then, if reward is equal to the FOOD_REWARD, or reward, uh, is equal to the negative ENEMY_PENALTY, uh, it's over; we break.

  • And then finally, in our episode loop, so just one tab over, we're going to say episode_rewards, uh, dot append, episode_reward. And then we're gonna decay the epsilon: so epsilon times-equals our EPS_DECAY.

  • And then finally, we just want to graph things at the very end.

  • So I'm going to say moving_avg is equal to the NumPy convolve, uh, of episode_rewards... so this just creates a moving average... and np.ones of, uh, SHOW_EVERY, and then, uh, divided by SHOW_EVERY. And then there's this mode; let me look that up... yeah, mode is valid. And I believe that would be, uh, mode equals "valid".

  • So if you don't have enough to do the calculation, should you still do it or not? I'm pretty sure that's what that's gonna be, but I could be wrong. Anyway, it's just a way for us to make a moving average; that's totally a Google-it operation there.

  • Um, because we want it to be, like, a rolling moving average. You could also do it in chunks, like we did in the previous environment; uh, I just think a moving average is a better idea.

  • So anyway, there's the moving average, and then we're going to do plt.plot, and then for x, I'm just going to say i for i in range of the len of moving_avg, and then for y, we will plot moving_avg. Beautiful.

  • And then plt.ylabel: we're gonna say reward, uh, and then SHOW_EVERY. So the label will be whatever this SHOW_EVERY is; if it's 3000, it'll be the 3000-episode moving average, and so on.

  • plt.xlabel: we'll say this is our episode number.

  • plt.show().

  • And then finally: with open, and we'll make an f-string, um, and we're gonna say qtable, dash, and then the value of time.time(), dot pickle. With open that pickle, comma, "wb", as f: pickle.dump, and we'll dump our q_table into f.

  • Cool. Okay.
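
The end-of-episode bookkeeping and the final graphing and saving, roughly, closing out the script:

```python
        episode_reward += reward
        if reward == FOOD_REWARD or reward == -ENEMY_PENALTY:
            break  # the episode is over either way

    episode_rewards.append(episode_reward)
    epsilon *= EPS_DECAY

# rolling average over SHOW_EVERY episodes via convolution
moving_avg = np.convolve(episode_rewards,
                         np.ones((SHOW_EVERY,)) / SHOW_EVERY, mode="valid")

plt.plot([i for i in range(len(moving_avg))], moving_avg)
plt.ylabel(f"reward {SHOW_EVERY}ma")
plt.xlabel("episode #")
plt.show()

# timestamped file name so saved tables have some order
with open(f"qtable-{int(time.time())}.pickle", "wb") as f:
    pickle.dump(q_table, f)
```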

  • Coded that all the way through.

  • How many... uh, reward, on FOOD_REWARD... oh, so this should be lower case: lower case reward.

  • So, what are the odds that I coded this without any other errors?

  • I don't know.

  • I don't know.

  • We're gonna find out, though.

  • We're gonna find out.

  • Um, let's go into here: cmd, python... oops... python, our Q-learning part-four script. Come on, baby.

  • Okay... so it looks like we didn't do very well there, huh?

  • Do we have an error? No, we're still making it.

  • Warning: invalid value... I think, I think we're fine.

  • Cool.

  • Awesome.

  • Show me my environment.

  • Doing it.

  • So, so, yeah: this is your enemy, this is your food, this is your player.

  • So it's running now.

  • It looks like we got it that time.

  • So if you see blue and red really quickly, that means we got to the food; if you see green and red, that means we ran into the enemy at the very end.

  • So everything's happening, sometimes very quickly.

  • So we're only on, eh, episode what, 9,000?

  • Um, yeah. So it's learning quick, huh?

  • It's been pretty good.

  • I forget how many episodes we said to go; um, 25,000, I think.

  • Yeah, it's moving fast.

  • So anyway, the order that we drew things in... uh, the reason I chose that order was so I could figure out what happened at the end.

  • So these are all victories; this is pretty cool.

  • See... uh, I'm just waiting for the end, so we can load it in, and I can show you guys, you know, how we can start to kind of tinker with things.

  • I can also bring up the text-based version of the tutorial for some other, longer simulations as well.

  • Okay, so here's our graph. We can see here... this is basically the rolling average, um, of our accuracy... or, not really our accuracy: our rewards, basically. Oh, and we... typo of reward there. Anyway, um, cool.

  • So you can see it definitely learned things really quickly, and again, half of the time it should not be able to get there directly, but it uses the wall to get there, which is super cool.

  • Um, okay.

  • So what I'm gonna do is close out of here. I guess I'll just... oh, it had to save. Okay, so it saves our Q-table.

  • So that means the Q-table is, like, 13.6 megabytes. As we continue to train, because we're using pickle and a dictionary, that will expand over time.

  • And you can actually use that expansion to know how many more... yeah, how many more randomly initialized values are being updated.

  • Kind of an interesting, uh, outlook on my flaw of using a dictionary, and pickling a dictionary, as opposed to using NumPy correctly.

  • Anyway, um, yeah.

  • So, uh, oh... so let's say, okay, we like what we've got; we think this is good.

  • So then we could load in the Q-table: so we copy this file name, and then, rather than None, we'll load in the Q-table itself.

  • So now we'll load this, and then now let's just set epsilon to zero.

  • Um, what else would I need to change? Trying to think... I think we're good, actually.

  • Let's set SHOW_EVERY to 1. Uh, I think that's everything.

  • I'm afraid I'm missing something, but let's run it; let's see how we look.

  • So now you're seeing it play every time.

  • He's doing really good.

  • Get it, get it, get it.

  • Oh, beautiful.

  • Beautiful again.

  • More.

  • Ah, solved.

  • That's pretty cool.

  • Okay, so, um... see those ones that were, like, bouncing around? Ah, that's a perfect example of the failure case.

  • Because he can't move... um... ah, man.

  • So cool.

  • Um, but a lot of times you can see it actually go... specifically, it could have gone straight there. Or, not really straight: it can only move diagonally, but it could, like, zigzag, right?

  • But instead, it goes to the wall and bounces off the wall, and I just can't believe it's learned behavior like that.

  • I just... I just can't. It's crazy.

  • But anyway, it has.

  • So now, what I want to do... let me break this. That was everything I wanted to show.

  • Let's turn, let's turn movement on, and see how we do there.

  • So just... oops, this... let's save that, and then run it again.

  • And so... so he died there. Uh, he's surviving now pretty well.

  • He's catching the food so fast; that's just crazy.

  • Died there once.

  • Died again.

  • Ouch.

  • Anyway, um, yeah: I just think this is really cool.

  • So I'm super happy that I've made the environment, and I hope you guys are as impressed as I am with this, because if I made a neural network... like, say, a convnet... and it looked at this and tried to learn to do this, um, it would take a while. And this trained in, like, I don't even know... what, three minutes?

  • I just think that's pretty cool.

  • I mean, I understand it's basic stuff, but, um, I think it's really cool.

  • So now we could take things further: like, we could make a bigger environment, we could do more steps, or whatever you want to do. So feel free to tinker with the environment.

  • Um, the only other thing I want to show you guys is some of the results of using a much, much larger, um, environment.

  • So, so, we're doing a 10 by 10. If you just change that to a 20 by 20, your Q-table size goes from about 15 megabytes to, like, 250 megabytes, just for comparison's sake.
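
That blow-up follows from the observation space: on a grid of side n, each delta ranges over 2n - 1 values, so the table has (2n - 1)^4 keys, each holding four Q-values. A quick check (my arithmetic, not from the video):

```python
# Q-table keys for a grid of side n: each of the four deltas
# spans 2n - 1 values, so the table has (2n - 1)**4 keys.
for n in (10, 20):
    keys = (2 * n - 1) ** 4
    print(n, keys, keys * 4)
# n=10 ->   130,321 keys (~0.5M Q-values)
# n=20 -> 2,313,441 keys (~9.3M Q-values), roughly 17x bigger,
#         which lines up with ~15 MB vs ~250 MB on disk
```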

  • And then, if I can figure out where the heck they are... I promise you, there are some cool videos and photos; we just have to get to them.

  • So let me see what this one is.

  • This one's probably what we were just looking at.

  • Yeah. Um, and then I just showed how to get it live, how to get movement, blah, blah, blah, blah; nobody cares, here.

  • So here I start to actually train for a 20 by 20 space.

  • So this is a 20 by 20 with no movement, I want to say. Yeah. So it's just seeing: could it actually learn a much larger environment? And obviously, it does pretty well.

  • Uh, and then I turned on movement for both the enemy and the food. Still no problem.

  • And some of those are pretty complex, where it's gotta go around the enemy or deal with the enemy, which, again, is pretty cool that it can even do that.

  • Um, and I almost wonder if allowing movement makes it easier, because it's always trying to get towards the food, and since the food is capable of moving in ones, like, up, down, left, right, not just diagonally, that might actually even make it easier for the agent.

  • Okay, so that's our environment. I hope you guys enjoyed.

  • Um, if you've got questions, comments, concerns, whatever, or you think I did something wrong, feel free to let me know.

  • I didn't think about it at all... it can only move diagonally, so how is it actually getting the food?... until I showed the code to Daniel, and he's like, yo, it can only move diagonally; how does it get the food, like, half of the time?

  • And then I'm like, oh man, that's so cool, because it uses the wall.

  • But it has no idea about boundaries; like, it doesn't know about the wall.

  • So I have no idea why that works, because, again, it's just a Q-table; it's like brute force, right?

  • It's just... the fact that it does, I honestly don't get it.

  • I don't know how it could learn to use the wall, or if that's just random, like if the wall being there,
