What is going on everybody, and welcome to part three of the reinforcement learning, as well as Q-learning since that's what we're starting with, tutorial series. In the last video, we got the mountain car to get up the hill, basically using Q-learning, which is cool. But we did it with a bunch of parameters here, or constants, that I just set because I knew they would work. In the real world, you're gonna have to figure these values out. And the way that you're going to do that can't possibly be watching every episode and seeing whether it works, because, let's say, even with 25,000 episodes, showing every 2,000 is still not a statistically significant sampling of how your model's doing. So we have to find a better way.

One thing that we can do, just a simple metric to track, is simply the reward. A proper episode reward tracking system is probably enough for most basic operations. Now, for really complicated environments, you might need to do a little more, but you're not gonna use basic Q-learning for a complicated environment anyway. So with that, let me just show you guys one way you could track these metrics over time. There's a million ways you could do this; it's just a simple programming task, honestly, but it is something we definitely need to do. So I'm just gonna show one example. And then after that example, I'm gonna show you guys something kind of cool with the Q-tables at the end, as just a slight bonus.

So, anyway, the first thing we're gonna do here is come underneath q_table. Also, let's change EPISODES. Let's just do 2,000 episodes, and show every 500 episodes. That will just help us get through this this year. So, underneath q_table, let's just create a couple of new lists, a couple of new values. I meant to zoom in a little bit there. So first we're gonna have ep_rewards.
This will just be a list that contains each episode's reward. And then we're gonna have an aggregate, aggr_ep_rewards, dictionary, and this dictionary is gonna track the episode number, basically; then we're gonna have the average, and then we'll have the min, and then we'll have max. Okay, so this is good. This will just be a dictionary that tracks episode number, so this would just serve as, like, the x-axis for a graph, basically. The average: this will be the trailing average. You can almost think of it like, not a moving average, but the average for any given window. So every 500 episodes, for example, this will be the average over those episodes. So as our model improves, average should go up. Minimum: this is just gonna be tracking, for every SHOW_EVERY window, what was the worst model we had. So basically, it's just, what's the worst? It's not a hard concept. Max: what was the best one?

So why do we want these? Well, the average might actually be going up, but the minimum, or the worst-performing model, is still in the dumps. And so you might have cases where you actually prefer that the worst model is still somewhat decent rather than to have the highest average or something like that. So this is just barely getting into it. What you're actually gonna be trying to optimize for is gonna vary depending on what kind of task you're attempting to achieve. So I'm just gonna keep it fairly simple here and go with that.

So the next thing we're gonna do is come down into the episode loop here, and we're gonna track episode_reward. We're going to say that equals zero. Then we're gonna come down to the iteration of steps, which is here, and then we get a reward here, and what we want to do is add that reward. So episode_reward plus-equals reward. Then what we want to do... what's its deal? I guess it's because we're just not using it.
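As a rough sketch of the bookkeeping described so far: the snippet below sets up the ep_rewards list and the aggr_ep_rewards dictionary, then accumulates episode_reward inside an episode loop. The dictionary keys ('ep', 'avg', 'min', 'max') and the fake_episode helper are assumptions for illustration; fake_episode only stands in for the real environment step loop, handing back -1 per step for at most 200 steps, which matches the minus 200 floor mentioned later.

```python
import random

EPISODES = 2000

# One total reward per episode, in order.
ep_rewards = []
# Windowed statistics; 'ep' doubles as the x-axis when graphing.
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}


def fake_episode(rng):
    """Hypothetical stand-in for the environment loop.

    Yields a reward of -1 per step for between 100 and 200 steps.
    """
    steps = rng.randint(100, 200)
    for _ in range(steps):
        yield -1.0


rng = random.Random(0)
for episode in range(EPISODES):
    episode_reward = 0  # reset at the start of every episode
    for reward in fake_episode(rng):
        episode_reward += reward  # add each step's reward
    ep_rewards.append(episode_reward)  # store the episode total
```

In the real script, the inner loop would be the existing step loop, with the reward coming from the environment instead of the stand-in generator.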
Yeah, undefined episode... wait. Whoa, did I misspell every instance of episode_reward? Episode, episode_reward. Okay, coming down here: episode_reward plus-equals reward. Phenomenal. Okay, so that will add to episode_reward every time. And then basically, at the end of the episode, what we'd like to do is append episode_reward to ep_rewards. So we'll come to the very end of this loop. Come back here. And the first thing that we're going to say is ep_rewards... I don't know if my keyboard's dying or what. I'm pretty sure I didn't make that typo. ep_rewards.append. All these ep_rewards and episode_reward names are a little hard to explain with the keyboard issue, but anyway, ep_rewards.append, and we want to append that episode_reward, so the total reward at the very end.

So then the next thing we're going to say is: if not episode modulo SHOW_EVERY. That's the same as saying if episode modulo SHOW_EVERY double-equals zero. So if it does equal zero, you can kind of shorten this just by saying if not episode modulo SHOW_EVERY. Basically it just means: every SHOW_EVERY episodes, do this thing. I can't tell you how many times in Python I need to perform a task like this, and it's kind of unfortunate that this is the, like, industry standard. It would be kind of nice to have some sort of way to be like, every. Wouldn't that be a nice statement in Python? Just saying something like that. Anyway, make it happen, guys. If not episode modulo SHOW_EVERY.

So now what we want to do is work on our dictionary. The first thing we're gonna do is calculate the average reward, and that's going to be equal to the sum of our ep_rewards. There we go. It'll be the sum of ep_rewards, minus SHOW_EVERY colon, and then divided by...
And I know some people are gonna be like, why don't you just divide by SHOW_EVERY? Well, in this case, a slice like this, minus SHOW_EVERY colon, just means the last, let's say, 500. But if the list is only 300 long and we ask for the last 500, it's still only gonna be 300 long. Either way, we shouldn't run into that as a problem, but just in case, I'm going to instead say the len of ep_rewards, minus SHOW_EVERY colon. Cool, so that gives us our average reward, and now we're ready to actually populate that dictionary.

So it's aggr_ep_rewards, and then basically we're gonna have four things that we're gonna do. First of all, the episode: that's not gonna be equal to anything, we're gonna say append episode. And then I'm gonna take this and copy pasta, pasta, pasta: average, min, and max. So here we're going to append the average reward. Then here I'm gonna copy and paste this ep_rewards minus SHOW_EVERY colon thing. Paste, paste. And this one will be, what was that, min, so the minimum of that slice, and then here we're going to say max. Cool, beautiful. So now we've built this dictionary.

The next thing that would be useful, possibly, is to print an f-string, so just for this specific episode we could print all of these things. So we could say episode colon, and then average colon, min colon, and then max colon. And then we'll just copy and paste into here. Copy, paste. Min, max: copy this, paste, and then do the same thing for max here. Copy, paste. Beautiful. Okay, so that will give us some metrics as it's training, so we can kind of see how things are going besides just seeing the simple replay. Because, like I said, that's probably not gonna be enough. Now, at the very end, we're gonna graph it.
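The windowed statistics and the modulo check could look roughly like this. It is a sketch under assumptions: the dictionary keys and the synthetic, slowly improving episode_reward values are made up so the snippet runs on its own, and in the real script this block would sit at the end of the existing episode loop.

```python
SHOW_EVERY = 500
EPISODES = 2000

ep_rewards = []
aggr_ep_rewards = {'ep': [], 'avg': [], 'min': [], 'max': []}

for episode in range(EPISODES):
    # Made-up total that slowly improves; stands in for a real episode.
    episode_reward = -200.0 + 0.02 * episode
    ep_rewards.append(episode_reward)

    # episode % SHOW_EVERY is 0 (falsy) on every SHOW_EVERY-th episode,
    # so `not ...` is True exactly on those episodes.
    if not episode % SHOW_EVERY:
        # The last SHOW_EVERY totals, or fewer if the list is still short.
        window = ep_rewards[-SHOW_EVERY:]
        # Dividing by len(window) stays correct for a short window.
        average_reward = sum(window) / len(window)
        aggr_ep_rewards['ep'].append(episode)
        aggr_ep_rewards['avg'].append(average_reward)
        aggr_ep_rewards['min'].append(min(window))
        aggr_ep_rewards['max'].append(max(window))
        print(f"Episode: {episode} avg: {average_reward:.2f} "
              f"min: {min(window):.2f} max: {max(window):.2f}")
```

Note that at episode 0 the window holds a single value, which is the case where dividing by SHOW_EVERY would give the wrong average. Also, the colon matters: a plain index such as ep_rewards[-SHOW_EVERY], without the colon, raises IndexError while the list is shorter than SHOW_EVERY, whereas the slice just returns whatever is there.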
So at the very tippy top we're gonna import matplotlib.pyplot as plt. And then we're gonna come down to the bottom again, and we can close the environment; that shouldn't be a problem. We'll do it here: plt.plot. And then we want to plot basically all three of these combinations. So the x will always be the episode, and then we'll do avg, and then we're gonna say label equals avg. And then I'm gonna copy this, paste, paste, and then we'll do min, min, and max, max. And then we'll do a plt.legend and say the location here is 4. And then finally plt.show. For the location, you can kind of pick where you want the legend to go, and I'm gonna put it in 4, which just means the lower right. In theory, everything should be going up over time, so hopefully the lower right isn't in the way. If it's in the way, we didn't do very well anyways.

So we'll save that, and what I want to do is open up a command line and run the Q-learning part three script with python. We'll get that at least started. I'm gonna move this, like, here so we can hopefully see the metrics... oops. Shoot, is what I was going to say. List index out of range. ep_rewards... do we not append? ep_rewards.append(episode_reward). We are appending. We must have done something stupid. There's episode_reward equals zero, episode_reward plus-equals reward. At the very end, ep_rewards.append(episode_reward). Then, hold on. I'm just not seeing what the issue is here. List index... oh, minus SHOW_EVERY. Okay, I see it. I see it. So coming back down here: the issue was the minus SHOW_EVERY colon. A simple error. Let's just make sure we didn't make that anywhere else. Okay, I think we're good. So let's try one more time. Cool. Okay, so everything's negative 200 here. Not too shocking.
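The plotting step could be sketched like this. The plot_aggregates helper, the dictionary keys, and the demo numbers are all made up for illustration; the only matplotlib calls used are plt.plot, plt.legend, and plt.show, with loc=4 being the numeric code for the lower right.

```python
import matplotlib.pyplot as plt


def plot_aggregates(aggr_ep_rewards, show=True):
    """Plot avg, min and max reward against the episode number."""
    for key in ('avg', 'min', 'max'):
        plt.plot(aggr_ep_rewards['ep'], aggr_ep_rewards[key], label=key)
    plt.legend(loc=4)  # 4 is matplotlib's code for 'lower right'
    if show:
        plt.show()


# Made-up numbers, only so the sketch has something to draw.
demo = {
    'ep': [0, 500, 1000, 1500],
    'avg': [-200.0, -200.0, -198.5, -195.0],
    'min': [-200.0, -200.0, -200.0, -200.0],
    'max': [-200.0, -200.0, -170.0, -150.0],
}

# plot_aggregates(demo)  # uncomment to open the plot window
```

In the real script you would pass the dictionary that was filled in during training instead of demo.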
So while we wait for this to do, what, 2,000 episodes, I'm going to give a shout-out to my most recent brand new channel members: Mike Smith, A Jay's Sheeni. Santa knew Bo, Mick and Harv demand. Thank you, guys, for your support. Welcome to the club. You guys are awesome.

So it looks like we are almost done, at 1,500. Wow, that stinks. He, like, almost made it to that flag. Okay, hopefully at this point we'll get a beautiful graph. Wow. Look at that. Was that average or max? I can't even tell. I think that's average. Or that's max, rather. Of course it's max; it's the one that's on top. I was just testing you guys, to be honest. Okay, so we can see here that things are improving. Now, we only did 2,000 episodes, so it's no surprise that the max episode was doing pretty good but the minimum is still gonna be negative 200; it just never made it to the flag. Now, we could continue. I'm just not gonna waste your time and graph a super, really long one and wait for things to iterate through. I've actually already done it. So, this is a video I was gonna show you guys; I'll just leave it in a different tab. Instead, we're gonna pop over here, and I'm gonna full-screen this thing and scroll down. So this is basically what we were just looking at, and then I went a little longer. Actually, I'm not even sure, what did I change here? Oh, this was just with an epsilon change.

So one of the things that you would use these graphs for is, like: how does changing the epsilon decay value change things? How does the start and stop episode number change things? All that kind of stuff. And then also you can train a model for a little bit without an epsilon, then add the epsilon. How does that change things? So you could do all kinds of stuff like this; it's totally up to you what you want to do.
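To make the "start and stop episode number" idea concrete, here is a minimal sketch of a linear epsilon decay schedule. Everything in it is an assumption for illustration: the epsilon_schedule helper, the starting value of 0.5, and the start and stop episodes are hypothetical, not the settings used for the graphs shown in the video.

```python
def epsilon_schedule(episodes, start_epsilon, start_decay, end_decay):
    """Return the epsilon in effect for each episode under linear decay.

    Epsilon is reduced by a fixed amount after every episode between
    start_decay and end_decay (inclusive), and never goes below zero.
    """
    decay_value = start_epsilon / (end_decay - start_decay)
    epsilon = start_epsilon
    schedule = []
    for episode in range(episodes):
        schedule.append(epsilon)
        if start_decay <= episode <= end_decay:
            epsilon = max(0.0, epsilon - decay_value)
    return schedule


# Two hypothetical settings to compare: stop decaying at 500 vs. 1000.
fast = epsilon_schedule(2000, 0.5, 1, 500)
slow = epsilon_schedule(2000, 0.5, 1, 1000)
print(f"epsilon at episode 250: fast={fast[250]:.3f} slow={slow[250]:.3f}")
```

You would then train once per setting and compare the avg, min, and max curves side by side, which is the kind of comparison these graphs are for.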
So then, even after 10,000 episodes, you can see the minimum. Like I said, the legend got in the way because it didn't do very well. But there is a little tick up right here in the minimum, which was otherwise at minus 200. And then again, this was back to changing the epsilon decay value. You can see how that changes some things. Then here I changed the discrete observation size from 20 to 40, and we can see here that at least the improvement is, like, super linear. Like, here you can see it goes up and then it kind of flatlines, plateaus, whereas here it definitely is constantly improving, like, to a point, and continuing past, or at least it looks like it's gonna continue past, any other point we've ever had, and for the same number of episodes.

So then I'm like, okay, let's train it for really long. So we did 25,000 episodes, and we could see that, maybe, at least for this setting, the sweet spot's around 20,000, because we can see the worst agents were still at least making it to the flag. If they don't make it to the flag, it's going to be a minus 200, so as long as they eventually make it to the flag, that's pretty good. So from just before 20,000 to, I don't know, maybe 22,000 or something, for 2,000 episodes, or more than 2,000, probably 3,000, it never failed. It was successful every single time. And then, for whatever reason, it comes back down again. So anyway, you can kind of see how things changed. And then from here, you could start tweaking, like, the epsilon value, or you could reintroduce epsilon, because by this time the epsilon is not there anymore. So maybe we want to reintroduce epsilon, because maybe the agent is finding itself in a position it's never been in before, so that value has, like, never been tweaked, right? So stuff like that.

Okay, and then here is where we start talking about saving the Q-tables. So basically, it's just np.save, and then you can just save the Q-table as whatever you want. So, for example, well, yeah, you could save per episode, but that would be a lot of Q-tables. So instead, what I'm doing here is that modulo thing.
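Saving the Q-table every so often, rather than every episode, could be sketched as follows. The save_q_table helper, the SAVE_EVERY constant, the file naming, the table shape of (20, 20, 3), and the throwaway temp directory are all assumptions for illustration; the only NumPy call relied on is np.save, which writes an array to a .npy file.

```python
import os
import tempfile

import numpy as np

EPISODES = 2000
SAVE_EVERY = 500  # hypothetical: how often to write a snapshot


def save_q_table(q_table, episode, out_dir):
    """Write the Q-table to <out_dir>/<episode>-qtable.npy and return the path."""
    os.makedirs(out_dir, exist_ok=True)
    path = os.path.join(out_dir, f"{episode}-qtable.npy")
    np.save(path, q_table)
    return path


out_dir = tempfile.mkdtemp()  # throwaway directory for this demo
# Placeholder table; the shape is an assumed 20x20 grid with 3 actions.
q_table = np.random.uniform(low=-2, high=0, size=(20, 20, 3))

saved = []
for episode in range(EPISODES):
    # ... the usual training updates to q_table would happen here ...
    if not episode % SAVE_EVERY:  # same modulo check as for the stats
        saved.append(save_q_table(q_table, episode, out_dir))
```

A saved snapshot can be read back later with np.load(path), for example to inspect how the table changed between episodes.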