
  • Hey, I'm Jabril and welcome to Crash Course AI.

  • Say I want to get a cookie from a jar that's on a tall shelf.

  • There isn't one "right way" to get the cookies.

  • Maybe I find a ladder, use a lasso, or build a complicated system of pulleys.

  • These could all be brilliant or terrible ideas, but if something works, I get the sweet taste

  • of victory... and I learn that doing that same thing could get me another cookie in

  • the future.

  • We learn lots of things by trial-and-error, and this kind of "learning by doing" to

  • achieve complicated goals is called Reinforcement Learning.

  • INTRO

  • So far, we've talked about two types of learning in Crash Course AI: Supervised Learning,

  • where a teacher gives an AI answers to learn from, and Unsupervised Learning, where an

  • AI tries to find patterns in the world.

  • Reinforcement Learning is particularly useful for situations where we want to train AIs

  • to have certain skills we don't fully understand ourselves.

  • For example, I'm pretty good at walking, but trying to explain the process of walking

  • is kind of difficult.

  • What angle should your femur be relative to your foot?

  • And should you move it with an average angular velocity of... yeah, never mind... it's really

  • difficult.

  • With reinforcement learning, we can train AIs to perform complicated tasks.

  • But unlike other techniques, we only have to tell them at the very end of the task if

  • they succeeded, and then ask them to tell us how they did it.

  • (We're going to focus on this general case, but sometimes this feedback could come earlier.)

  • So if we want an AI to learn to walk, we give them a reward if they're both standing up

  • and moving forward, and then figure out what steps they took to get to that point.

  • The longer the AI stands up and moves forward, the longer it's walking, and the more reward

  • it gets.

  • So you can kind of see how the key to reinforcement learning is just trial-and-error, again and

  • again.

  • For humans, a reward might be a cookie or the joy of winning a board game.

  • But for an AI system, a reward is just a small positive signal that basically tells it "good

  • job" and "do that again"!

  • Google Deepmind got some pretty impressive results when they used reinforcement learning

  • to teach virtual AI systems to walk, jump, and even duck under obstacles.

  • It looks kinda silly, but works pretty well!

  • Other researchers have even helped real-life robots learn to walk.

  • So seeing the end result is pretty fun and can help us understand the goals of reinforcement

  • learning.

  • But to really understand how reinforcement learning works, we have to learn new language

  • to talk about these AI and what they're doing.

  • Similar to previous episodes, we have an AI (or Agent) as our loyal subject that's going

  • to learn.

  • An agent makes predictions or performs Actions, like moving a tiny bit forward, or picking

  • the next best move in a game.

  • And it performs actions based on its current inputs, which we call the State.

  • In supervised learning, after /each/ action, we would have a training label that tells

  • our AI whether it did the right thing or not.

  • We can't do that here with reinforcement learning, because we don't know what the

  • "right thing" actually is until it's completely done with the task.

  • This difference actually highlights one of the hardest parts of reinforcement learning

  • called credit assignment.

  • It's hard to know which actions helped us get to the reward (and should get credit)

  • and which actions slowed down our AI when we don't pause to think after every action.

  • So the agent ends up interacting with its Environment for a while, whether that's

  • a game board, a virtual maze, or a real-life kitchen.

  • And the agent takes many actions until it gets a Reward, which we give out when it wins

  • a game or gets that cookie jar from that really tall shelf.

  • Then, every time the agent wins (or succeeds at its task), we can look back on the actions

  • it took and slowly figure out which game states were helpful and which weren't.

  • During this reflection, we're assigning Value to those different game states and deciding

  • on a Policy for which actions work best.

  • We need Values and Policies to get anything done in reinforcement learning.

  • Let's say I see some food in the kitchen: a box, a small bag, and a plate with a donut.

  • So my brain can assign each of these a value, a numerical yummy-ness value.

  • The box probably has 6 donuts in it, the bag probably has 2, and the plate just has 1…

  • so the values I assign are 6, 2, and 1.

  • Now that I've assigned each of them a value, I can decide on a policy to plan what action

  • to take!

  • The simplest policy is to go to the highest value (that box of possibly 6 donuts).

  • But I can't see inside of it, and that could be a box of bagels, so it's high reward

  • but high risk.

  • Another policy could be low reward but low risk, going with the plate with 1 guaranteed

  • delicious donut.

  • Personally, I'd pick a middle-ground policy, and go for the bag because I have a better

  • chance of guessing that there are donuts inside than the box, and a value of 1 donut isn't

  • enough.
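
The three donut policies above can be sketched in a few lines of Python. The values 6, 2, and 1 come from the example; the chances that each container really holds donuts are made-up numbers for illustration, and the function names are hypothetical.

```python
# Yummy-ness values from the example: box = 6, bag = 2, plate = 1.
values = {"box": 6, "bag": 2, "plate": 1}

# ASSUMPTION: invented chances that each container really holds donuts.
p_donuts = {"box": 0.2, "bag": 0.8, "plate": 1.0}


def greedy_policy(values):
    """High reward, high risk: pick the highest raw value."""
    return max(values, key=values.get)


def safest_policy(p_donuts):
    """Low reward, low risk: pick the option most likely to hold donuts."""
    return max(p_donuts, key=p_donuts.get)


def middle_ground_policy(values, p_donuts):
    """Weigh value by chance: pick the highest expected value."""
    return max(values, key=lambda option: values[option] * p_donuts[option])


print(greedy_policy(values))                   # box
print(safest_policy(p_donuts))                 # plate
print(middle_ground_policy(values, p_donuts))  # bag
```

With these assumed chances, the middle-ground policy lands on the bag, matching the choice described above; different chances could change the answer.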

  • That's a lot of vocab, so let's see these concepts in action to help us remember everything.

  • Our example is going to focus on a mathematical framework that could be used with different

  • underlying machine learning techniques.

  • Let's say John-Green-bot wants to go to the charging station to recharge his batteries.

  • In this example, John-Green-bot is a brand new Agent, and the room is the Environment

  • he needs to learn about.

  • From where he is now in the room, he has four possible Actions: moving up, down, left, or

  • right.

  • And his State is a couple of different inputs: where he is, where he came from, and what

  • he sees.

  • For this example, we'll assume John-Green-bot can see the whole room.

  • So when he moves up (or any direction), his state changes.

  • But he doesn't know yet if moving up was a good idea, because he hasn't reached a

  • goal.

  • So go on, John-Green-bot... explore!

  • He found the battery, so he got a Reward (that little plus one).

  • Now, we can look back at the path he took and give all the cells he walked through a

  • Value -- specifically, a higher value for those near the goal, and lower for those farther

  • away.
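
One way to picture this step is a small Python sketch that walks back along the path and gives each cell a value that shrinks with its distance from the goal. The episode doesn't say how fast values should shrink, so the `discount` of 0.9, the grid coordinates, and the function name `assign_values` are all assumptions.

```python
def assign_values(path, reward=1.0, discount=0.9):
    """Give each visited cell a value that shrinks with distance from the goal.

    path: list of (row, col) cells in the order visited, ending at the goal.
    discount: ASSUMED decay per step away from the goal.
    """
    values = {}
    last = len(path) - 1
    for i, cell in enumerate(path):
        steps_to_goal = last - i
        value = reward * discount ** steps_to_goal
        # If a cell was visited more than once, keep its best estimate.
        values[cell] = max(values.get(cell, 0.0), value)
    return values


# Hypothetical walk across a grid, ending at the battery in cell (0, 2).
path = [(3, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
values = assign_values(path)

for cell in path:
    print(cell, round(values[cell], 3))
```

The goal cell gets the full reward of 1, and each step farther back gets 0.9 times the value of the cell after it.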

  • These higher and lower values help with the trial-and-error of reinforcement learning,

  • and they give our agent more information about better actions to take when he tries again!

  • So if we put John-Green-bot back at the start, he'll want to decide on a Policy that maximizes

  • reward.

  • Since he already knows a path to the battery, he'll walk along that path, and he's guaranteed

  • another +1.

  • But that's… too easy.

  • And kind of boring if John-Green-bot just takes the same long and winding path every

  • time.

  • So another important concept in reinforcement learning is the trade-off between exploitation

  • and exploration.

  • Now that John-Green-bot knows one way to get to the battery, he could just exploit this

  • knowledge by always taking the same 10 actions.

  • It's not a terrible idea -- he knows he won't get lost and he'll definitely get

  • a reward.

  • But this 10-action path is also pretty inefficient, and there are probably more efficient paths

  • out there.

  • So exploitation may not be the best strategy.

  • It's usually worth trying lots of different actions to see what happens, which is a strategy

  • called exploration.

  • Every new path John-Green-bot takes will give him a bit more data about the best way to

  • get a reward.

  • So let's let John-Green-bot explore for 100 actions, and after he completes a path,

  • we'll update the values of the cells he's been to.

  • Now we can look at all these new values!

  • During exploration, John-Green-bot found a short-cut, so now he knows a path that only

  • takes 4 actions to get to the goal.

  • This means our new policy (which always chooses the best value for the next action) will take

  • John-Green-bot down this faster path to the target.

  • That's much better than before, but we paid a cost, because during those 100 actions of

  • exploration, he took some paths that were even /more/ inefficient than the first 10-action

  • try and only got a total of 6 points.

  • If John-Green-bot had just exploited his knowledge of the first path he took for those 100 actions,

  • he could have made it to the battery 10 times and gotten 10 points.

  • So you could say that exploration was a waste of time.

  • BUT if we started a new competition between the new John-Green-bot (who knows a 4-action

  • path) and his younger, more foolish self (who knows a 10-action path), over 100 actions,

  • the new John-Green-bot would be able to get 25 points because his path is much faster.

  • His reinforcement learning helped!
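
The point totals in this comparison are simple division, which a short Python sketch can check. It assumes each trip to the battery is worth 1 point and that only complete trips count; the function name `points` is hypothetical.

```python
def points(path_length, budget, reward=1):
    """Points earned by repeating one path: full trips that fit in the budget."""
    return (budget // path_length) * reward


budget = 100                        # actions available
old_bot = points(10, budget)        # exploits the 10-action path
new_bot = points(4, budget)         # exploits the 4-action shortcut
exploring = 6                       # total reported for the exploration run

print(old_bot, new_bot)             # 10 25

# Over 200 actions: explore for 100, then exploit the shortcut for 100,
# versus exploiting the old path the whole time.
print(exploring + new_bot)          # 31
print(points(10, 2 * budget))       # 20
```

Under these assumptions, the exploration run loses in the first 100 actions (6 points against 10) but comes out ahead once the shortcut is used for the next 100.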

  • So should we explore more to try and find an even better path?

  • Or should we just use exploitation right away to collect more points?

  • In many reinforcement learning problems, we need a balance of exploitation and exploration,

  • and people are actively researching this tradeoff.
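
One common way to strike this balance is called epsilon-greedy. The episode doesn't name it, so treat this Python sketch as one illustrative option: most of the time the agent exploits the best-known value, and a small fraction of the time it explores at random. The exploration rate `epsilon` and the neighbor values are assumed numbers.

```python
import random


def epsilon_greedy(neighbor_values, epsilon=0.1, rng=random):
    """Choose an action: explore with probability epsilon, otherwise exploit.

    neighbor_values: dict mapping an action to the value of the cell it leads to.
    epsilon: ASSUMED exploration rate between 0 and 1.
    """
    if rng.random() < epsilon:
        return rng.choice(list(neighbor_values))          # explore
    return max(neighbor_values, key=neighbor_values.get)  # exploit


# Hypothetical values of the four cells around the agent.
neighbor_values = {"up": 0.81, "down": 0.0, "left": 0.0, "right": 0.9}

print(epsilon_greedy(neighbor_values, epsilon=0.0))  # always exploits: right
print(epsilon_greedy(neighbor_values, epsilon=0.1))  # usually right, sometimes random
```

Setting `epsilon` to 0 gives pure exploitation and setting it to 1 gives pure exploration; values in between trade one for the other.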

  • These kinds of problems can get even more complicated if we add different kinds of rewards,

  • like a +1 battery and a +3 bigger battery.

  • Or there could even be Negative Rewards that John-Green-Bot needs to learn to avoid, like

  • this black hole.

  • If we let John-Green-Bot explore this new environment using reinforcement learning,

  • sometimes he falls into the black hole.

  • So the cells will end up having different values than the earlier environment, and there

  • could be a different best policy.

  • Plus, the whole environment could change in many of these problems.

  • If we have an AI in our car helping us drive home, the same road will have different people,

  • bicycles, cars, and black holes on it every day.

  • There might even be construction that completely reroutes us.

  • This is where reinforcement learning problems get more fun, but much harder.

  • When John-Green-bot was learning how to navigate on that small grid, cells closer to the battery

  • had higher values than those far away.

  • But for many problems, we'll want to use a value function to think about what we've

  • done so far, and decide on the next move using math.

  • For example, in this situation where an AI is helping us drive home, if we're optimizing

  • safety and we see the brake lights of the car in front of us, it's probably time to

  • slow down, but if we saw a bag of donuts in the street, we would want to stop.

  • So reinforcement learning is a powerful tool that's been around for decades, but a lot

  • of problems need a ton of data and a ton of time to solve.

  • There have been really impressive results recently thanks to deep reinforcement learning

  • on large-scale computing.