  • OSCAR RAMIREZ: All right.

  • Well, thank you, everyone.

  • So I'm Oscar Ramirez.

  • This is Sergio.

  • And we're from the TF-Agents team.

  • And we'll talk to you guys about our project.

  • So for those of you that don't know,

  • TF-Agents is our reinforcement learning library

  • built in TensorFlow.

  • And it's hopefully a reliable, scalable,

  • and easy-to-use library.

  • We packaged it with a lot of Colabs, examples,

  • and documentation to try and make it easy for people

  • to jump into reinforcement learning.
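
A minimal sketch of what that first step looks like, assuming the `tf-agents` pip package and OpenAI Gym are installed (the CartPole environment is just an illustrative choice):

```python
# Minimal sketch, assuming `pip install tf-agents` (plus gym) is available.
from tf_agents.environments import suite_gym, tf_py_environment

# Load a classic control task and wrap it as a TensorFlow environment.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

time_step = env.reset()
print(time_step.observation)  # what the agent sees at the first step
```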

  • And we use it internally to actually solve

  • a lot of difficult tasks with reinforcement learning.

  • In our experience, it's been pretty easy to develop new RL

  • algorithms.

  • And we have a whole bunch of tests,

  • making it easy to configure and reproduce results.
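
In the open-source examples, much of that configuration goes through gin-config; here is a hedged sketch of the pattern, where `train_eval` and its parameters are illustrative assumptions rather than the library's actual bindings:

```python
# Sketch of the gin-config pattern used by TF-Agents example binaries;
# the function and parameter names here are illustrative assumptions.
import gin

@gin.configurable
def train_eval(num_iterations=10000, learning_rate=1e-3):
    ...  # build the agent, collect experience, train

# A .gin file can then pin every hyperparameter for reproducibility:
#   train_eval.num_iterations = 100000
#   train_eval.learning_rate = 3e-4
```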

  • A lot of this wouldn't be possible without everyone's

  • contribution, so I just want to make it clear,

  • this has been a team effort.

  • There have been a lot of 20 percenters

  • and external contributors.

  • People have come and gone within the team, as well.

  • And so this is right now the biggest chunk

  • of the current team that is working on TF-Agents.

  • With that, I'll let Sergio talk a bit more about RL in general.

  • SERGIO GUADARRAMA: Thank you, Oscar.

  • Hi, everyone.

  • So we're going to focus a little more on reinforcement

  • learning and how it is different from other kinds

  • of machine learning-- unsupervised learning,

  • supervised learning, and other flavors.

  • Here are three examples.

  • One is robotics, another is a game,

  • and the other one is a recommendation system.

  • Those are clear examples where you can

  • apply reinforcement learning.

  • So the basic idea is--

  • so if you were to try to teach someone how to walk,

  • it's very difficult, because it's really difficult for me

  • to explain to you what you need to do to be able to walk--

  • how to coordinate your legs-- in this case, of the robot-- or even

  • for a kid.

  • Teaching someone how to walk is really difficult.

  • They need to figure it out themselves.

  • How?

  • Trial and error.

  • You try a bunch of times.

  • You fall down.

  • You get up, and then you learn as you're falling.

  • And that's basically-- you can think of it like the reward

  • function.

  • You get a positive reward or a negative reward

  • every time you try.

  • So here, you can see also, even with these learning algorithms,

  • this thing is still just hopping at first, no?

  • After a few trials of learning, this robot

  • is able to move around, wobble a little bit, and then fall.

  • But now it can control the legs a little more--

  • not quite walking, but doing better than before.

  • After it's fully trained, the robot

  • is able to walk from one place to another--

  • basically, go to a specific location, and all those things.

  • So how this happens is basically summarized in this code.

  • Well, there's a lot of code, but over the course

  • of the presentation we will go through the details.

  • Basically, this summarizes all the pieces

  • you will need to be able to train a model like this,

  • and we will go into the details.
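
The slide's code isn't reproduced here, but a hedged sketch of the usual pieces in a TF-Agents setup (environment, network, agent, replay buffer) looks roughly like the following, using a DQN agent on CartPole as a stand-in, with placeholder hyperparameters:

```python
# A sketch of the usual TF-Agents pieces, not the slide's exact code.
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network
from tf_agents.replay_buffers import tf_uniform_replay_buffer

# 1. An environment the agent interacts with.
env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

# 2. A network that maps observations to one Q-value per action.
q_net = q_network.QNetwork(
    env.observation_spec(), env.action_spec(), fc_layer_params=(100,))

# 3. An agent that implements the learning algorithm (DQN here).
agent = dqn_agent.DqnAgent(
    env.time_step_spec(), env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3))
agent.initialize()

# 4. A replay buffer that stores collected experience for training.
replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=env.batch_size,
    max_length=10000)
```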

  • So what is reinforcement learning,

  • and how is it different, basically, from other cases?

  • The idea is we have an agent that

  • is trying to play, in this case, or interact

  • with an environment.

  • In this case, it's like Breakout.

  • So basically, the idea is you need

  • to move the paddle to the left or to the right to hit the ball

  • and break the bricks on the top.

  • So the environment generates some observations

  • that the agent can observe.

  • It can basically process those observations and

  • generate a new action-- like whether to move the paddle

  • to the left or to the right.

  • And then, based on that, it will get some reward.

  • In this case, it will be the score.

  • And then, using that information,

  • it will learn from this environment how to play.
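
That loop (observation in, action out, reward back) is the whole interface. Here is a minimal sketch with a random stand-in policy; the Breakout environment name is an assumption and needs gym's Atari extras:

```python
# Sketch of the agent-environment loop with a random stand-in policy.
import numpy as np
from tf_agents.environments import suite_gym

env = suite_gym.load('Breakout-v0')  # assumed name; needs gym[atari]
num_actions = env.action_spec().maximum + 1

time_step = env.reset()                      # first observation
episode_return = 0.0
while not time_step.is_last():
    action = np.random.randint(num_actions)  # random stand-in for the agent
    time_step = env.step(action)             # observation, reward, discount
    episode_return += time_step.reward       # here, the game score
print('Episode return:', episode_return)
```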

  • So one thing that I think is

  • critical for people who have done

  • a lot of supervised learning is the

  • main difference between supervised learning

  • and reinforcement learning: in supervised learning,

  • you can think of it as, for every action

  • that you take, they give you a label.

  • An expert will have labeled that case.

  • That is simple.

  • It'll give you the right answer.

  • For this specific image, this is an image of a dog.

  • This is an image of a cat.

  • So you know what is the right answer,

  • so every time you make a mistake,

  • I will tell you what is the right answer to that question.

  • In reinforcement learning, that doesn't happen.

  • Basically, you are playing this game.

  • You are interacting with the game.

  • You take a bunch of actions,

  • and you don't know which one was the right action, which

  • was the correct action, and which was the wrong one.

  • You only have this reward function that tells you, OK, you

  • are doing kind of OK,

  • or you are not doing that well.

  • And based on that, you need to infer, basically,

  • what other possible actions you could have taken to improve

  • your reward, or maybe you're doing well now, but maybe later

  • you do worse.

  • So it's also a dynamic process going on over here.

  • AUDIENCE: How is the reward function

  • different from the label?

  • SERGIO GUADARRAMA: So I think the main difference is this.

  • The reward function is only an indicator

  • that you are doing well or badly, but it

  • doesn't tell you what is the precise action you

  • need to take.

  • The label is more like the precise output the model should produce.

  • You can think, in supervised learning,

  • I tell you what is the right action.

  • I tell you the right answer.

  • If I give you a mathematical problem, I'm going to say,

  • x is equal to 2.

  • That is the right answer.

  • If I tell you, you are doing well,

  • you don't know what was the actual answer.

  • You don't know if it was x equal 2 or x equal 3.

  • If I tell you it's the wrong answer,

  • you're still not going to know what the right answer was.

  • So basically, that's the main difference: a reward

  • function only gives you some indication

  • about whether you are doing well or not,

  • but doesn't give you the proper answer--

  • or the optimal answer, let's say.
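
To make that contrast concrete, here is a toy sketch; the field names are purely illustrative:

```python
# Illustrative toy contrast between the two signals, not library code.

# Supervised learning: every example carries the right answer.
supervised_example = {'input': 'image_042.png', 'label': 'cat'}

# Reinforcement learning: you only get a scalar score for what you did.
rl_feedback = {'action_taken': 'move_left', 'reward': 0.1}
# reward=0.1 says "that was OK-ish"; it never names the optimal action.
```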

  • AUDIENCE: Is it better for the reward to be very general instead

  • of very specific?

  • SERGIO GUADARRAMA: Mhm.

  • AUDIENCE: Like "you are doing well,"

  • instead of "the direction you are moving is the right one."

  • OSCAR RAMIREZ: It depends quite a bit on the environment.

  • And there is this whole problem of credit assignment.

  • So trying to figure out what part of your actions

  • were the ones that actually led to you receiving this reward.

  • So if you think about the robot hopping,

  • you could give it a reward that may be

  • its current forward velocity.

  • And you're trying to maximize that,

  • and so the robot should learn to run as fast as possible.

  • But maybe bending the legs down so

  • you can push yourself forward will help you move forward

  • a lot faster, but maybe that action will actually move you

  • backwards a little bit.

  • And you might even get punished instantaneously

  • for that action, but it's part of the whole set of actions

  • during an episode that will lead you to moving forward.

  • And so the credit assignment problem is like,

  • there is a set of actions

  • for which we might have even gotten negative reward,

  • but we need to figure out that those actions led

  • to positive reward down the line.

  • And the objective is to maximize the discounted return.

  • So, a discounted sum of rewards over a number of time steps.
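
Written out, that objective is the discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ...; a small helper makes it concrete (gamma = 0.99 is just a common choice):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

# A -1 reward now can still be worth it if it leads to +10 later:
print(discounted_return([-1.0, 0.0, 10.0]))  # -1 + 0.99**2 * 10 = 8.801
```

This is also why the credit assignment discussion above matters: an action with negative immediate reward can still raise the return.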

  • SERGIO GUADARRAMA: Yeah, that's a critical part.

  • We care about long-term value,

  • not only the immediate reward.

  • It's not just you telling me plus 1--

  • that's not so important, because what I want

  • to know is not whether I'm playing the game well right now,

  • but whether I'm going to win the game at the end.

  • That's what I really care about--

  • whether I am going to be able to move the robot to that position.

  • In the middle, sometimes those things are OK.

  • Some things are not bad.

  • But sometimes, I make an action.

  • Maybe I move one leg and I fall.

  • And then I could not recover.

  • But then maybe it was a movement I did 10 steps ago

  • that made my leg wobble.

  • And now, how do I connect which action made me fall?

  • Sometimes it's not very clear.

  • Because it's multiple actions-- in some cases,

  • even thousands of actions-- before you

  • get to the end of the game, basically.

  • You can think also of games like Go:

  • is this stone really going to make you lose?

  • Probably there's no single stone

  • that's going to make you lose, but 200 positions down

  • the line, that stone was actually very critical.

  • Because that has a ripple effect on other actions

  • that happen later.

  • And then you need to be able to estimate this credit

  • assignment-- which actions I need to change to improve

  • my overall reward, basically.

  • So I think this is to illustrate, a little further,

  • the different models of learning.

  • As we said before, supervised learning

  • is more like the classical classroom.

  • There's a teacher telling you the right

  • answer: memorize the answer, memorize that.

  • And that's what we do in supervised learning.

  • We almost memorize the answers with some generalization.

  • Mostly that's what we do.

  • And then in reinforcement learning,

  • it's not so much about memorizing the answer.

  • Because even if I do the same actions,

  • in a different setting, if I say,

  • OK, go to the kitchen in my house, and I say,

  • oh, go to the left, second door to the right.

  • And then I say, OK, now go to Kate's house

  • and go to the kitchen.

  • If you apply the same algorithm, you

  • will maybe end up in the bathroom.

  • Like, you'll go two doors to the right,

  • and then you go to the wrong place.

  • So even memorizing the answer is not good enough.

  • You know what I mean?

  • You need to adapt to the environment.

  • So that's what makes reinforcement learning

  • a little more challenging, but also more

  • applicable to many other problems in reality.

  • You need to play around.

  • You need to interact with the environment.

  • There's no such thing as, I can

  • think about what's going to be the best plan ahead of time

  • and never play with the environment.

  • We tried to write down some of these things

  • that we just mentioned-- that you

  • need to interact with the environment

  • to be able to learn.

  • This is very critical.

  • If you don't interact, if you don't try to walk,

  • if the robot doesn't try to move,

  • it cannot learn how to walk.

  • So you need to interact with the environment.

  • Also it will put you in weird positions and weird places,