[MUSIC PLAYING]
SERGIO GUADARRAMA: Today, we are going
to talk about reinforcement learning and how you can apply it
to many different problems.
So hopefully, by the end of the talk,
you will know how to use reinforcement learning
for your problem, for your applications,
and what we are doing at Google with this new technology.
So let me go back a little bit--
do you remember when you tried to do something difficult,
something hard that you needed to try many times?
For example, when you learned how to walk, do you remember?
I don't remember.
But it's pretty hard because nobody tells you
exactly how to do it.
You just keep trying.
And eventually, you're able to stand up, keep the balance,
wobble around, and start walking.
So what if we want to teach this cute little robot how to walk?
Imagine-- how would you do that?
How would you tell this robot how to walk?
So what we are going to do today is
learn how we can do that with machine learning.
And the reason for that is that if we
wanted to do this by coding a set of rules,
it would be really hard.
What kind of rules could we put in code that would actually
make this robot walk?
We have to handle coordination, balance.
It's really difficult. And then it would probably
just fall over,
and we wouldn't know what to change in the code.
Instead of that, we're going to use machine
learning to learn it from experience.
So the agenda for today is going to be this.
We are going to cover very quickly what
supervised learning and reinforcement learning are,
and what TF-Agents is, the things we just talked about.
And we will go through multiple examples,
so you can see how we can build up different pieces to actually
go and solve this problem: teach this robot how to walk.
And finally, we will have some take-home messages that you
can take with you today.
So how many of you know what supervised learning is?
OK.
That's pretty good.
For those of you who don't know, let's go
to a very simple example.
So we're going to have some inputs, in this case,
an image.
We're going to pass it through our model,
and it's going to produce some outputs,
in this case, cat or dog.
And then, we're going to tell you what the right answer is.
So that's the key aspect:
in supervised learning, we tell you the label,
the right answer,
so you can modify your model and learn from these mistakes.
In this case, you might use a neural net.
We have a lot of ways that you can learn.
And you can modify those connections
to basically learn over time what the right answer is.
The thing that supervised learning needs
is a lot of labels.
Many of you have probably heard about ImageNet.
It's a data set collected by Stanford.
It took over two years and $1 million
to gather all this data,
and they annotated millions of images with labels.
Say, in this image, there's a container ship.
There's a motor scooter.
There's a leopard.
You label all these images
so your model can learn from them.
And that worked really well: when
you have all these labels, you
can train your model on them.
The question is, how would you provide the labels
for this robot?
What are the right actions?
I don't know.
It's not that clear
what the right answer would be in this case.
So we are going to take a different approach, which
is reinforcement learning.
Instead of trying to provide the right answer--
like in a classical setting, where you go to class,
and they tell you the right answers:
you study, this is the answer for this problem,
we already know the right answer--
in reinforcement learning, we assume
we don't know what the right answer is.
We need to figure it out ourselves.
It's more like a kid
playing around, putting blocks together.
And eventually, they're able to stack them up, and stand
up.
And that gives you some reward.
You feel proud of it, and then you keep doing it.
Which exact actions you took?
Not so relevant.
So let's formalize a little more what reinforcement learning is
and make this concrete
with a simpler example, like this little game
that you're trying to play.
You want to bounce the ball around, move the paddle
at the bottom left or right, and then you
want to hit all these bricks,
clear them up, and win the game.
So we're going to have this notion of an agent,
or program, that's going to get some observation.
In this case, the agent is going to look at the game:
where is the ball, where are the bricks, where is the paddle,
and take an action.
I'm going to move to the left or I'm going to move to the right.
And depending on where you move, the ball will drop,
or you keep the ball bouncing back.
And we're going to have this notion of reward:
when you do well, you
get a positive reward, so you reinforce that behavior.
And when you do poorly, you get a negative reward.
So we can define simple rules
to basically encode this behavior as a reward function.
Every time you hit a brick, you get 10 points.
Which actions do you need to do to hit the brick?
I don't tell you.
That's what you need to learn.
But if you do it, I'm going to give you 10 points.
And if you clear all the bricks, I'm
going to give you actually a hundred
points to encourage you to actually play
this game very well.
And every time the ball drops, you
lose 50 points, which means, probably not
a good idea to do that.
And if you let the ball drop three times, game is over,
you need to stop the game.
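As a rough sketch of those rules-- this is an illustrative helper, not TF-Agents code-- the reward function just described could look like this:

```python
# Illustrative sketch of the Breakout reward rules described above
# (hypothetical helper function, not part of TF-Agents).

def breakout_reward(event, bricks_left, drops):
    """Return (reward, game_over) for a single game event."""
    reward = 0.0
    if event == "hit_brick":
        reward += 10.0        # every brick you hit is worth 10 points
        if bricks_left == 0:
            reward += 100.0   # bonus for clearing all the bricks
    elif event == "ball_dropped":
        reward -= 50.0        # dropping the ball is penalized
    game_over = drops >= 3    # three drops and the game is over
    return reward, game_over
```

Which actions lead to those events is exactly what the agent has to learn; the reward function only scores the outcomes.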
So the good thing about reinforcement learning is that
you can apply it to many different problems.
And here are some examples of where, over the last years, people
have been applying reinforcement learning.
It goes from recommender systems in YouTube, to data
center cooling, to real robots.
You can apply it to math, chemistry,
or the cute little robot in the middle, and to things
as complex as Go.
DeepMind applied it in AlphaGo and beat
the best player in the world using reinforcement learning.
Now, let me switch a little bit to TF-Agents and what it is.
The main idea behind TF-Agents: doing reinforcement learning
is not very easy.
It requires a lot of tools and a lot of things
that you would otherwise need to build on your own.
So we built this library that we use at Google,
and we open sourced it so everybody can
use it, to make reinforcement learning a lot easier to use.
So we make it very robust.
It's scalable, and it's good for beginners.
If you are new to RL, we have a lot
of notebooks and example documentation
that you can start working from.
And it also scales: you
can apply it to real, complex problems
and use it for realistic cases.
For people who want to create their own algorithm,
we also make it easy to add new algorithms.
It's well tested and easy to configure.
And furthermore, we build it on top of TensorFlow 2.0
that you probably heard over at Google I/O before.
And we built it in such a way that developing and debugging
are a lot easier.
You can use TF eager mode, Keras, and TF functions
to make things a lot easier to build.
Very modular, very extensible.
Let me cover a little bit the main pieces of the software,
so then when we go through the examples,
you have a better sense.
On the left side, we have all the data collection.
When we play this game, we are going to collect data.
We are going to play the game.
We're collecting data so we can learn from it.
And on the right side, we're going
to have a training pipeline.
When we have the data-- a data set,
or logs, or games we played-- we're going to train
and improve our model-- in this case, the neural net--
then deploy it, collect more data, and repeat.
So now, let me hand it over to Eugene,
who is going to go over the CartPole example.
EUGENE BREVDO: Thanks, Sergio.
Yeah, so the first example we're going to go over
is a problem called CartPole.
This is one of the classical control problems
where imagine that you have a pole in your hand,
and it wants to fall over because of gravity.
And you kind of have to move your hand left and right
to keep it upright.
And if it falls over, then game over.
If you move off the screen by accident, then game over.
So let's make that a little bit more concrete.
In this environment, the observation is not the images
that you see here.
Instead, it's a four vector containing
angles and velocities of the pole and the cart.
The actions are the values 0 and 1
representing being able to take a left or a right.
And the reward is the value 1.0 every time step or frame
that the pole is up and hasn't fallen over more
than 15 degrees from vertical.
And once it has, the episode ends.
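The reward and termination rule just described can be sketched in a few lines of plain Python (a hypothetical helper, not the real environment, which comes from OpenAI Gym):

```python
# Illustrative sketch of the CartPole reward rule described above
# (hypothetical helper, not the actual Gym environment).

def cartpole_step_reward(pole_angle_degrees):
    """Reward 1.0 for every time step the pole stays within
    15 degrees of vertical; past that, the episode ends."""
    if abs(pole_angle_degrees) <= 15.0:
        return 1.0, False   # (reward, episode_over)
    return 0.0, True
```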
OK, so if you were to implement this problem or environment
yourself, you would subclass the TF-Agents PyEnvironment class,
and you would provide two properties.
One is the observation spec property,
which defines what the observations are.
And you would implement the action spec property,
which describes what actions the environment allows.
And there are two major methods.
One is reset, which resets the environment
and brings the pole back to the center and vertical.
And the step method, which accepts the action, updates
any internal state, and emits the observation and the reward
for that time step.
Now, for this particular problem,
you don't have to do that.
We support OpenAI Gym, which is a very popular framework
for environments in Python.
And you can simply load CartPole from that.
That's the first line.
And now you can perform some introspection.
You can interrogate the environment: what is
the observation spec?
Here, you can see that it's a four-vector of floating
point values,
again describing the angles and velocities of the pole.
And the action spec is a scalar integer
taking on the values 0 and 1, representing left and right.
So if you had your own policy that you had built,
maybe a scripted policy, you would
be able to interact with the environment
by loading it, building your policy object,
resetting the environment to get an initial state,
and then iterating over and over again,
passing the observation or the state to the policy,
getting an action from that, passing the action back
to the environment, maybe calculating your return, which
is the sum of the rewards over all steps.
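That interaction loop can be sketched in plain Python. The toy environment and scripted policy below are hypothetical stand-ins, just to show the shape of the loop:

```python
# Sketch of the interaction loop described above, with a toy
# stand-in environment and a scripted policy (both hypothetical).

class ToyEnv:
    """Fixed-length episode; the observation is just the step count."""
    def reset(self):
        self.t = 0
        return self.t                    # initial observation
    def step(self, action):
        self.t += 1
        reward = 1.0                     # 1.0 per step, as in CartPole
        done = self.t >= 5
        return self.t, reward, done      # observation, reward, done

def scripted_policy(observation):
    return observation % 2               # alternate left (0) / right (1)

env = ToyEnv()
obs = env.reset()                        # reset to get an initial state
episode_return = 0.0
done = False
while not done:
    action = scripted_policy(obs)        # observation -> action
    obs, reward, done = env.step(action) # action -> next observation
    episode_return += reward             # return = sum of rewards
```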
Now, the interesting part comes when
you want to make a trainable policy
and you want it to learn from its successes in the environment.
To do that, we put a neural network in the loop.
So the neural network takes in the observations.
And the algorithm that we're talking about--
called policy gradients, also known as REINFORCE--
is going to emit probabilities over the actions that
can be taken.
So in this case, it's going to emit
a probability of taking a left and a probability of taking
a right, and that's parameterized
by the weights of the neural network, called theta.
And ultimately, the goal of this algorithm
is going to be modifying the neural network over time
to maximize what's called the expected return.
And as I mentioned, the return is
the sum of the rewards over the duration of the episode.
But this expectation is difficult to calculate
analytically.
So what we're going to do is we're going to sample episodes
by playing, we're going to get trajectories,
and we're going to store those trajectories.
These are observation action pairs over the episode.
We're going to add them up.
And that's our Monte Carlo estimate of the return.
OK?
And we're going to use a couple of tricks
to convert that expectation optimization problem into a sum
that we can optimize using gradient descent.
I'm going to skip over some of the math,
but basically, what we use is something called the log
trick to convert this gradient problem
into the gradient over the outputs of the neural network.
That's that log pi theta right there.
That's the output of the network.
And we're going to multiply that by the Monte Carlo
estimate of the returns.
And we're going to average over the time steps
within the episode and over many batches of episodes.
Putting this into code--
and by the way, we implement this for you,
but that's kind of a pseudo code here.
You get this experience when you're training,
you extract its rewards, and you do a cumulative-sum type
operation to calculate the returns.
Then, you take the observations over all the time steps,
and you calculate the logits, the log probabilities coming out
of the neural network.
You pass those to a distribution object--
this is a TensorFlow Probability distribution object--
to get the distributions over the actions.
And then you can calculate the full log probability
of the actions that were taken in your logged trajectories,
calculate this approximation
of the expectation, and take its gradient.
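The two computations in that pseudo code-- the cumulative-sum returns and the surrogate loss-- can be sketched in plain Python (this is an illustration, not the actual TF-Agents implementation):

```python
import math

# Sketch of the two REINFORCE computations described above:
# returns via a reverse cumulative sum, and the surrogate loss.

def returns_from_rewards(rewards):
    """Reverse cumulative sum: the return-to-go at each time step."""
    returns, total = [], 0.0
    for r in reversed(rewards):
        total += r
        returns.append(total)
    return list(reversed(returns))

def reinforce_loss(action_probs, returns):
    """Surrogate loss -mean(log pi(a_t) * return_t); minimizing it
    with gradient descent maximizes the expected return."""
    terms = [math.log(p) * g for p, g in zip(action_probs, returns)]
    return -sum(terms) / len(terms)

# Three steps with reward 1.0 each give returns-to-go [3.0, 2.0, 1.0].
returns = returns_from_rewards([1.0, 1.0, 1.0])
```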
OK, so as an end user, you don't need
to worry about that too much.
What you want to do is load your environment
and wrap it in something called a TFPyEnvironment.
That eases the interaction between the Python
environment
and the neural network, which is being executed
by the TensorFlow runtime.
Now, you can also create your neural network.
And here, you can write your own.
And basically, it's a sequence of Keras layers.
Those of you who are familiar with Keras,
that makes it very easy to describe your own architecture
for the network.
We provide a number of neural networks.
This one accepts a number of parameters that
configure the architecture.
So here, there are two fully connected layers
with sizes 32 and 64.
You pass this network and the specs
associated with the environment to the agent class.
And now you're ready to collect data and to train.
So to collect data, you need a place to store it.
And Sergio will talk about this more in the second example.
But basically, we use something called
replay buffers that are going to store these trajectories.
And we provide a number of utilities
that will collect the data for you,
and they're called drivers.
So this driver takes the environment,
the policy exposed by the agent,
and a number of callbacks.
And what it's going to do is
iterate collecting data: interacting
with the environment, sending it actions, collecting
observations, sending those to the policy-- it does that for you.
And each time it does that, for every time step,
it stores the result in the replay buffer.
So to train, you iterate calling the driver's run, which
populates the replay buffer.
Then you pull out all of the trajectories in the replay
buffer with gather_all, and you pass those
to agent.train, which updates the underlying neural networks.
And because policy gradients is something called an
on-policy algorithm, all that hard-earned
data that you've collected, you
have to throw away and collect more.
OK?
So that said, CartPole is a fairly straightforward
classical problem, as I mentioned.
And policy gradients is a fairly standard, somewhat simple
algorithm.
And after about 400 iterations of playing the game,
you can see that whereas you started
with a random policy, that can't keep the pole up at all.
After 400 iterations of playing the game,
you basically have a perfect policy.
And if you were to look at your TensorBoard
while you're training, you'd see a plot
like this, which shows that as the number of episodes
that are being collected increases the total return--
which is the sum of the rewards over the episode--
goes up pretty consistently.
And at around 400, 500 episodes, we have a perfect policy
that runs for 200 steps, at which point the environment says,
all right, you're good, you win.
And then you're done.
OK, so I'm going to hand it back over to Sergio
to talk about Atari and deep Q-learning.
SERGIO GUADARRAMA: Thank you, again.
So now we're going back to this example
that I talked about at the beginning, about how to play this game.
And now we're going to go through more details
how this actually works, and how this deep Q-learning works
to help us in this case.
So let's go back to our setting.
Now we have our environment where
we're going to be playing.
We're going to get some observations,
in this case, frames.
The agent's role is to produce different actions
like go left with the paddle or go right, and get some rewards
in the process, and then improve over time
by basically incorporating those rewards into the model.
Let's take a step back and say,
what if, while I'm playing Breakout,
I have seen so far what I've been doing:
the ball is going somewhere, I'm moving
in a certain direction, and then, what should I do now?
Should I go to the right or should I go to the left?
If I knew what was going to happen, it would be very easy.
If I knew, oh, the ball is going to go this way,
things are going to be like that, it would be easy.
But we don't easily know that.
We don't know what's going to happen in the future.
We don't know what's going to happen in the future.
Instead, what we're going to do
is try to estimate: if I move to the right,
maybe the ball will drop.
It's more likely that the ball drops
because I'm moving in the opposite direction
to where the ball is going.
And if I move to the left, on the contrary,
I'm going to hit the ball, I'm going to hit some bricks,
and I'm getting closer to clearing all the bricks.
So the idea is that I want to learn
a model that can estimate that:
whether this action is going to make things better in the future
or make them worse.
And that's something that we call the expected return.
This is the notion that Eugene was talking about before--
before, we were just summing
up the rewards.
And here, we're going to say, I want
to estimate, for this action, how much reward it's
going to give me in the future.
And then, I choose the action that,
according to my estimate, is the best.
So we can formulate this using math.
It's basically an expectation over the sum
of the rewards into the future.
And that's what we call the Q function, or critic.
It's a critic because it's going to basically tell us,
given some state and possible actions, which
action is actually better.
It criticizes them, in some way: if you take this action,
my expectation of the return is very high.
And if you take a different action, my expectation is low.
And then what we're going to do, we're
going to learn these Q functions.
Because we don't know.
We don't know what's going to happen.
But by playing, we can learn what
is the expected return by comparing our expectation
with the actual returns.
So we are going to use our Q function-- and in this case,
a neural net--
to learn this model.
And once we have a learned model,
then we can just take the best action according to our model
and play the game.
So conceptually, this looks similar to what we saw before.
We're going to have another neural net.
In this case, the output is going
to be the Q values, this expectation
of our future returns.
And the idea is we're going to get an observation,
in this case, the frames.
We're going to maybe have some history about it.
And then we're going to produce some Q values:
my current expectation if I move to the left,
and my current expectation if I move to the right.
And then I'm going to compare my expectation
with what actually happens.
If my expectation is too high,
I'm going to lower it down.
And if my expectation is too low, I'm going to increase it.
That way, we're going to change
the weights of this network to basically improve over time
by playing this game.
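In its simplest tabular form, that "raise or lower the expectation" update looks like the sketch below. DQN applies the same idea with a neural network in place of the table; the states, actions, and step size here are made up for illustration:

```python
# Tabular sketch of the Q-value update described above.
# DQN does the same thing with a neural network instead of a table.

def q_update(q, state, action, reward, next_q_values,
             alpha=0.5, gamma=0.99):
    """Move Q(state, action) toward what actually happened:
    the reward plus the discounted best future value."""
    target = reward + gamma * max(next_q_values)
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q

# One update: we expected 0.0 but got a reward of 10.0,
# so the estimate for ("s0", "left") moves up toward the target.
q = {("s0", "left"): 0.0, ("s0", "right"): 0.0}
q = q_update(q, "s0", "left", 10.0, next_q_values=[0.0, 0.0])
```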
Going back to how you do this in code:
basically, we're going to load this environment,
in this case from the Atari suite, which is also
available from OpenAI Gym.
I'm going to say, OK, load the Breakout game.
And now, we are ready to play.
We're going to have some observations--
we define what kind of observations
we have, in this case frames of 84
by 84 pixels-- and we also have multiple actions we can take.
In this game, we can only go left and right,
but there are other games in this suite that
can have different actions:
maybe jumping, firing, and other things
that different games have.
So now, we want to build the notion we described before.
We're going to define this Q network.
Remember, it's a neural net that is going
to represent these Q values.
It has some parameters that
define how many layers we want to have
and all those things.
And then, we're going to have the DQN agent, which
takes the network and an optimizer, and which
is going to basically be able to improve this network over time,
given some experience.
So this experience, we're going to assume
we have collected some data and we have played the game.
And maybe not very well at the beginning,
because we are doing random actions, for example.
So we're not playing very well,
but we can get some experience, and then we
can improve over time.
We try to improve our estimates.
Every time we improve, we play a little better.
And then we collect more data.
And then the idea is that this agent
is going to have a train method that
is going to go through this experience,
and is going to improve over time.
In general, for cases where games or environments are too slow,
we don't want to play one game at a time.
These computers can play multiple games in parallel.
So we have this notion of parallel environments:
you can play multiple copies of the same game at the same time,
so we can make learning a lot faster.
In this case, the agent plays four games in parallel
with the policy we just defined.
That way, we get a lot more experience
and can learn a lot faster.
So as we mentioned before, when we have collected all this data
by playing the game, in this case,
we don't want to throw the data away.
We can use it to learn.
So we're going to have this replay buffer, which
is going to keep all the data we're collecting--
different games go in different positions,
so we don't mix the games up--
and we just throw all the data into this replay
buffer.
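A minimal replay buffer can be sketched in a few lines of plain Python. The real one in TF-Agents is TFUniformReplayBuffer; this toy version just illustrates the add-and-sample idea:

```python
import random

# Minimal sketch of a replay buffer (the real TF-Agents one is
# TFUniformReplayBuffer; this toy version illustrates the idea).

class ReplayBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []

    def add(self, trajectory):
        if len(self.data) >= self.capacity:
            self.data.pop(0)              # drop the oldest experience
        self.data.append(trajectory)

    def sample(self, batch_size):
        """Uniform random batch of past experience."""
        return random.sample(self.data, batch_size)

buffer = ReplayBuffer(capacity=1000)
for step in range(10):
    buffer.add(("obs", step, "action", "reward"))
batch = buffer.sample(4)                  # random batch for training
```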
And in the code, it's simple.
We have this environment,
we create the replay buffer we have already defined,
and then, using the driver-- and this
is the important part-- we add to the replay buffer.
Every time you take an action in this game,
it's added to the replay buffer,
so later the agent has all that experience.
And because DQN is an off-policy method--
different from the previous method, which was on-policy--
in this case, we can actually use all the data.
We can keep all the data around and keep training
on the old data too.
We don't need to throw it away.
And that's very important because it
makes learning more data-efficient.
What we're going to do, when we have
collected the data in this replay buffer,
is sample from it.
We're going to sample different games,
different parts of the games.
We say, OK, let's replay the game,
and maybe get a different outcome this time.
What action would you take if you were in the same situation?
Maybe you moved to the left, and the ball dropped.
So maybe now you want to move to the right.
That's the way the model is going
to learn: by sampling games that you played
before, and improving your Q function, which is going
to change the way you behave.
So now let's try to put these things back together.
Let's go slowly because there's a lot of pieces.
So we have our Q network we're going
to use to define the DQN agent in this case.
We're going to have the replay buffer, where
we put all the data we
collect while we play.
We have this driver, which basically
drives the agent in the game,
making it play
and adding the data to the replay buffer.
And then, once we have enough data,
we can basically iterate with that data.
We can iterate, get batches of experience, different samples.
And that's what we're going to do to train the agent.
So we are going to alternate, collect more data,
and train the agent.
So every time we collect, we train the agent,
the agent gets a little better, and we
want to collect more data, and we alternate.
At the end, what we want to do is evaluate this agent.
So we have a method that says, OK, I
want to compute some metrics.
For example, how long are you playing
the game, how many points are you getting--
all those things that we want to compare.
And we have these methods:
take all these metrics, this environment,
and the agent's policy, and evaluate
over multiple games, multiple episodes.
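The two metrics just mentioned-- how long you play and how many points you get-- can be sketched like this, computed over logged episodes. Here each episode is simply its list of rewards; the numbers are made up:

```python
# Sketch of the evaluation metrics described above, computed from
# logged episodes (each episode is just its list of rewards here).

def average_return(episodes):
    """Mean of the per-episode returns (sums of rewards)."""
    return sum(sum(rewards) for rewards in episodes) / len(episodes)

def average_episode_length(episodes):
    """Mean number of time steps per episode."""
    return sum(len(rewards) for rewards in episodes) / len(episodes)

# Three toy episodes: two short games and one immediate loss.
episodes = [[10.0, 10.0], [10.0, 10.0, 100.0], [0.0]]
avg_return = average_return(episodes)          # (20 + 120 + 0) / 3
avg_length = average_episode_length(episodes)  # (2 + 3 + 1) / 3
```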
What this actually looks like is something like this.
For example, in the Breakout game,
the curve looks like this.
At the beginning, we don't score any points.
We don't know how to move the paddle,
the ball just keeps dropping, and we just
lose the game over and over.
Eventually, we figure out that by moving the paddle
in different directions, the ball bounces back
and starts hitting the bricks.
And after about 4 or 5 million frames, the model eventually
learns how to actually play this game.
You can see that around 4 or 5 million
frames, the model gets very good scores,
around 100 points,
and clears all the bricks.
We also show graphs of different games
like Pong, which is basically two different paddles trying
to bounce the ball between them.
Enduro, Qbert, there's another like 50 or 60
games in this suite.
And you can basically just change one line of code
and play a different game.
I'm not going to go through those details,
but just to make clear that it's simple to play different games.
Now, let me hand it back over to Eugene,
who's going to talk a little more about the Minitaur.
Thanks, again.
EUGENE BREVDO: OK, so our third and final example
is the problem of the Minitaur robot
and kind of goes back to one of the first slides
that Sergio showed at the beginning of the talk:
learning to walk.
So there is a real robot.
It's called the Minitaur.
And here, it's kind of failing hard.
We're going to see if we can fix that.
The algorithm we're going to use is called Soft Actor Critic.
OK.
So again, on the bottom is some images of the robot.
And you can see it looks a little fragile.
We want to train it, and we want to avoid
breaking it at the beginning, when our policy can't really
keep it up.
So what we're going to do is we're
going to model it in a physics simulator called PyBullet,
and that's what you see at the top.
And then, once we've trained it and we're
confident about the policy in simulation,
we're going to transfer it back onto the robot
and do some final fine-tuning.
And here, we're going to focus on the training and simulation.
So I won't go into the mathematical details
of Soft Actor Critic, but here are some fundamental aspects
of the algorithm.
One is that it can handle both discrete and continuous action
spaces.
Here, we're going to be controlling some motors
and actuators, so it's a fairly continuous action space.
It's data-efficient, meaning that all this hard earned data
that you run in simulation or you got from the robot,
you don't have to throw it away while you're training,
you can keep it around for retraining.
Also, the training is stable.
Compared to some other algorithms,
this one is less likely to diverge during training.
And finally, one of the fundamental aspects
of Soft Actor Critic is that it
combines an actor neural network and a critic neural network
to accelerate training and to keep it stable.
Again, for Minitaur, you can basically
do a pip install of PyBullet, and you'll
get Minitaur for free.
You can load it using the PyBullet suite in TF-Agents.
And if you were to look at this environment,
you'd see that there are about 28 sensors on the robot
that return floating point values,
different aspects of the configuration where
you are, forces, velocities, things like that.
And the action, there are eight actuators on the robot.
It can apply a force--
positive or negative, minus 1 to 1--
for each of those eight actuators.
Now, here's kind of bringing together the whole setup.
You can load four of these simulations,
have them running in parallel, and try
to maximize the number of cores that you're using
when you're collecting data.
And to do that, we provide the ParallelPyEnvironment, which
Sergio spoke about, wrapped in a TFPyEnvironment.
And now, we get down to the business
of setting up the neural network architecture for the problem.
First, we create the actor network.
And so what the actor network is going to do
is take these sensor observations, this 28-vector,
and emit samples of actuator values.
And those samples are random draws from,
in this case, a Gaussian or normal distribution.
So as a result, this action distribution network
takes something called a projection network.
And we provide a number of standard projection networks.
This one emits samples from a Gaussian distribution.
And the neural network that feeds into it
is going to be setting up the hyperparameters
of that distribution.
Now, the critic network, which is in the top right,
is going to take a combination of the current sensor
observations and the action sample
that the actor network emitted, and it's
going to estimate the expected return.
How much longer, given this action,
is my robot going to stay up?
How well is it going to gallop?
And that is going to be trained from the trajectories
from the rewards that you're collecting.
And that, in turn, is going to help train the actor.
So you pass these networks and these specs
to the Soft Actor Critic agent, and you can
look at its collect policy.
And that's the thing that you're going
to pass on the driver to start collecting data
and interacting with the environment.
So I won't go into the details of actually doing
that, because it's literally identical to the deep
Q-learning example before.
You need the replay buffer, you use the driver,
and you go through the same motions.
What I'm going to show is what you
should expect to see in the TensorBoard
while you're training the simulation.
On the top, you see the average episode length,
the average return as a function of the number of environment
steps that you've taken-- the number of time steps.
On the bottom, you see the same thing.
But on the x-axis, you see the number of episodes
that you've gone through.
And what you can see is that after about 13,000,
14,000 simulated episodes, we're starting to really learn
how to walk and gallop.
The episode lengths get longer because it
takes longer to fall down.
And the average return also goes up
because it's also a function of how long we stay up
and how well we can gallop.
So again, this is a PyBullet simulation,
a rendering of the Minitaur.
At the very beginning,
when the policy just emits random values--
the neural network is randomly initialized--
it can barely stay up; it basically falls over.
About halfway through training, it's
starting to be able to get up, maybe make a few steps,
falls over.
If you apply some external forces, it'll just fall over.
By about 16,000 iterations of this,
it's a pretty robust policy.
And it can stand, it can gallop.
If there's an external force pushing it over,
it'll be able to get back up and keep going.
And once you have that trained policy,
you can transfer it: export it as a SavedModel,
put it on the actual robot, and then start the fine-tuning
process.
Once you've fine-tuned it, you have a pretty neat robot.
In my head, when I look at this video,
I think of the "Chariots of Fire" theme song.
I don't know if you've ever seen it, but it's pretty cool.
So now, I'm going to return it back to Sergio
to provide some final words.
SERGIO GUADARRAMA: Thank you, again.
So pretty cool, no?
You saw how, from the very beginning,
it learns to walk, entirely in simulation.
But then, we can transfer it to a real robot
and make it work on the real robot.
So that's part of the goal of TF-Agents.
We want to make RL very easy.
You can download the code-- you can scan the code over there
to go to the GitHub repo-- and start playing with it.
We already have a lot of different environments,
many more than we covered today.
We went through three examples, but you can go there--
there are many other environments available.
We are hoping that Unity ML-Agents support
comes soon so you can also interact with Unity
environments.
Maybe some of you are actually
interested in contributing your own environments,
your own problems.
We are also very happy to take pull requests
and contributions of every kind.
For those of you who say, OK, those games are really good--
the games look nice, but I have my own problem.
What do I do?
So let's go back to the beginning, when
we talked about how you can define your own environment,
your own task.
This is the main piece that you need:
the API you need to implement
to bring your task or your problem to TF-Agents.
You define the specifications of your observations:
what things can I see?
Can I see images, can I see numbers, and what do they mean?
Then, what actions do I have available?
Do I have two options, three options, 10 different options?
What are the possibilities I have?
And then the reset method, because as we said,
while we're learning, we need to keep trying.
So we need to reset and start again.
And then the step function, which answers:
if I give you an action, what will happen?
How is the environment, the task, going to evolve?
How is the state going to change?
And you need to tell me the reward.
Am I doing well?
Am I going in the right direction,
or am I going in the wrong direction?
So I can learn from it.
So this is the main piece of code
that you will need to implement to solve your own problem.
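As an illustrative sketch of that API-- plain Python rather than an actual `py_environment.PyEnvironment` subclass, and with a made-up toy task-- the four pieces (observation spec, action spec, reset, step) might look like this:

```python
import numpy as np

class SimpleCorridorEnv:
    """Toy environment following the four-method API described above.
    Hypothetical task: the agent starts at position 0 and must reach
    position `goal` by stepping left (action 0) or right (action 1)."""

    def __init__(self, goal=5):
        self.goal = goal
        self.position = 0

    def observation_spec(self):
        # What can I see? A single integer: my current position.
        return {"shape": (1,), "dtype": np.int32, "name": "position"}

    def action_spec(self):
        # What actions do I have? Two options: 0 = left, 1 = right.
        return {"shape": (), "dtype": np.int32, "minimum": 0, "maximum": 1}

    def reset(self):
        # Start again from the beginning so we can keep trying.
        self.position = 0
        return np.array([self.position], dtype=np.int32)

    def step(self, action):
        # Given an action, evolve the state and report a reward:
        # "am I going in the right direction?"
        self.position += 1 if action == 1 else -1
        done = self.position >= self.goal
        reward = 1.0 if done else 0.0
        return np.array([self.position], dtype=np.int32), reward, done
```

In real TF-Agents code, the specs would be `BoundedArraySpec` objects and `step` would return a `TimeStep`, but the shape of the contract is the same.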
Additionally, we only talked about three algorithms,
but we have many more in the code base.
You can see here, there are many more coming.
So there's a lot of variety-- different algorithms
with different strengths that you can
apply to different problems.
So you can just try different combinations
and see which one actually works for your problem.
And also, we are taking contributions from other people
who say, oh, I have this algorithm,
I want to implement this one, I have this new idea--
and maybe you can solve other problems with your algorithm.
And furthermore, we apply this not only to these games,
but also at Google-- for example, in robotics,
in this really complex problem where we
have multiple robots trying to learn how to grasp objects
and move them to different places.
So in this case, we have all these robots just trying
to grasp and failing at the beginning.
And eventually, they learn: where is the object?
How do I move the hand?
How do I close the gripper in the proper place?
And now, how do I grasp it?
And this is a very complex task you can
solve with reinforcement learning.
Furthermore, you can also solve many other problems--
for example, recommender systems like YouTube recommendations,
Google Play, Navigation, News.
In those problems, you can basically
say, I want to optimize for my objective, my long-term value--
not only the short term, but the long-term value.
And reinforcement learning is really good when
you want to optimize for the long-term
value, not only the short term.
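That "long-term value" is the discounted return, the quantity reinforcement learning optimizes; a minimal sketch:

```python
def discounted_return(rewards, gamma=0.99):
    """Long-term value of a reward sequence: each reward is discounted
    by gamma per time step. Computed backwards for numerical clarity."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, with a high discount factor, a policy that forgoes an immediate reward to earn a bigger one later scores higher than a greedy one-- which is exactly why RL fits recommendation problems where short-term clicks can conflict with long-term satisfaction.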
Finally, we have put a lot of effort
into making this code available and
usable for a lot of people.
But at Google, we also defined these AI principles.
So when we develop all this code and make it available,
we follow these principles.
We want to make sure that it is used for things that benefit
society, that it doesn't reinforce unfair bias,
that it doesn't discriminate, that it is built
and tested for safety, that it incorporates privacy,
and that it is accountable.
We keep very high standards.
And we also want to make sure that everybody
who uses this code also embraces those principles
and tries to make it better.
And there are many applications we want to pursue.
We don't want this to be used for harm
and all the damaging things that we know could happen.
Finally, I want to thank the whole team.
You know, it's not just Eugene and me who made this happen.
There are other people behind it.
This is the TF-Agents team over here.
There are a lot of contributors who
have contributed to the code, and we
are very proud of all the work they
have done to make this happen, to make
it possible for this to be open source and available for everyone.
So as we said before, we want all of you
to join us on GitHub.
Go to the web page, download it, start playing with it.
A really good place to start is the Colab notebooks:
say, OK, I want to try the REINFORCE example,
I want to try DQN or Soft Actor-Critic.
We have notebooks you can play with, and Google Cloud
will run all these examples for you.
And also, if you have issues or pull requests, we welcome them.
So we want you to be part of the community
and contribute to make this a lot better.
And furthermore, we are also looking
for new applications-- what all of you
can do with these new tools.
There are a lot of new problems you can apply this to,
and we are looking forward to it.
So thank you very much, and hope to see you around.
[APPLAUSE]
[MUSIC PLAYING]