Benchmarking Safe Exploration in Deep Reinforcement Learning. What comes with this?
There's a blog post.
So I wanted to explain this, because when I saw the blog post, and I think a lot of Computerphile viewers would have the same reaction, what I thought is: what the hell is a Gym?
Some kind of meat basting tonight?
So it's a bunch of these environments, right?
That, um, that allow you to train?
Yeah, the OpenAI Safety Gym benchmark suite, which is a bunch of these environments that you can run your systems in and have them learn.
Is this anything to do with those AI Gridworlds we talked about?
Yeah, yeah, kind of.
In the same way that the Gridworlds paper did, this paper introduces environments that people can use to test their AI systems, and this is focusing specifically on safe exploration.
And it has a few differences.
They're kind of complementary. The environments in this are a little bit more complex; they're continuous in time and in space, in a way that the Gridworlds are all, like, very discrete.
You take turns and you move by one square.
Or, as in this case, it's a lot more like MuJoCo, where you actually have, like, a physics simulation that the simulated robots move around in.
So it's a slightly more complex kind of environment.
Um, but the idea is, in the same way as with Gridworlds or anything else, to have a standardized set of environments, so that, you know, everybody's comparing like with like, and you actually have sort of standardized measurements and you can benchmark.
So this uses constrained reinforcement learning: as well as the reward function, you have constraints, things the agent is supposed to keep below some threshold.
I haven't seen a proof for this, but I think that for any constrained reinforcement learning setup, you could have one that was just a standard reward function.
But this is, like, a much more intuitive way of expressing these things.
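To pin that down, here is the standard constrained-RL objective as I understand it, in my own notation rather than the paper's: the environment emits a cost signal $c(s_t, a_t)$ alongside the reward, and the agent maximizes expected return $J_r$ subject to keeping expected cost $J_c$ under a threshold $d$:

\[
\max_{\pi}\; J_r(\pi)
\quad\text{subject to}\quad
J_c(\pi) \le d,
\qquad
J_c(\pi) = \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t} c(s_t, a_t)\right]
\]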
So it kind of reminds me of, there's a bit in Hitchhiker's Guide where somebody is like, "Oh, you've got a solution?"
"No, but I've got a different name for the problem."
I mean, this is better than that, because it's a different way of formalizing the problem, a different way of sort of specifying what the problem is, and actually, a lot of the time, finding the right formalism is a big part of the battle, right?
Um, it's also nice because, if you're trying to learn... so, I did a video recently on my channel about reward modeling, where you actually learn the reward function rather than writing the reward function by hand.
Like, if you have a robot arm and it's making pens, and you want to retrain it to make mugs or something like that, then you would have to just relearn the reward function completely.
But if you have a constraint that it's learned, that's like "don't hit humans", then that can carry over to the new task.
In standard reinforcement learning, you're just trying to find a policy that maximizes the reward, and, like, how you do that is kind of up to you.
But what that means is that standard reinforcement learning systems, in the process of learning, will do huge amounts of unsafe stuff, right? Whereas in a constrained reinforcement learning setting, you actually want to keep track of how often the constraints are violated during the training process.
And you also want to minimize that.
Which makes the problem much harder, because it's not just making an AI system that doesn't crash.
So there are a few different robots. You've got Point, which is just a little round robot with a square on the front that can turn and drive around; Car, which is a similar sort of setup but has differential drive...
And, I kid you not, Doggo, which is a quadruped that walks around. And then you have a bunch of these different environments, which are basically like: you have to go over here and press this button, and then when you press the button, a different button will light up.
And the constraints are things like not breaking the vases, not bumping into the gremlins or whatever else, and the idea is that they can learn while violating these constraints as little as possible during the training process as well.
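To make that concrete, here's a minimal sketch of running one of these environments and tallying the cost signal. I'm going from memory of the Safety Gym release, so treat the environment ID 'Safexp-PointGoal1-v0' and the info['cost'] field as assumptions rather than gospel:

import gym
import safety_gym  # importing this registers the Safexp-* environments

# Assumed ID: Point robot, Goal task, difficulty level 1.
env = gym.make('Safexp-PointGoal1-v0')

obs = env.reset()
total_reward, total_cost = 0.0, 0.0

for _ in range(1000):
    action = env.action_space.sample()  # stand-in for a real policy
    obs, reward, done, info = env.step(action)
    total_reward += reward
    # Constraint violations come back as a scalar cost at each step;
    # its running sum is what the paper's cost-rate metric is built on.
    total_cost += info.get('cost', 0.0)
    if done:
        obs = env.reset()

print(f'return: {total_reward:.1f}, total cost: {total_cost:.1f}')

The point being that the cost is reported separately from the reward, so you can measure how much unsafe stuff the agent did during training, not just how well it scored.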
Um, and that's a really interesting and quite hard problem. And then they provide some benchmarks, and they show that standard reinforcement learning agents suck at this: they'll try doing anything to learn.
They don't care about the constraints?
Exactly, exactly.
And then there are a few different other approaches that do better.
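The better-performing baselines in the paper are Lagrangian methods, where the constraint gets folded into the reward with a penalty coefficient that adapts during training. A minimal sketch of that idea, my own simplification rather than the paper's code, with placeholder numbers:

# The multiplier lam rises while measured cost exceeds the budget d,
# and decays back toward zero once the policy satisfies the constraint.
d = 25.0       # cost budget the agent should stay under per episode
lam = 0.0      # Lagrange multiplier (penalty coefficient)
lam_lr = 0.01  # placeholder step size for the multiplier update

def penalized_reward(reward, cost):
    # The policy is trained on this combined signal instead of raw reward.
    return reward - lam * cost

def update_multiplier(episode_cost):
    # Dual ascent: raise lam when the episode blew the budget,
    # lower it (clipped at zero) when it didn't.
    global lam
    lam = max(0.0, lam + lam_lr * (episode_cost - d))

In the paper's baselines this sits inside standard algorithms, things like PPO and TRPO with the Lagrangian penalty bolted on.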
This is really nice: if you have ideas, then again, like the Gridworlds thing, you can download this and have a go.