There are a number of guidelines on the TensorFlow website that you can try to improve the performance of your training. This talk will focus on distributed training, that is, running training in parallel on multiple devices,
such as CPUs, GPUs, or TPUs, in order to make your training faster. With the techniques that we'll talk about in this talk, you can bring down your training time from weeks to hours, with just a few lines of code and a few powerful GPUs.
As you can see, as we increase the number of GPUs from one to four to eight, the images per second processed almost double every time. We'll come back to these performance numbers later with more details.
So before diving into the details of how you can get that kind of scaling in TensorFlow, first I want to cover a few high-level concepts and architectures in distributed training.
This will give us a strong foundation with which to understand the various solutions. Since our focus is on training today, let's take a look at what a typical training loop looks like.
Let's say you have a simple model like this with a couple of hidden layers.
Each layer has a bunch of weights and biases, also called the model parameters or trainable variables.
A training step begins with some processing on the input data. We then feed this input into the model and compute the predictions in the forward pass.
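To make that concrete, here is a minimal sketch of such a training step in TensorFlow. This is written in current eager style rather than being the code shown in the talk, and the layer sizes, optimizer, and loss function are illustrative assumptions.

    import tensorflow as tf

    # A small model with a couple of hidden layers; its weights and biases are the
    # trainable variables (the model parameters).
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.SGD(0.01)

    def train_step(features, labels):
        with tf.GradientTape() as tape:
            predictions = model(features)        # forward pass
            loss = loss_fn(labels, predictions)  # compute the loss
        # Backward pass: compute gradients and update the model parameters.
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # Example call with random data standing in for a preprocessed input batch.
    train_step(tf.random.normal([32, 20]),
               tf.random.uniform([32], maxval=10, dtype=tf.int32))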
Let's say you begin your training on a simple machine under your desk with a multi-core CPU.
Luckily, TensorFlow handles scaling onto a multi-core CPU for you automatically.
Next, you may speed up your training by adding an accelerator to your machine, such as a GPU or a TPU.
With distributed training, you can go even further.
With a number of techniques, we can go from one machine with a single device to one machine with multiple devices, and finally to multiple machines, possibly with multiple devices each, connected over the network.
And that's indeed what we do in a lot of Google systems, by the way. In the rest of this talk, we'll use the terms device, worker, or accelerator to refer to processing units such as GPUs or TPUs. So how does distributed training work?
Like everything else in software engineering, there are a number of ways to go about it when you think about distributing your training.
This approach, synchronous all-reduce training, has become more common with the rise of fast accelerators such as TPUs or GPUs.
In this approach, each worker has its own copy of the parameters.
There are no special parameter servers.
Each worker computes the loss and gradients based on a subset of the training samples.
Once the gradients are computed, the workers communicate among themselves to propagate the gradients and update the model's parameters.
All the workers are synchronized, which means that the next round of computation doesn't begin until each worker has received the updated gradients and updated its copy of the model.
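As a rough illustration of that idea, here is a toy NumPy simulation of synchronous all-reduce data parallelism (not TensorFlow's actual implementation): each worker computes gradients on its own subset of samples, the gradients are averaged across workers, and every replica applies the same update before the next step begins.

    import numpy as np

    # Toy simulation: a linear model y = X @ w with squared error, replicated on
    # several "workers" that each hold their own copy of the parameters.
    num_workers, dim, lr = 4, 8, 0.05
    rng = np.random.default_rng(0)
    w_true = rng.normal(size=dim)
    replicas = [np.zeros(dim) for _ in range(num_workers)]

    for step in range(200):
        # Each worker computes gradients on its own subset of training samples.
        grads = []
        for w in replicas:
            X = rng.normal(size=(32, dim))
            y = X @ w_true
            grads.append(2.0 * X.T @ (X @ w - y) / len(y))
        # "All-reduce": average the gradients across all workers.
        avg_grad = sum(grads) / num_workers
        # Every replica applies the same update, so the copies stay in sync.
        replicas = [w - lr * avg_grad for w in replicas]

    print(np.linalg.norm(replicas[0] - w_true))  # should be close to zero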
This approach works well when you have fast devices in a controlled environment, where the variance of step time between the different workers is small, combined with strong communication links between the devices, such as TPUs or multiple GPUs on a single machine. The parameter server approach has been around for a while, and it has been supported well in TensorFlow.
For GPUs, on the other hand, you can use the all-reduce approach out of the box. In the next section of this talk, we'll show you how you can scale your training using the all-reduce approach on multiple GPUs with just a few lines of code.
Before I get into that, I just want to mention another type of distributed training that you may have heard of, known as model parallelism. A simple way to think about model parallelism is that it is what you use when your model is so big that it doesn't fit in the memory of one device.
Now that you're armed with the fundamentals of distributed training architectures, let's see how you can do this in TensorFlow. As I already mentioned, we're going to focus on scaling to multiple GPUs with the all-reduce architecture. In order to do so easily,
I'm pleased to introduce the new DistributionStrategy API.
This API allows you to distribute your training in TensorFlow with very little modification to your code.
You don't need to worry about structuring your model in a way that the gradients or losses across devices are aggregated correctly; DistributionStrategy does that for you.
In our example, we're going to be using TensorFlow's high-level API called Estimator.
If you have used this API before, you might be familiar with the following snippet of code to create a custom estimator. It requires three arguments.
The first one is a function that defines your model, so it defines the parameters of your model, how you compute the loss and the gradients, and how you update the model's parameters.
The second argument is a directory where you want to persist the state of your model, and the third argument is a configuration called RunConfig, where you can specify things like how often you want to checkpoint, how often summaries should be saved, and so on. In this case, we use the default RunConfig.
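The snippet being described looks roughly like the following sketch, written against the TF 1.x Estimator API the talk uses. The tiny model function and the bucket path are placeholders standing in for the ResNet model function and the real model directory.

    import tensorflow as tf

    def model_fn(features, labels, mode):
        # Defines the model's parameters, how the loss is computed, and how the
        # parameters are updated (a tiny linear model here, for illustration only).
        logits = tf.layers.dense(features, 10)
        loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
        optimizer = tf.train.GradientDescentOptimizer(0.01)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    classifier = tf.estimator.Estimator(
        model_fn=model_fn,                  # 1) the function that defines the model
        model_dir="gs://my-bucket/resnet",  # 2) where to persist checkpoints and summaries
        config=tf.estimator.RunConfig())    # 3) the RunConfig (default settings here)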
We use a batch size of 1024, or 128 per GPU. Our model directory is going to point to the GCS bucket that's going to hold the checkpoints and summaries that we want to save.
We point our data directory to the SSD disk, which has the ImageNet dataset, and the number of GPUs over which we want to distribute and train our model is eight.
So let's run this model.
And as the model is starting to train, let's take a look at some of the code changes that are involved to change the ResNet model function. So this is the ResNet main function in the TensorFlow model garden repository.
First, we instantiate the MirroredStrategy object.
Then we pass it to the RunConfig as the train_distribute argument.
We create an estimator object with the RunConfig, and then we call train on this estimator object, and that's it.
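Put together, the change looks roughly like this. It is a sketch that reuses the model_fn from the earlier snippet and assumes an input_fn that returns a tf.data.Dataset; in TensorFlow 1.8 the strategy lived under tf.contrib.distribute, while later releases expose it as tf.distribute.MirroredStrategy.

    strategy = tf.contrib.distribute.MirroredStrategy()  # 1) instantiate the MirroredStrategy
    run_config = tf.estimator.RunConfig(
        train_distribute=strategy)                       # 2) pass it via train_distribute
    classifier = tf.estimator.Estimator(
        model_fn=model_fn,                               # the same model function as before
        model_dir="gs://my-bucket/resnet",               # placeholder bucket path
        config=run_config)                               # 3) estimator with the new RunConfig
    # input_fn: a function returning the tf.data input pipeline (shown later in the talk).
    classifier.train(input_fn=input_fn)                  # 4) call train, and that's it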
Let's look at a few performance benchmarks on the DGX-1. The DGX-1 is a machine on which we run deep learning models.
We're running mixed precision training with a per-GPU batch size of 256.
It also has eight Volta V100 GPUs. The graph shows the number of GPUs on the X axis and images per second on the Y axis.
So as we go from one GPU to eight, we were able to achieve a speedup of 7x, and this is performance right out of the box with no tuning. We're actively working on improving performance so that you're able to achieve more speedup and get more images per second when you distribute your model across multiple GPUs.
I'm going to show you how TensorFlow makes it easy for you to use the tf.data APIs to build efficient and performant input pipelines. Here's a simple input pipeline for ResNet-50.
We're going to use the tf.data APIs because datasets are awesome.
Let's say you have a lot of data that is sharded across a cloud storage service.
You want to read multiple files in parallel, and you can do this using the num_parallel_reads argument when you instantiate the TFRecordDataset API.
This allows you to increase your effective throughput.
We can also parallelize the map function for the transformations. You can parallelize the different transformations of the map function by using the num_parallel_calls argument.
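A sketch of such a pipeline is shown below, using TF 1.x-style names to match the era of the talk. The file pattern, the parsing details, and the parallelism and batch values are illustrative assumptions, not the exact garden code.

    import tensorflow as tf

    def parse_fn(serialized_example):
        # Decode one TFRecord example into an image/label pair (simplified).
        features = tf.parse_single_example(
            serialized_example,
            {"image": tf.FixedLenFeature([], tf.string),
             "label": tf.FixedLenFeature([], tf.int64)})
        image = tf.image.decode_jpeg(features["image"], channels=3)
        image = tf.image.resize_images(image, [224, 224])
        return image, features["label"]

    files = tf.data.Dataset.list_files("gs://my-bucket/imagenet/train-*")
    dataset = tf.data.TFRecordDataset(files, num_parallel_reads=8)  # read files in parallel
    dataset = dataset.map(parse_fn, num_parallel_calls=8)           # parallelize transformations
    dataset = dataset.shuffle(buffer_size=10000).repeat().batch(256).prefetch(1)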