Structural Abstraction Experiments in Reinforcement Learning
Robert Fitch¹, Bernhard Hengst¹, Dorian Šuc¹, Greg Calbert², and Jason Scholz²

¹ National ICT Australia, University of NSW, Australia
{robert.fitch,bernhard.hengst,dorian.suc}@nicta.com.au
² Defence Science and Technology Organization, Salisbury SA, Australia
{greg.calbert,jason.scholz}@dsto.defence.gov.au
Abstract. A challenge in applying reinforcement learning to large problems is how to manage the explosive increase in storage and time complexity. This is especially problematic in multi-agent systems, where the state space grows exponentially in the number of agents. Function approximation based on simple supervised learning is unlikely to scale to complex domains on its own, but structural abstraction that exploits system properties and problem representations shows more promise. In this paper, we investigate several classes of known abstractions: 1) symmetry, 2) decomposition into multiple agents, 3) hierarchical decomposition, and 4) sequential execution. We compare memory requirements, learning time, and solution quality empirically in two problem variations. Our results indicate that the most effective solutions come from combinations of structural abstractions, and encourage development of methods for automatic discovery in novel problem formulations.
1 Introduction
When specifying a problem such as learning to walk, learning to manipulate objects or learning to play a game as a reinforcement learning (RL) problem, the number of states and actions is often too large for the learner to manage. It is straightforward to describe the state of a system using several variables to represent the various attributes of its components. Similarly, individual system component actions can be used to describe a joint action vector. However, this approach very quickly leads to an intractable specification of the RL problem. A tabular state-action representation requires storage proportional to the product of the sizes of all the state and action variables, leading to intractable storage and time complexity.
Function approximation can often help by generalizing the value function across many states. Function approximation is based on supervised learning. Gradient descent methods, such as artificial neural networks and linear function approximation, are frequently used in RL for this purpose [1]. However, there are reasons to believe that simple function approximation will not scale to larger, more complex, problems.
Some problems may have structure and regularities that are not readily apparent and are difficult to learn in one step [2]. Learning may be made tractable by the right problem decomposition and by learning in stages, reusing and combining previously learnt concepts. Learning in complex environments proceeds at the "frontier of receptivity" where new tasks can only be learnt once component child tasks have been learnt [3].
Our objective in this paper is to explore several structural abstraction methods and to study their effect on storage requirements, learning time and the quality of the solution for one problem with two different levels of inter-dependency. The contribution of this paper is to give a systematic description and experimental evaluation of a collection of known structural abstractions that can be used in a wide variety of RL problems. Our motivation comes partly from the desire to develop tractable decision systems for complex multi-agent games, and partly from the wish to gain insight into automating the modelling process.
Our experiments in this paper are based on a task that requires two taxis acting together to pick up and deliver two passengers on a 5 × 5 grid. This task is chosen because it is intuitive and small enough that we can still compute the optimal solution, yet complex enough to make it challenging to demonstrate various abstractions.
The formalisms underpinning the experiments in this paper are those of Markov decision problems (MDPs) and model reductions expressed as homomorphic mappings. The latter are important in establishing whether one system is a model of another and which properties of the original system the model retains [4]. The abstractions we will explore are symmetry reductions, multi-agent decompositions, hierarchical decompositions, sequential execution and some combinations.
In the rest of this paper we will first introduce the two-taxi task as our working example and summarize the formalisms. We then in turn describe the various structural abstractions of the problem, including related work. In each case we discuss the computing requirements in relation to solution quality for the taxi task and comment on the scaling potential.
2 The Two-Taxi Problem
Taxi problems have been used by several researchers in related work. They have analogues in other problems, such as logistics problems involving transporters and cargo. We have taken two taxis [5] working jointly to pick up and deliver two passengers on the 5 × 5 grid shown in Figure 1.
The objective is to deliver both passengers in the minimum number of steps. At each time step a taxi can take a navigation action and attempt a move one position North, South, East or West. These actions are stochastic. With 92.5% probability they move the taxi in the intended direction and with 7.5% probability they move the taxi randomly in one of the other three directions. Other actions available are to pick up passenger one, pick up passenger two, put down its passenger, or stay. Taxis can only carry one passenger at a time. A move into a barrier or the grid boundary is counted as a step and results in the taxi staying where it is. There are four special locations, marked R, G, Y and B, that represent the possible source and destination locations of the passengers. At the commencement of every episode each passenger is located at random on one of the special locations and each taxi is located at random anywhere on the grid. Each passenger must be picked up by one of the taxis and put down at his or her destination location, even if the source and destination locations are the same. If a passenger is waiting to be picked up and both taxis are at this location, then a joint action by both taxis to pick up the passenger results in only taxi one being successful.

Fig. 1. The two-taxi problem.
We consider two versions of the problem, with and without taxi collisions. Without collisions the taxis are able to occupy the same cell and pass each other without impediment. With collisions the taxis cannot occupy the same cell, nor can they pass each other in the sense that they cannot occupy each other's former cell following a joint action. A move resulting in a collision leaves the taxi locations unchanged. Initial taxi positions on the same cell are disallowed with collisions, although passengers can have the same pickup and destination locations.
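As an illustration of these dynamics, the following minimal Python sketch shows one way the joint navigation step could be simulated. The coordinate scheme, the barrier representation, and our reading of the no-passing rule as a mutual swap of cells are assumptions for illustration, not the authors' implementation.

    import random

    MOVES = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}

    def move(pos, action, barriers, size=5):
        # Stochastic navigation: 92.5% intended direction, otherwise
        # uniformly one of the other three directions.
        if random.random() < 0.925:
            direction = action
        else:
            direction = random.choice([d for d in MOVES if d != action])
        dx, dy = MOVES[direction]
        nxt = (pos[0] + dx, pos[1] + dy)
        # A move into a barrier or off the grid counts as a step in place;
        # barriers is a set of ordered (from_cell, to_cell) pairs.
        if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or (pos, nxt) in barriers:
            return pos
        return nxt

    def joint_move(t1, t2, a1, a2, barriers):
        # Version with collisions: taxis may not end on the same cell, and we
        # read the no-passing rule as forbidding a mutual swap of cells.
        # A colliding joint move leaves both taxi locations unchanged.
        n1, n2 = move(t1, a1, barriers), move(t2, a2, barriers)
        if n1 == n2 or (n1 == t2 and n2 == t1):
            return t1, t2
        return n1, n2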
3 Modelling Formalisms
We now introduce RL, its application to the two-taxi task and abstractions formalized as homomorphisms.

3.1 RL Problem Formulation
Underlying RL is the Markov decision problem framework. We consider a system modelled at a base level by a set of states. At each time-step the system transitions from state to state depending on an input action to the system and generating a real-valued reward. We can describe a Markov decision problem (MDP) as a tuple ⟨S, A, T, R, S0⟩ where S is the finite set of states, A is the finite set of actions, T : S × A × S → [0,1] is the transition function giving the probability of moving from one state to another after taking a particular action, R : S × A × ℝ → [0,1] is the probability of receiving a reward when taking an action, and S0 : S → [0,1] is the starting state distribution giving the probability of starting in each state. The Markov property requires that the states and actions can be defined so that the probabilities are independent of the history of transitions and rewards. In addition we assume the probabilities are independent of time, i.e. stationary. An MDP includes an optimality criterion, usually to maximize a function of the sum of future rewards. The objective is to find an optimal policy π : S → A providing the action to take in any state to optimize the value function. A semi-MDP is a generalization of an MDP where actions can be extended, taking several time steps to complete.
Specifying the two-taxi problem above as an MDP, we proceed to define the state s ∈ S by the variables (t1, t2, p1, p2, d1, d2), where t1 and t2 each specify the respective taxi location as one of the 25 grid values, p1 and p2 indicate the location of the passengers at either one of the four pickup locations, riding in either taxi, or delivered, and d1 and d2 each specify one of the four destinations of the respective passengers. The total number of possible collision-free states is therefore 25 × 25 × 7 × 7 × 4 × 4 = 490,000. The actions a ∈ A are defined by the variables (a1, a2) representing the simultaneous joint actions of the two taxis. There are eight actions available per taxi, giving a joint action set of 64. At each time step we specify a joint reward of -1 until both passengers are delivered. Maximizing the sum of future rewards achieves our objective to minimize the number of steps to complete the task.
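The flat model sizes quoted here can be reproduced directly; a small sketch (the encoding order is arbitrary):

    # 25 cells per taxi; 7 passenger statuses (4 pickup locations,
    # in taxi 1, in taxi 2, delivered); 4 possible destinations each.
    TAXI_CELLS, PASS_STATUS, DESTS, TAXI_ACTIONS = 25, 7, 4, 8

    states = TAXI_CELLS**2 * PASS_STATUS**2 * DESTS**2   # 490,000
    joint_actions = TAXI_ACTIONS**2                      # 64
    print(states, states * joint_actions)                # 490000 31360000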
We assume the learner can fully observe the state, but that the transition and reward functions are not known beforehand. We use simple Q-learning [6] to learn the optimal policy as our base case. Q-learning involves updating an action-value function stored as a Q-table with one entry for each state-action pair. The update performed after each step is called a Bellman backup and is given by

Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′))

where the system has transitioned from s to s′ after taking action a and receiving reward r. The discount rate for future rewards is γ, set to 1.0 in our experiments. The learning rate is α. It is well known that, given sufficient exploration of all states and actions and a suitable reduction in the learning rate, the action-value function will converge to its optimal value Q∗, and the optimal policies are π∗(s) = argmax_a Q∗(s, a). While there are many speedup techniques such as prioritized sweeping and eligibility traces, we have used this simple form of RL with α = 0.1 in all our experiments for consistency. Our exploration policy is ε-greedy with ε generally set to 10%. All performance is measured using greedy policies. We are primarily interested in comparing learning efficiency with various model abstractions and have found that similar relationships hold for any setting of α and adequate exploration.
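A minimal sketch of this tabular learner, with α = 0.1, γ = 1.0 and ε-greedy exploration as above; the dictionary-backed Q-table and function interfaces are illustrative assumptions:

    import random
    from collections import defaultdict

    ALPHA, GAMMA, EPSILON = 0.1, 1.0, 0.1
    Q = defaultdict(float)                     # keyed by (state, action)

    def choose_action(state, actions):
        # epsilon-greedy: explore 10% of the time, otherwise act greedily.
        if random.random() < EPSILON:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def bellman_backup(s, a, r, s_next, actions):
        # Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma max_a' Q(s',a'))
        target = r + GAMMA * max(Q[(s_next, a2)] for a2 in actions)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target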
3.2 Abstractions as Homomorphisms
The idea of an abstraction is to simplify the representation of a system in such a way that the problem of interest can be solved more efficiently. Ideally we would like a model to perfectly preserve the system characteristics of interest so that we only need to solve the simplified model and the solution is guaranteed to be a solution for the original system. A useful algebraic formalism for modelling abstractions is a homomorphism. We are particularly interested in homomorphisms in the framework of MDPs and semi-MDPs. A homomorphic mapping from a system to a model means that if we perform an action or temporally extended action in a system state then the resultant model state is the same whether we map the resultant system state to the model or whether we first map the system state and (extended) action to the model and perform the model action. State transition homomorphisms by themselves do not guarantee that the reduced model is relevant to the problem at hand. An MDP homomorphism ensures relevancy by also mapping the reward function and thus preserving the implicit goal specification. A useful exposition of MDP and semi-MDP homomorphisms is given by Ravindran [7].
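Concretely, for MDPs this commuting requirement is usually written as follows (our rendering of the standard condition from [7], with state map h and state-dependent action maps g_s; T′ and R′ denote the model's transition and expected-reward functions):

    T'(h(s), g_s(a), h(s')) = \sum_{s'' \in h^{-1}(h(s'))} T(s, a, s''),
    \qquad R'(h(s), g_s(a)) = R(s, a)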
4 Abstraction Classes
We study several exact and approximate model abstractions (homomorphisms) and apply them to the two-taxi problem. The results are shown in Table 1 and Figure 2 and explained in this section.
The base case Q-table size for the collision-free two-taxi problem is 490,000 states × 64 actions = 31,360,000 entries. With collisions the base case Q-table size is 30,105,600, as taxis cannot occupy the same state. We have calculated the optimal solution using dynamic programming for both base cases. Q-learning on the base case takes over four billion time steps to converge. We have defined convergence to be the number of time-steps required for the mean of the results to be within 5% of its final value. Q-learning oscillates slightly short of optimal performance with the learning rate α fixed at 0.1 (see Table 1, "base case"). For the base case and other abstractions we have found that the mean converged average steps per episode over multiple runs has a standard deviation of less than 0.06.

4.1 Symmetry
Symmetries in MDPs arise when states, actions or a combination of both can be mapped to an equivalent reduced MDP that has fewer states and actions. For example, if two actions have identical transition and reward probability functions between any two states then one action can be eliminated. An example of a problem with state symmetry is learning to exit similar rooms that differ only in color. More general symmetries often arise where state-action pairs are mapped together, both with different states and actions from the original in the reduced model [8]. We call these forms of symmetry simple state-action symmetries. Another class of symmetry that overlaps the class of simple state-action symmetry is bisimulation homogeneity [9]. In bisimulation homogeneity the states of a problem can be partitioned into blocks such that the inter-block transition probabilities and reward probabilities are constant for all actions. An example of model reduction using this type of symmetry is the elimination of an irrelevant random variable in a state description.

Fig. 2. Two-taxi task results: without collisions (top) and with collisions (bottom).
The taxi problem exhibits simple state-action symmetry, assuming homogeneity of taxis and passengers. In one form of symmetry, two different navigation actions by the taxis can be interchanged without affecting the solution whenever both taxis occupy the same location. If the taxis are at different locations and they are interchanged, the two situations are also symmetrical and can be mapped to the same abstract state. The second mapping will not reduce the states if both taxis are on the same cell, a situation that can only arise in the collision-free problem. Similarly, swapping passenger-destination pairs can be exploited as a state-action symmetry abstraction, as can swapping both taxis and passengers. Specifically, state (t1, t2, p1, p2, d1, d2) with action (a1, a2) is symmetrically equivalent to (t2, t1, p1, p2, d1, d2) with action (a2, a1), where p1 and p2 need to be adjusted whenever the passengers are in the taxis. Similar homomorphic state-action mappings can be defined for a passenger swap and for both a taxi and passenger swap, providing a 4-to-1 state reduction mapping when taxi and passenger-destination locations differ.
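In practice such a mapping can be implemented as a canonicalization function that sends every member of a symmetry class to one fixed representative before the Q-table lookup. A sketch, assuming an explicit encoding for "riding in taxi i" and for the two pickup actions (both names are our own):

    IN_TAXI1, IN_TAXI2 = 'in_taxi_1', 'in_taxi_2'   # assumed encodings
    PICKUP1, PICKUP2 = 'pickup_1', 'pickup_2'       # assumed action names

    def swap_taxis(state, action):
        (t1, t2, p1, p2, d1, d2), (a1, a2) = state, action
        follow = {IN_TAXI1: IN_TAXI2, IN_TAXI2: IN_TAXI1}
        # Passengers riding in a taxi must follow the relabelled taxi.
        p1, p2 = follow.get(p1, p1), follow.get(p2, p2)
        return (t2, t1, p1, p2, d1, d2), (a2, a1)

    def swap_passengers(state, action):
        (t1, t2, p1, p2, d1, d2), (a1, a2) = state, action
        follow = {PICKUP1: PICKUP2, PICKUP2: PICKUP1}
        # Pickup actions must follow the relabelled passengers.
        a1, a2 = follow.get(a1, a1), follow.get(a2, a2)
        return (t1, t2, p2, p1, d2, d1), (a1, a2)

    def canonical(state, action):
        # Choose a fixed representative of the 4-element symmetry class
        # generated by the taxi swap and the passenger swap.
        variants = [(state, action)]
        variants.append(swap_taxis(state, action))
        variants += [swap_passengers(s, a) for s, a in variants[:2]]
        return min(variants, key=repr)   # any total order will do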
The above abstraction reduces the two-taxi problem without collisions to 131,950 states, compared to 490,000 for the base case. The approximately 3.7-fold reduction in the number of states in the reduced model leads to a commensurate speedup in learning, as illustrated in Figure 2. The symmetries also apply to the taxi task with collisions, with similar convergence speedup results. These state and action abstractions are an exact homomorphism of the original MDP and therefore converge to an optimal policy of the original problem.

Table 1. Memory, learning times and performance of various abstractions on the two-taxi tasks, collision free (left) and with collisions (right). Both cases use the same abstractions, except with MAS and H-MAS, where in the case with collisions the position of the other taxi (TOP) is also included in the state.

                collision free                           with collisions
                memory     steps to   converged ave.    memory     steps to   converged ave.
                (|Q|)      converge   steps/episode     (|Q|)      converge   steps/episode
    optimal     3.1×10^7   dp         14.06             3.0×10^7   dp         14.46
    base case   3.1×10^7   4.4×10^9   14.46             3.0×10^7   4.2×10^9   14.80
    symmetry    8.4×10^6   1.5×10^9   14.45             7.2×10^6   1.3×10^9   14.82
    mas         4.2×10^3   7×10^5     15.18             1.0×10^5   9×10^6     16.90
    h-joint     2.7×10^5   3.4×10^7   16.56             2.0×10^5   4.6×10^7   20.78
    h-mas       5.7×10^3   9×10^5     14.24             1.1×10^5   1×10^7     15.41
    sequential  7.8×10^6   1.2×10^9   14.26             7.5×10^6   1.1×10^9   14.76
    seq.+sym.   2.1×10^6   3.9×10^8   14.27             1.8×10^6   3.5×10^8   14.75
For the general collision-free task involving n taxis on a size-k grid with m passengers and l special locations, the number of states is k^n · (l+n+1)^m · l^m. This can be shown to abstract to C^{k+n−1}_n · C^{(l+n+1)l+m−1}_m states³ when taxis and passengers are indistinguishable.

³ C^n_k (n choose k) = n!/(k!(n−k)!)
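The formula can be checked against the counts quoted above for our instance (n = 2 taxis, k = 25 cells, m = 2 passengers, l = 4 special locations):

    from math import comb

    n, k, m, l = 2, 25, 2, 4
    flat = k**n * (l + n + 1)**m * l**m                           # 490,000
    reduced = comb(k + n - 1, n) * comb((l + n + 1)*l + m - 1, m)
    print(flat, reduced)   # 490000 131950: the ~3.7-fold reduction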
An approach to generalizing exact symmetries is to consider approximate symmetries. A bounded parameter Markov decision process [10] can be used to bound the error when inter-block transition and reward probabilities are not constant but constrained to a range of values. Symmetries will only take us so far in model reduction. We will now turn to another type of structural abstraction.

4.2 Multiagency
Modelling a system as a simultaneous interaction of multiple autonomous agents can be a significant abstraction [11]. When a problem is specified using a multi-dimensional action description, it is natural to attempt this type of decomposition.
In multi-agency the overall homomorphism is approximated by combining independent homomorphic mappings for each agent. One of the major issues is the abstraction of the global reward signal for each agent so that their combined behavior is in the common good [12]. If the actions of all the agents are independent and the global reward is the sum of the rewards for each agent, then the multi-agent homomorphism is exact and will produce an optimal solution.
This total decomposition is rare for interesting problems and does not hold for either version of the two-taxi task. For optimal performance the taxis need a coordinated strategy for picking up passengers and need to avoid each other when collision is possible.
One homomorphic approximation arbitrarily pre-allocates one passenger to each taxi. One mapping takes the joint state-action pair to a state-action pair for each reduced MDP:

    h( (t1, t2, p1, p2, d1, d2), (a1, a2) ) = ( (t1, p1, d1), a1 ), ( (t2, p2, d2), a2 )
                                            = ( (t, p, d), a ), ( (t, p, d), a )    (using symmetry)
Each agent receives the global reward. Effectively we have defined two reduced MDPs (one for each agent) that together comprise the two-taxi problem. When the two agents are identical we can additionally apply the symmetry from Section 4.1 and use only one MDP for both agents. This not only saves storage space, but transfers learning between the two agents. The actions per taxi can be reduced by one, as they now do not need to specify which passenger is to be picked up. This means that only 25 × 6 × 4 states and 7 actions are required to be modelled, reducing the number of table entries to 4,200 compared to the original 31,360,000. The reduced MDP in effect solves Dietterich's original single-taxi problem.
The joint action policy required for the original MDP is formed by combining the actions from the reduced MDP, primed respectively with the two sets of agent states. Each taxi agent models the world by focussing on its passenger and ignores the other agent and passenger. It is rewarded for delivering its passenger.
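A sketch of this execution scheme, assuming the shared reduced Q-table from above and our own projection helpers:

    def greedy(Q, state, actions):
        # Greedy action of the reduced single-taxi MDP.
        return max(actions, key=lambda a: Q[(state, a)])

    def joint_policy(Q, s, actions):
        # Prime one shared Q-table with each agent's projected state:
        # taxi 1 sees (t1, p1, d1); taxi 2 sees (t2, p2, d2).
        t1, t2, p1, p2, d1, d2 = s
        return (greedy(Q, (t1, p1, d1), actions),
                greedy(Q, (t2, p2, d2), actions))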
Figure 2 shows the results for a multi-agent system (MAS) learner with the above abstraction using the fixed allocation policy. Fixing the allocation of the taxis to passengers is clearly suboptimal. We can do better if we learn a joint dispatch strategy as a higher level decision task, as described in the next section. For the case with collisions, the previous model of the partially observable environment from the point of view of one taxi is now inadequate. Collisions depend on the other taxi, which cannot be observed. The model is no longer Markov and the homomorphism fails. When the outcome for one agent depends on the actions of the others, it is necessary to include extra state information to try to retain the individual agent homomorphic mappings⁴. To address this situation we have included the other taxi's position (TOP) as an integral part of the state description for each agent. This is an approximation, modelling the other agent's movements as stochastic. The resulting reduced model requires the least amount of storage in our experiments with collisions and produces reasonably good results, as shown in Figure 2 MAS(TOP). The abstraction also requires us to dispatch the taxis to a fictitious passenger pickup location after they complete their mission, to avoid each taxi inadvertently blocking the other.
⁴ However, even when all information is taken into account and the individual agent models are exact, if agents behave greedily the overall abstraction may only be approximate and produce suboptimal solutions. This effect is well known as the Voter's Paradox or the Tragedy of the Commons and has consequences such as Braess' Paradox [13], where more options can reduce overall performance.
The requirement for a better dispatching policy and Dietterich's original decomposition of the single-taxi problem suggest we can do better by turning to another form of structural abstraction: hierarchical decomposition.

4.3 Task Hierarchies
A problem may be able to be decomposed into a task hierarchy in which each parent task can choose to execute any of its child subtasks. Dietterich introduced the MAXQ decomposition to compactly represent a value function over a task hierarchy with reusable subtasks [5]. The whole task hierarchy is a semi-MDP homomorphism that maps child task policies to temporally extended actions at the root node. The homomorphism is in general approximate, in that a task hierarchy may constrain the policies available in the original problem. Indeed, each parent task in the hierarchy can be interpreted as a sub-model of the problem using a semi-MDP homomorphism where child policies are mapped as the parent's abstract actions. Learnt from the bottom up, a task hierarchy implements the "frontiers of receptivity" [3] in a multi-layered learning paradigm.
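For reference, when a child policy is invoked as a temporally extended action o that runs for k primitive steps and accumulates reward r, the corresponding semi-MDP backup has the standard form below (stated generically, not as the exact MAXQ or HEXQ update; with γ = 1, as in our experiments, the discount term vanishes):

    Q(s, o) \leftarrow (1 - \alpha) Q(s, o) + \alpha \left( r + \gamma^{k} \max_{o'} Q(s', o') \right)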
One such task hierarchy decomposes the two-taxi problem into two levels of subtasks, as shown in Figure 3. The lower level navigation subtask, with states (t1, t2) and actions (a1, a2), has the objective of jointly moving the taxis to the four special locations. As the actions are joint, they can handle collisions. The joint navigation subtasks terminate when both taxis have reached their designated special locations. If the taxis are interpreted as separate agents then this model implicitly implements the concurrent action model "Tall" termination scheme [14]. The termination scheme requires agents to terminate simultaneously and is clearly suboptimal. However, forced synchrony reduces the number of sub-MDPs and abstract actions required in the hierarchy. Sixteen sub-MDPs are needed at level one for the collision-free problem and 12 with collisions, representing the various termination combinations at the four special locations for the two taxis.
The level two dispatch subtask uses states (p1, p2, d1, d2) and policies from level one, generating 108 and 144 abstract actions on the task with and without collisions respectively. These represent the combination of pickup and putdown actions upon termination of the sub-MDPs. We have used a semi-automatic version of the HEXQ algorithm [15] for our implementation. Results are labelled "H-Joint" in Table 1 and Figure 2 and show about two orders of magnitude savings in both memory and learning time, but are clearly suboptimal as expected. The HEXQ algorithm constructs the task hierarchy while learning, but its variable-wise decomposition does not produce a multi-agent decomposition. The large number of abstract actions makes execution expensive. This type of joint action hierarchy could be abstracted further by using symmetry as discussed in Section 4.1, but this was not implemented.
Our second hierarchical abstraction is shown in Figure 3 and builds on the multi-agent abstraction from the previous section. In this three-level task hierarchy the level one subtasks are the individual taxi agents performing the whole delivery task. We only use one Q-table for both taxis, as discussed in Section 4.2. The second level tasks represent the various dispatching options for the two taxis. There are four options altogether, allocating each taxi to deliver both passengers, or one or the other. In the event that one taxi is to deliver both passengers, the level two tasks learn the best order of delivery. This problem is akin to a multi-agent travelling salesman problem (TSP) in which each city needs to be visited by one agent only. A city visit corresponds to a passenger pickup and delivery. The root task simply chooses one of the dispatch strategies.

Fig. 3. Dispatcher and joint navigation task hierarchy (H-Joint) is shown in (a); (b) shows the three-level hierarchy using the multi-agent travelling salesman dispatcher and the whole delivery task at the lower level (H-MAS).
The results are depicted in Figure 2 and Table 1 as H-MAS and H-MAS(TOP). This task hierarchy gives near optimal results for the collision-free version of the problem. It is suboptimal because, due to stochastic drift, it is possible that the optimal dispatch decision should be switched prior to pickup but is locked in by the second level controller in the hierarchy. We conjecture that we could improve the results, achieving optimality in this case, by using hierarchical greedy execution (polling) [16, 5].
The H-MAS structural abstractions combine hierarchy, multi-agency and symmetry, and achieve the best resource effectiveness with and without collisions. Interestingly, storage can be reduced further by more than a factor of four, and learning time by at least a factor of two, by decomposing the level one subtask in Figure 3 into three levels as for the original taxi task [5], creating a five-level hierarchical decomposition for the two-taxi task.

4.4 Sequential Execution
The idea behind sequential execution is to "simulate" the concurrent action of the agents by time-sharing their execution. This means that the agents act one at a time as far as the learner is concerned, but are assumed to be able to execute jointly in the real world.
This is a semi-MDP homomorphism in which one joint action is mapped to a two-step temporally extended action. The advantage is that the number of actions is the sum rather than the product of the number of actions of the agents. In effect we are assuming that the problem can be approximated with a reduced action set with joint actions (a1, stay) and (stay, a2).
With and without collisions, this reduces the memory use and learning time by about a factor of four and achieves near optimal results. In combination with symmetry we achieve the multiplicative benefit of both abstractions. These results are labelled "Sequential" and "Seq.+Sym." in Table 1 and Figure 2.
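The action-set arithmetic behind this saving is easy to make concrete (the per-taxi action list is illustrative):

    TAXI_ACTIONS = ['N', 'S', 'E', 'W', 'pickup_1', 'pickup_2', 'putdown', 'stay']

    # Full joint model: the product of the agents' action counts.
    joint = [(a1, a2) for a1 in TAXI_ACTIONS for a2 in TAXI_ACTIONS]

    # Sequential execution: one agent acts per step, so the count is the sum.
    sequential = ([(a, 'stay') for a in TAXI_ACTIONS] +
                  [('stay', a) for a in TAXI_ACTIONS])
    print(len(joint), len(sequential))   # 64 16, about a factor of four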
5 Automatic Discovery of Abstractions
One of the major lessons from implementing these abstractions, even on this relatively simple example, is the difficulty in manually specifying the various decompositions. For example, the symmetry abstractions require both states and actions to be mapped together, but it is not straightforward to specify the complete homomorphic mapping in Section 4.1. Whenever the user provides an abstraction, they are effectively providing background information to the learner, and this must be manually tailored for each problem. Both these issues motivate automating the discovery of structural abstractions.
Our approach to automation would try to exploit any structure in a problem by searching for symmetry, multi-agent and hierarchical abstractions. We may, for example, be able to find sub-models that are well defined homomorphisms covering parts of the whole problem. These sub-models may be able to be found by simultaneously exploring the projected spaces defined by subsets of state-action variables and testing for Markov transitions and rewards. The sub-models become the building blocks for a search for even more abstract models at a higher level. In this way it is envisaged that a task hierarchy could be shaped from the bottom up to solve the problem efficiently.
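As a very rough sketch of what such a test might look like, one could compare a candidate projection's empirical next-state distribution with the same distribution conditioned on a left-out variable; a large discrepancy is evidence that the projection is not Markov. Everything below (the interfaces, the crude L1 threshold) is our own illustration, not an algorithm from this paper:

    from collections import Counter, defaultdict

    def projection_looks_markov(transitions, project, hidden, tol=0.05):
        # transitions: iterable of (s, a, s_next) sampled at the base level.
        # project(s): candidate abstract state; hidden(s): left-out variable.
        pooled = defaultdict(Counter)   # (ps, a)         -> Counter over ps'
        split = defaultdict(Counter)    # (ps, a, hidden) -> Counter over ps'
        for s, a, s2 in transitions:
            ps, ps2 = project(s), project(s2)
            pooled[(ps, a)][ps2] += 1
            split[(ps, a, hidden(s))][ps2] += 1
        for (ps, a, h), counts in split.items():
            n, n_pooled = sum(counts.values()), sum(pooled[(ps, a)].values())
            for ps2, c in pooled[(ps, a)].items():
                # If conditioning on the hidden variable shifts the empirical
                # transition probabilities, the projection is not Markov.
                if abs(counts[ps2] / n - c / n_pooled) > tol:
                    return False
        return True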
6 Discussion and Future Work
This paper presents classes of abstraction useful for scaling up RL and points towards promising future work in two directions: the manual use of structural abstractions, and their automatic discovery. Our experimental results show that the performance gains of single abstractions can be further improved through their combination. Our best results combine symmetry, task hierarchy and multi-agency. We have already indicated a clear opportunity for improving these results in Section 4.3 and believe there are further model reduction and scaling opportunities available. The model of the other agent (TOP) in our implementation uses a global position variable. We conjecture that it is possible to parsimoniously use only a "window of influence" for the TOP model, possibly a deictic representation, which would reduce memory requirements significantly. Further, given the correspondence to multi-agent TSP, this logistics problem should scale tractably with good results using many vehicles and cargo in much larger simulations.
The expected gains from abstractions depend on the degree of system coupling. In the experiments we found that increasing interdependency reduces the number of reachable states because of the added constraints, but also creates less opportunity for abstraction.
The experiments and foregoing reasoning give some hope that we can scale RL to larger problems by using manually defined abstractions. The unifying framework of treating model reduction as a homomorphism suits and supports structural abstraction well. Our experiments also indicate the feasibility of developing methods for automatic discovery of abstractions, as outlined in Section 5. We are currently exploring both of these approaches in complex domains such as battle-space simulations.
References
1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts (1998)
2. Clark, A., Thornton, C.: Trading spaces: Computation, representation, and the limits of uninformed learning. Behavioral and Brain Sciences 20 (1997) 57–66
3. Utgoff, P.E., Stracuzzi, D.J.: Many-layered learning. In: Neural Computation, MIT Press Journals (2002)
4. Ashby, R.: Introduction to Cybernetics. Chapman & Hall, London (1956)
5. Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13 (2000) 227–303
6. Watkins, C.J.C.H.: Learning from Delayed Rewards. PhD thesis, King's College (1989)
7. Ravindran, B., Barto, A.G.: SMDP homomorphisms: An algebraic approach to abstraction in semi-Markov decision processes. In: Proc. of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 03). (2003) 1011–1018
8. Ravindran, B., Barto, A.G.: Model minimization in hierarchical reinforcement learning. In: Fifth Symposium on Abstraction, Reformulation and Approximation (SARA 2002). LNCS, Springer Verlag (2002) 196–211
9. Dean, T., Givan, R.: Model minimization in Markov decision processes. In: AAAI/IAAI. (1997) 106–111
10. Givan, R., Leach, S.M., Dean, T.: Bounded-parameter Markov decision processes. Artificial Intelligence 122 (2000) 71–109
11. Crites, R.H., Barto, A.G.: Elevator group control using multiple reinforcement learning agents. Machine Learning 33 (1998) 235–262
12. Wolpert, D., Tumer, K.: An introduction to collective intelligence. Technical Report NASA-ARC-IC-99-63, NASA Ames Research Center, CA (1999)
13. Braess, D.: Über ein Paradoxon der Verkehrsplanung. Unternehmensforschung 12 (1968) 258–268
14. Rohanimanesh, K., Mahadevan, S.: Learning to take concurrent actions. In: NIPS. (2002) 1619–1626
15. Hengst, B.: Discovering hierarchy in reinforcement learning with HEXQ. In Sammut, C., Hoffmann, A., eds.: Proceedings of the Nineteenth International Conference on Machine Learning, Morgan-Kaufmann (2002) 243–250
16. Kaelbling, L.P.: Hierarchical learning in stochastic domains: Preliminary results. In: Machine Learning Proceedings of the Tenth International Conference, San Mateo, CA, Morgan Kaufmann (1993) 167–173