
Structural Abstraction Experiments in Reinforcement Learning


Robert Fitch¹, Bernhard Hengst¹, Dorian Šuc¹, Greg Calbert², and Jason Scholz²

¹ National ICT Australia, University of NSW, Australia
{robert.fitch,bernhard.hengst,dorian.suc}@nicta.com.au
² Defence Science and Technology Organization, Salisbury SA, Australia
{greg.calbert,jason.scholz}@dsto.defence.gov.au

Abstract. A challenge in applying reinforcement learning to large problems is how to manage the explosive increase in storage and time complexity. This is especially problematic in multi-agent systems, where the state space grows exponentially in the number of agents. Function approximation based on simple supervised learning is unlikely to scale to complex domains on its own, but structural abstraction that exploits system properties and problem representations shows more promise. In this paper, we investigate several classes of known abstractions: 1) symmetry, 2) decomposition into multiple agents, 3) hierarchical decomposition, and 4) sequential execution. We compare memory requirements, learning time, and solution quality empirically in two problem variations. Our results indicate that the most effective solutions come from combinations of structural abstractions, and encourage development of methods for automatic discovery in novel problem formulations.

1 Introduction

When specifying a problem such as learning to walk, learning to manipulate objects or learning to play a game as a reinforcement learning (RL) problem, the number of states and actions is often too large for the learner to manage. It is straightforward to describe the state of a system using several variables to represent the various attributes of its components. Similarly, individual system component actions can be used to describe a joint action vector. However, this approach very quickly leads to an intractable specification of the RL problem. A tabular state-action representation requires storage proportional to the product of the size of all the state and action variables, leading to intractable storage and time complexity.

Function approximation can often help by generalizing the value function across many states. Function approximation is based on supervised learning. Gradient descent methods, such as artificial neural networks and linear function approximation, are frequently used in RL for this purpose [1]. However, there are reasons to believe that simple function approximation will not scale to larger, more complex problems.

Some problems may have structure and regularities that are not readily apparent and are difficult to learn in one step [2]. Learning may be made tractable by the right problem decomposition and by learning in stages, reusing and combining previously learnt concepts. Learning in complex environments proceeds at the "frontier of receptivity", where new tasks can only be learnt once component child tasks have been learnt [3].

Our objective in this paper is to explore several structural abstraction methods and to study their effect on storage requirements, learning time and the quality of the solution for one problem with two different levels of inter-dependency. The contribution of this paper is to give a systematic description and experimental evaluation of a collection of known structural abstractions that can be used in a wide variety of RL problems. Our motivation comes partly from the desire to develop tractable decision systems for complex multi-agent games, and partly from the wish to gain insight for automating the modelling process.

Our experiments in this paper are based on a task that requires two taxis acting together to pick up and deliver two passengers on the 5×5 grid. This task is chosen because it is intuitive and small enough so that we can still compute the optimal solution, yet it is complex enough to make it challenging to demonstrate various abstractions.

The formalisms underpinning the experiments in this paper are those of Markov decision problems (MDPs) and model reductions expressed as homomorphic mappings. The latter are important in establishing whether one system is a model of another and which properties of the original system the model retains [4]. The abstractions we will explore are symmetry reductions, multi-agent decompositions, hierarchical decompositions, sequential execution and some combinations.

In the rest of this paper we will first introduce the two-taxi task as our working example and summarize the formalisms. We then in turn describe the various structural abstractions of the problem, including related work. In each case we discuss the computing requirements in relation to solution quality for the taxi task and comment on the scaling potential.

2 The Two-Taxi Problem

Taxi problems have been used by several researchers in related work. They have analogues in other problems, such as logistic problems involving transporters and cargo. We have taken two taxis [5] working jointly to pick up and deliver two passengers on the 5×5 grid shown in Figure 1.

The objective is to deliver both passengers in the minimum number of steps. At each time step a taxi can take a navigation action and attempt a move one position North, South, East or West. These actions are stochastic. With 92.5% probability they move the taxi in the intended direction and with 7.5% probability they move the taxi randomly in one of the other three directions. Other actions available are to pick up passenger one, pick up passenger two, put down its passenger or stay. Taxis can only carry one passenger at a time.

[Fig. 1. The two-taxi problem: a 5×5 grid with special locations R, G, Y, B.]

A move into a barrier or the grid boundary is counted as a step and results in the taxi staying where it is. There are four special locations marked R, G, Y, B that represent the possible source and destination locations of the passengers. At the commencement of every episode each passenger is located at random on one of the special locations and each taxi is located at random anywhere on the grid. Each passenger must be picked up by one of the taxis and put down at his or her destination location, even if the source and destination locations are the same. If a passenger is waiting to be picked up and both taxis are at this location, then a joint action by both taxis to pick up the passenger results in only taxi one being successful.

We consider two versions of the problem, with and without taxi collisions. Without collisions the taxis are able to occupy the same cell and pass each other without impediment. With collisions the taxis cannot occupy the same cell, nor can they pass each other in the sense that they cannot occupy each other's former cell following a joint action. A move resulting in a collision leaves the taxi locations unchanged. Initial taxi positions on the same cell are disallowed with collisions, although passengers can have the same pickup and destination locations.
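For concreteness, the navigation dynamics above can be sketched in a few lines of Python. This is only an illustrative sketch: the coordinate scheme and the encoding of barriers as a set of blocked (from, to) cell pairs are our own assumptions, not from the paper.

```python
import random

MOVES = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}

def nav_step(pos, direction, barriers, rng=random):
    """One stochastic navigation step on the 5x5 grid: 92.5% in the intended
    direction, 7.5% uniformly among the other three directions. A move into a
    barrier or the boundary counts as a step but leaves the taxi in place."""
    if rng.random() >= 0.925:
        direction = rng.choice([d for d in MOVES if d != direction])
    dx, dy = MOVES[direction]
    nxt = (pos[0] + dx, pos[1] + dy)
    # Barriers are a hypothetical set of blocked (from, to) cell pairs;
    # a wall between two cells would appear once for each direction.
    blocked = not (0 <= nxt[0] < 5 and 0 <= nxt[1] < 5) or (pos, nxt) in barriers
    return pos if blocked else nxt
```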

3 Modelling Formalisms

We now introduce RL, its application to the two-taxi task, and abstractions formalized as homomorphisms.

3.1 RL Problem Formulation

Underlying RL is the Markov decision problem framework. We consider a system modelled at a base level by a set of states. At each time-step the system transitions from state to state depending on an input action to the system and generating a real-valued reward. We can describe a Markov decision problem (MDP) as a tuple ⟨S, A, T, R, S₀⟩ where S is the finite set of states, A is the finite set of actions, T : S × A × S → [0, 1] is the transition function giving the probability of moving from one state to another after taking a particular action, R : S × A × ℝ → [0, 1] is the probability of receiving a reward when taking an action, and S₀ : S → [0, 1] is the starting state distribution giving the probability of starting in each state. The Markov property requires that the states and actions can be defined so that the probabilities are independent of the history of transitions and rewards. In addition we assume the probabilities are independent of time, i.e. stationary. An MDP includes an optimality criterion, usually to maximize a function of the sum of future rewards. The objective is to find an optimal policy π : S → A providing the action to take in any state to optimize the value function. A semi-MDP is a generalization of an MDP where actions can be extended, taking several time steps to complete.

Specifying the two-taxi problem above as an MDP, we proceed to define the state s ∈ S by the variables (t1, t2, p1, p2, d1, d2), where t1 and t2 each specify the respective taxi location as one of the 25 grid values, p1 and p2 indicate the location of the passengers at either one of the four pickup locations, riding in either taxi, or delivered, and d1 and d2 each specify one of the four destinations of the respective passengers. The total number of possible collision-free states is therefore 25 × 25 × 7 × 7 × 4 × 4 = 490,000. The actions a ∈ A are defined by the variables (a1, a2) representing the simultaneous joint actions of the two taxis. There are eight actions available per taxi, giving a joint action set of 64. At each time step we specify a joint reward of −1 until both passengers are delivered. Maximizing the sum of future rewards achieves our objective to minimize the number of steps to complete the task.
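The storage arithmetic behind these figures is easy to check; a quick sketch:

```python
GRID = 25   # each taxi in one of 5 x 5 = 25 cells
PASS = 7    # waiting at R, G, Y or B; riding in taxi 1 or 2; or delivered
DEST = 4    # destination is one of R, G, Y, B
ACTS = 8    # four moves, two pickups, putdown, stay

n_states = GRID**2 * PASS**2 * DEST**2   # 490,000 collision-free states
n_joint_actions = ACTS**2                # 64 joint actions
print(n_states * n_joint_actions)        # 31,360,000 tabular Q-entries
```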

We assume the learner can fully observe the state, but that the transition and reward functions are not known beforehand. We use simple Q-learning [6] to learn the optimal policy as our base case. Q-learning involves updating an action-value function stored as a Q-table with one entry for each state-action pair. The update performed after each step is called a Bellman backup and is given by

Q(s, a) ← (1 − α)Q(s, a) + α(r + γ max_{a′} Q(s′, a′))

where the system has transitioned from s to s′ after taking action a and receiving reward r. The discount rate for future rewards is γ, set to 1.0 in our experiments. The learning rate is α. It is well known that given sufficient exploration of all states and actions and suitable reduction in the learning rate, the action-value function will converge to its optimal value, Q*, and the optimal policies are π*(s) = argmax_a Q*(s, a). While there are many speedup techniques such as prioritized sweeping and eligibility traces, we have used this simple form of RL with α = 0.1 in all our experiments for consistency. Our exploration policy is ε-greedy with ε generally set to 10%. All performance is measured using greedy policies. We are primarily interested in comparing learning efficiency with various model abstractions and have found that similar relationships hold for any setting of α and adequate exploration.
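The update and the exploration policy follow directly from the equation above; a minimal sketch, in which the dictionary-based Q-table and the state/action encodings are our own illustrative assumptions:

```python
import random

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0):
    """One Bellman backup: Q(s,a) <- (1-alpha)Q(s,a) + alpha(r + gamma max_a' Q(s',a'))."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)

def epsilon_greedy(Q, s, actions, eps=0.1, rng=random):
    """The paper's exploration policy: epsilon-greedy with eps = 10%."""
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```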

3.2 Abstractions as Homomorphisms

The idea of an abstraction is to simplify the representation of a system in such a way that the problem of interest can be solved more efficiently. Ideally we would like a model to perfectly preserve the system characteristics of interest, so that we only need to solve the simplified model and the solution is guaranteed to be a solution for the original system. A useful algebraic formalism for modelling abstractions is a homomorphism. We are particularly interested in homomorphisms in the framework of MDPs and semi-MDPs. A homomorphic mapping from a system to a model means that if we perform an action or temporally extended action in a system state, then the resultant model state is the same whether we map the resultant system state to the model, or whether we first map the system state and (extended) action to the model and perform the model action. State transition homomorphisms by themselves do not guarantee that the reduced model is relevant to the problem at hand. An MDP homomorphism ensures relevancy by also mapping the reward function and thus preserving the implicit goal specification. A useful exposition of MDPs and semi-MDPs is given by Ravindran [7].

4 Abstraction Classes

We study several exact and approximate model abstractions (homomorphisms) and apply them to the two-taxi problem. The results are shown in Table 1 and Figure 2 and explained in this section.

The base case Q-table size for the collision-free two-taxi problem is 490,000 states × 64 actions = 31,360,000 entries. With collisions the base case Q-table size is 30,105,600, as taxis cannot occupy the same state. We have calculated the optimal solution using dynamic programming for both base cases. Q-learning on the base case takes over four billion time steps to converge. We have defined convergence to be the number of time-steps required for the mean of the results to be within 5% of its final value. Q-learning oscillates slightly short of optimal performance with the learning rate α fixed at 0.1 (see Table 1, "base case"). For the base case and other abstractions we have found that the mean converged average steps per episode over multiple runs has a standard deviation of less than 0.06.

4.1 Symmetry

Symmetries in MDPs arise when states, actions or a combination of both can be mapped to an equivalent reduced MDP that has fewer states and actions. For example, if two actions have identical transition and reward probability functions between any two states, then one action can be eliminated. An example of a problem with state symmetry is learning to exit similar rooms that differ only in color. More general symmetries often arise where state-action pairs are mapped together, both with different states and actions from the original in the reduced model [8]. We call these forms of symmetry simple state-action symmetries. Another class of symmetry that overlaps the class of simple state-action symmetry is bisimulation homogeneity [9]. In bisimulation homogeneity the states of a problem can be partitioned into blocks such that the inter-block transition probabilities and reward probabilities are constant for all actions.

[Fig. 2. Two-taxi task results: without collisions (top) and with collisions (bottom).]

An example of model reduction using this type of symmetry is the elimination of an irrelevant random variable in a state description.

The taxi problem exhibits simple state-action symmetry, assuming homogeneity of taxis and passengers. In one form of symmetry, two different navigation actions by the taxis can be interchanged without affecting the solution whenever both taxis occupy the same location. If the taxis are at different locations and they are interchanged, the two situations are also symmetrical and can be mapped to the same abstract state. The second mapping will not reduce the states if both taxis are on the same cell, a situation that can only arise in the collision-free problem. Similarly, swapping passenger-destination pairs can be exploited as a state-action symmetry abstraction, as can swapping both taxis and passengers. Specifically, state (t1, t2, p1, p2, d1, d2) with action (a1, a2) is symmetrically equivalent to (t2, t1, p1, p2, d1, d2) with action (a2, a1), where p1 and p2 need to be adjusted whenever the passengers are in the taxis. Similar homomorphic state-action mappings can be defined for a passenger swap and for both a taxi and passenger swap, providing a 4-to-1 state reduction mapping when taxi and passenger-destination locations differ.
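One way such swap symmetries can be exploited in practice is to map every experienced state-action pair to a canonical representative before indexing the Q-table; a sketch follows. The passenger-state encoding is hypothetical, and for brevity the sketch omits the relabelling of the pickup-passenger-1/2 actions that a complete passenger swap would require:

```python
IN_T1, IN_T2 = 4, 5   # hypothetical passenger codes: riding in taxi 1 / taxi 2

def taxi_swap(sa):
    """Interchange the taxi labels; a passenger riding in a taxi follows it."""
    (t1, t2, p1, p2, d1, d2), (a1, a2) = sa
    fix = {IN_T1: IN_T2, IN_T2: IN_T1}
    return (t2, t1, fix.get(p1, p1), fix.get(p2, p2), d1, d2), (a2, a1)

def passenger_swap(sa):
    """Interchange the passenger labels (a complete version would also swap
    the pickup-passenger-1 and pickup-passenger-2 action codes)."""
    (t1, t2, p1, p2, d1, d2), (a1, a2) = sa
    return (t1, t2, p2, p1, d2, d1), (a1, a2)

def canonical(sa):
    """Map a (state, action) pair to the smallest of its four symmetric
    variants, so all four share one Q-table entry: the 4-to-1 reduction
    when taxi and passenger-destination locations differ."""
    ts = taxi_swap(sa)
    return min(sa, ts, passenger_swap(sa), passenger_swap(ts))
```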

The above abstraction reduces the two-taxi problem without collisions to 131,950 states, compared to 490,000 for the base case. The approximately 3.7-fold reduction in the number of states in the reduced model leads to a commensurate speedup in learning, as illustrated in Figure 2.

Table 1. Memory, learning times and performance of various abstractions on the two-taxi tasks for collision free (left) and with collisions (right). Both cases use the same abstractions, except with MAS and H-MAS, where in the case with collisions the position of the other taxi (TOP) is also included in the state.

             --------- collision free ----------   --------- with collisions ---------
             memory    steps to    converged ave.  memory    steps to    converged ave.
             (|Q|)     converge    steps/episode   (|Q|)     converge    steps/episode
optimal      3.1×10⁷   dp          14.06           3.0×10⁷   dp          14.46
base case    3.1×10⁷   4.4×10⁹     14.46           3.0×10⁷   4.2×10⁹     14.80
symmetry     8.4×10⁶   1.5×10⁹     14.45           7.2×10⁶   1.3×10⁹     14.82
MAS          4.2×10³   7×10⁵       15.18           1.0×10⁵   9×10⁶       16.90
H-Joint      2.7×10⁵   3.4×10⁷     16.56           2.0×10⁵   4.6×10⁷     20.78
H-MAS        5.7×10³   9×10⁵       14.24           1.1×10⁵   1×10⁷       15.41
Sequential   7.8×10⁶   1.2×10⁹     14.26           7.5×10⁶   1.1×10⁹     14.76
Seq.+Sym.    2.1×10⁶   3.9×10⁸     14.27           1.8×10⁶   3.5×10⁸     14.75

The symmetries also apply to the taxi task with collisions, with similar convergence speedup results. These state and action abstractions are an exact homomorphism of the original MDP and therefore converge to an optimal policy of the original problem.

For the general collision-free task involving n taxis on a size-k grid with m passengers and l special locations, the number of states is kⁿ · (l + n + 1)ᵐ · lᵐ. This can be shown to abstract to C(k + n − 1, n) · C((l + n + 1)l + m − 1, m) states³ when taxis and passengers are indistinguishable.
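A quick numerical check of this count, using Python's math.comb for the binomial coefficients:

```python
from math import comb

def abstract_state_count(n_taxis, k_cells, m_pass, l_locs):
    """C(k+n-1, n) * C((l+n+1)l+m-1, m): multiset coefficients counting
    indistinguishable taxi positions and passenger-destination pairs."""
    taxis = comb(k_cells + n_taxis - 1, n_taxis)
    passengers = comb((l_locs + n_taxis + 1) * l_locs + m_pass - 1, m_pass)
    return taxis * passengers

# n = 2 taxis, k = 25 cells, m = 2 passengers, l = 4 special locations:
print(abstract_state_count(2, 25, 2, 4))   # 131950, down from 490,000
```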

An approach to generalizing exact symmetries is to consider approximate symmetries. A bounded parameter Markov decision process [10] can be used to bound the error when inter-block transition and reward probabilities are not constant but constrained to a range of values. Symmetries will only take us so far in model reduction. We will now turn to another type of structural abstraction.

4.2 Multiagency

Modelling a system as a simultaneous interaction of multiple autonomous agents can be a significant abstraction [11]. When a problem is specified using a multi-dimensional action description, it is natural to attempt this type of decomposition.

In multi-agency the overall homomorphism is approximated by combining independent homomorphic mappings for each agent. One of the major issues is the abstraction of the global reward signal for each agent so that their combined behavior is in the common good [12]. If the actions of all the agents are independent and the global reward is the sum of the rewards for each agent, then the multi-agent homomorphism is exact and will produce an optimal solution.

³ C(n, k) (n choose k) = n!/(k!(n − k)!)

This total decomposition is rare for interesting problems and does not hold for either version of the two-taxi task. For optimal performance the taxis need a coordinated strategy for picking up passengers and need to avoid each other when collision is possible.

One homomorphic approximation arbitrarily pre-allocates one passenger to each taxi. One mapping is:

h((t1, t2, p1, p2, d1, d2) →(a1,a2)) = (t1, p1, d1) →a1 , (t2, p2, d2) →a2
                                     = (t, p, d) →a , (t, p, d) →a    (using symmetry)

where s →a denotes taking action a in state s.

Each agent receives the global reward. Effectively we have defined two reduced MDPs (one for each agent) that together comprise the two-taxi problem. When the two agents are identical we can additionally apply symmetry from Section 4.1 and use only one MDP for both agents. This not only saves storage space, but transfers learning between the two agents. The actions per taxi can be reduced by one, as they now do not need to specify which passenger is to be picked up. This means that only 25 × 6 × 4 states and 7 actions are required to be modelled, reducing the number of table entries to 4,200 compared to the original 31,360,000. The reduced MDP in effect solves Dietterich's original single taxi problem.
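The per-agent projection is essentially a one-liner; a sketch with a hypothetical state layout:

```python
def agent_view(joint_state, agent):
    """Project the joint state onto one taxi's reduced MDP under the fixed
    allocation (taxi i is pre-allocated passenger i); with the Section 4.1
    symmetry, both agents can share the single projected Q-table."""
    t1, t2, p1, p2, d1, d2 = joint_state
    return (t1, p1, d1) if agent == 1 else (t2, p2, d2)

# Reduced table: 25 cells x 6 passenger states x 4 destinations x 7 actions.
assert 25 * 6 * 4 * 7 == 4200
```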

The joint action policy required for the original MDP is formed by combining the actions from the reduced MDP, primed respectively with the two sets of agent states. Each taxi agent models the world by focussing on its passenger and ignores the other agent and passenger. It is rewarded for delivering its passenger.

Figure 2 shows the results for a multi-agent system (MAS) learner with the above abstraction using the fixed allocation policy. Fixing the allocation of the taxis to passengers is clearly suboptimal. We can do better if we learn a joint dispatch strategy as a higher level decision task, as described in the next section. For the case with collisions, the previous model of the partially observable environment from the point of view of one taxi is now inadequate. Collisions depend on the other taxi, which cannot be observed. The model is no longer Markov and the homomorphism fails. When the outcome for one agent depends on the actions of the others, it is necessary to include extra state information to try to retain the individual agent homomorphic mappings⁴. To address this situation we have included the other taxi's position (TOP) as an integral part of the state description for each agent. This is an approximation, modelling the other agent's movements as stochastic. The resulting reduced model requires the least amount of storage in our experiments with collisions and produces reasonably good results, as shown in Figure 2, MAS(TOP). The abstraction also requires us to dispatch the taxis to a fictitious passenger pickup location after they complete their mission, to avoid each taxi inadvertently blocking the other.

⁴ However, even when all information is taken into account and the individual agent models are exact, if agents behave greedily the overall abstraction may only be approximate and produce suboptimal solutions. This effect is well known as the Voter's Paradox or the Tragedy of the Commons, and has consequences such as Braess' Paradox [13], where more options can reduce overall performance.
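A sketch of the corresponding MAS(TOP) projection, again with hypothetical encodings:

```python
def agent_view_top(joint_state, agent):
    """The MAS(TOP) variant used with collisions: append the other taxi's
    position so collisions become predictable, while still treating the
    other agent's motion as stochastic."""
    t1, t2, p1, p2, d1, d2 = joint_state
    return (t1, p1, d1, t2) if agent == 1 else (t2, p2, d2, t1)
```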

The requirement for a better dispatching policy and Dietterich's original decomposition of the single taxi problem suggest we can do better by turning to another form of structural abstraction: hierarchical decomposition.

4.3 Task Hierarchies

A problem may be able to be decomposed into a task hierarchy in which each parent task can choose to execute any of its child subtasks. Dietterich introduced the MAXQ decomposition to compactly represent a value function over a task hierarchy with reusable subtasks [5]. The whole task hierarchy is a semi-MDP homomorphism that maps child task policies to temporally extended actions at the root node. The homomorphism is in general approximate, in that a task hierarchy may constrain the policies available in the original problem. Indeed, each parent task in the hierarchy can be interpreted as a sub-model of the problem using a semi-MDP homomorphism where child policies are mapped as the parent's abstract actions. Learnt from the bottom up, a task hierarchy implements the "frontiers of receptivity" [3] in a multi-layered learning paradigm.
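A parent task treating child policies as temporally extended actions can be updated with a standard semi-MDP backup. The sketch below shows that generic SMDP Q-learning update, not the MAXQ decomposition itself, and the names are illustrative:

```python
def smdp_q_update(Q, s, option, r_sum, tau, s_next, options, alpha=0.1, gamma=1.0):
    """Semi-MDP backup for a temporally extended action (option) that ran
    for tau primitive steps and accumulated reward r_sum. With the paper's
    gamma = 1.0 the gamma**tau factor is simply 1."""
    best = max(Q.get((s_next, o), 0.0) for o in options)
    Q[(s, option)] = (1 - alpha) * Q.get((s, option), 0.0) \
                     + alpha * (r_sum + gamma**tau * best)
```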

One such task hierarchy decomposes the two-taxi problem into two levels of subtasks, as shown in Figure 3. The lower level navigation subtask, with states (t1, t2) and actions (a1, a2), has the objective of jointly moving the taxis to the four special locations. As the actions are joint, they can handle collisions. The joint navigation subtasks terminate when both taxis have reached their designated special locations. If the taxis are interpreted as separate agents, then this model implicitly implements the concurrent action model T_all termination scheme [14]. The termination scheme requires agents to terminate simultaneously and is clearly suboptimal. However, forced synchrony reduces the number of sub-MDPs and abstract actions required in the hierarchy. Sixteen sub-MDPs are needed at level one for the collision-free problem and 12 with collisions, representing the various termination combinations at the four special locations for the two taxis.

The level two dispatch subtask uses states (p1, p2, d1, d2) and policies from level one, generating 108 and 144 abstract actions on the task with and without collisions respectively. These represent the combination of pickup and putdown actions upon termination of the sub-MDPs. We have used a semi-automatic version of the HEXQ algorithm [15] for our implementation. Results are labelled "H-Joint" in Table 1 and Figure 2 and show about two orders of magnitude savings in both memory and learning time, but are clearly suboptimal as expected. The HEXQ algorithm constructs the task hierarchy while learning, but its variable-wise decomposition does not produce a multi-agent decomposition. The large number of abstract actions makes execution expensive. This type of joint action hierarchy could be abstracted further by using symmetry as discussed in Section 4.1, but this was not implemented.

Our second hierarchical abstraction is shown in Figure 3 and builds on the multi-agent abstraction from the previous section. In this three-level task hierarchy the level one subtasks are the individual taxi agents performing the whole delivery task. We only use one Q-table for both taxis, as discussed in Section 4.2.

[Fig. 3. Dispatcher and joint navigation task hierarchy (H-Joint) is shown in (a); (b) shows the three-level hierarchy using a multi-agent travelling salesman dispatcher and the whole delivery task at the lower level (H-MAS). Diagram labels: in (a), a level two dispatcher over state (p1, d1, p2, d2) issues (special location, pickup/putdown) abstract actions to a level one joint navigation task over state (t1, t2) with primitive joint actions (a1, a2); in (b), a level three root selects a dispatch strategy, level two holds the taxi-to-passenger dispatch allocations, and level one is the whole pickup-and-delivery task over state (t, p, d), or (t, p, d, top) with collisions.]

The second level tasks represent the various dispatching options for the two taxis. There are four options altogether, allocating each taxi to deliver both passengers or one or the other. In the event that one taxi is to deliver both passengers, the level two tasks learn the best order of delivery. This problem is akin to a multi-agent travelling salesman problem (TSP) in which each city needs to be visited by one agent only. A city visit corresponds to a passenger pickup and delivery. The root task simply chooses one of the dispatch strategies.
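The four dispatch strategies can be enumerated directly; the names below are hypothetical:

```python
# The four level-two dispatch strategies: each passenger is allocated to a
# taxi; when one taxi gets both passengers, that subtask also learns the
# delivery order (the mini travelling-salesman subproblem).
DISPATCH_OPTIONS = (
    {"pass1": "taxi1", "pass2": "taxi2"},
    {"pass1": "taxi2", "pass2": "taxi1"},
    {"pass1": "taxi1", "pass2": "taxi1"},
    {"pass1": "taxi2", "pass2": "taxi2"},
)
```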

The results are depicted in Figure 2 and Table 1 as H-MAS and H-MAS(TOP). This task hierarchy gives near optimal results for the collision-free version of the problem. It is suboptimal because, due to stochastic drift, it is possible that the optimal dispatch decision should be switched prior to pickup but is locked in by the second level controller in the hierarchy. We conjecture that the results could be improved, achieving optimality in this case, by using hierarchical greedy execution (polling) [16, 5].

The H-MAS structural abstractions combine hierarchy, multi-agency and symmetry, and achieve the best resource effectiveness with and without collisions. Interestingly, storage can be reduced further by more than a factor of four, and learning time by at least a factor of two, by decomposing the level one subtask in Figure 3 into three levels as for the original taxi task [5], creating a five-level hierarchical decomposition for the two-taxi task.

4.4 Sequential Execution

The idea behind sequential execution is to "simulate" the concurrent action of the agents by time-sharing their execution. This means that the agents act one at a time as far as the learner is concerned, but are assumed to be able to execute jointly in a real world.

This is a semi-MDP homomorphism in which one joint action is mapped to a two-step temporally extended action. The advantage is that the number of actions is the sum rather than the product of the number of actions of the agents. In effect we are assuming that the problem can be approximated with a reduced action set with joint actions (a1, stay) and (stay, a2).
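The reduced action set is easy to construct; a sketch of the sum-versus-product contrast:

```python
ACTIONS = ["N", "S", "E", "W", "pick1", "pick2", "put", "stay"]

# Full joint action set: the product, 8 x 8 = 64 joint actions.
joint_actions = [(a1, a2) for a1 in ACTIONS for a2 in ACTIONS]

# Sequentialized set: one taxi acts while the other stays, so the learner
# sees the sum, 8 + 8 = 16 actions, each a two-step extended action.
sequential_actions = [(a, "stay") for a in ACTIONS] + \
                     [("stay", a) for a in ACTIONS]
```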

With and without collisions, this reduces the memory use and learning time by about a factor of four and achieves near optimal results. In combination with symmetry we achieve the multiplicative benefit of both abstractions. These results are labelled "Sequential" and "Seq.+Sym." in Table 1 and Figure 2.

5 Automatic Discovery of Abstractions

One of the major lessons from implementing these abstractions, even on this relatively simple example, is the difficulty of manually specifying the various decompositions. For example, the symmetry abstractions require both states and actions to be mapped together, but it is not straightforward to specify the complete homomorphic mapping in Section 4.1. Whenever the user provides an abstraction, they are effectively providing background information to the learner, and this must be manually tailored for each problem. Both these issues motivate automating the discovery of structural abstractions.

Our approach to automation would try to exploit any structure in a problem by searching for symmetry, multi-agent and hierarchical abstractions. We may, for example, be able to find sub-models that are well defined homomorphisms covering parts of the whole problem. These sub-models may be able to be found by simultaneously exploring the projected spaces defined by subsets of state-action variables and testing for Markov transitions and rewards. The sub-models become the building blocks for a search for even more abstract models at a higher level. In this way it is envisaged that a task hierarchy could be shaped from the bottom up to solve the problem efficiently.
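One crude version of such a Markov test, sketched under our own assumptions about the trajectory format, compares next-state distributions in the projected space conditioned on the full underlying state:

```python
from collections import defaultdict

def markov_violation(trajectory, proj):
    """Crude screen for whether a state projection is Markov. `trajectory`
    is a list of (state, action, next_state) triples and `proj` maps a full
    state to its projection (e.g. lambda s: s[0] for taxi-1 position only).
    For each projected (state, action) pair we compare next-projected-state
    frequencies conditioned on the full state: a result near 0 suggests the
    projected-away variables are irrelevant; a large value rules the
    projection out as a standalone sub-model."""
    counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
    for s, a, s2 in trajectory:
        counts[(proj(s), a)][s][proj(s2)] += 1
    worst = 0.0
    for by_full_state in counts.values():
        dists = [{k: v / sum(d.values()) for k, v in d.items()}
                 for d in by_full_state.values()]
        for key in set().union(*dists):
            vals = [d.get(key, 0.0) for d in dists]
            worst = max(worst, max(vals) - min(vals))
    return worst
```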

6 Discussion and Future Work

This paper presents classes of abstraction useful for scaling up RL and points towards promising future work in two directions: the manual use of structural abstractions, and their automatic discovery. Our experimental results show that the performance gains of single abstractions can be further improved through their combination. Our best results combine symmetry, task hierarchy and multi-agency. We have already indicated a clear opportunity for improving these results in Section 4.3 and believe there are further model reduction and scaling opportunities available. The model of the other agent (TOP) in our implementation uses a global position variable. We conjecture that it is possible to parsimoniously use only a "window of influence" for the TOP model, possibly a deictic representation, that would reduce memory requirements significantly. Further, given the correspondence to multi-agent TSP, this logistic problem should scale tractably with good results using many vehicles and cargo in much larger simulations.

The expected gains from abstractions depend on the degree of system coupling. In the experiments we found that increasing interdependency reduces the number of reachable states because of the added constraints, but also creates less opportunity for abstraction.

The experiments and foregoing reasoning give some hope that we can scale RL to larger problems by using manually defined abstractions. The unifying framework of treating model reduction as a homomorphism suits and supports structural abstraction well. Our experiments also indicate the feasibility of developing methods for automatic discovery of abstractions, as outlined in Section 5. We are currently exploring both of these approaches in complex domains such as battle-space simulations.

References

1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts (1998)
2. Clark, A., Thornton, C.: Trading spaces: Computation, representation, and the limits of uninformed learning. Behavioral and Brain Sciences 20 (1997) 57–66
3. Utgoff, P.E., Stracuzzi, D.J.: Many-layered learning. In: Neural Computation, MIT Press Journals (2002)
4. Ashby, R.: Introduction to Cybernetics. Chapman & Hall, London (1956)
5. Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13 (2000) 227–303
6. Watkins, C.J.C.H.: Learning from Delayed Rewards. PhD thesis, King's College (1989)
7. Ravindran, B., Barto, A.G.: SMDP homomorphisms: An algebraic approach to abstraction in semi-Markov decision processes. In: Proc. of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI 03). (2003) 1011–1018
8. Ravindran, B., Barto, A.G.: Model minimization in hierarchical reinforcement learning. In: Fifth Symposium on Abstraction, Reformulation and Approximation (SARA 2002). LNCS, Springer Verlag (2002) 196–211
9. Dean, T., Givan, R.: Model minimization in Markov decision processes. In: AAAI/IAAI. (1997) 106–111
10. Givan, R., Leach, S.M., Dean, T.: Bounded-parameter Markov decision processes. Artificial Intelligence 122 (2000) 71–109
11. Crites, R.H., Barto, A.G.: Elevator group control using multiple reinforcement learning agents. Machine Learning 33 (1998) 235–262
12. Wolpert, D., Tumer, K.: An introduction to collective intelligence. Technical Report NASA-ARC-IC-99-63, NASA Ames Research Center, CA (1999)
13. Braess, D.: Über ein Paradoxon der Verkehrsplanung. Unternehmensforschung 12 (1968) 258–268
14. Rohanimanesh, K., Mahadevan, S.: Learning to take concurrent actions. In: NIPS. (2002) 1619–1626
15. Hengst, B.: Discovering hierarchy in reinforcement learning with HEXQ. In Sammut, C., Hoffmann, A., eds.: Proceedings of the Nineteenth International Conference on Machine Learning, Morgan-Kaufman (2002) 243–250
16. Kaelbling, L.P.: Hierarchical learning in stochastic domains: Preliminary results. In: Machine Learning Proceedings of the Tenth International Conference, San Mateo, CA, Morgan Kaufmann (1993) 167–173
