Multi-agent RMax for multi-agent multi-armed bandits

Jan 1, 2022
Eugenio Bargiacchi, Raphael Avalos, Timothy Verstraeten, Pieter Libin, Ann Nowé, Diederik M. Roijers
Abstract
In this paper, we provide PAC bounds for best-arm identification in multi-agent multi-armed bandits (MAMABs), via an algorithm we call multi-agent RMax (MARMax). In a MAMAB, the reward structure is expressed as a coordination graph, i.e., the total team reward is a sum over local reward functions that are conditioned on the actions of a subset of the agents. Assuming that these local reward functions are bounded, we derive a PAC(δ, ε) learning algorithm that achieves an accuracy of ε relative to the maximum team reward, with a number of required samples at most linear in the size of the largest reward function, with probability 1 − δ. Furthermore, we show that in practice MARMax performs better than this theoretical upper bound on the sample complexity by an order of magnitude. We further improve on the empirical result by tempering the optimism used in MARMax, resulting in a new algorithm that we call MAVMax. We show empirically that MAVMax further cuts down the number of required samples by around 30% w.r.t. MARMax.
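To make the coordination-graph structure and the RMax-style optimism concrete, the following is a minimal sketch in Python. The toy problem (three agents, two local factors, Bernoulli local rewards) and the "known" threshold M are illustrative assumptions, not the paper's experimental setup or exact algorithm; the brute-force maximization over joint actions also stands in for the structure-exploiting maximization (e.g., variable elimination) one would use on a real coordination graph.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Toy MAMAB: 3 agents, 2 actions each, coordination graph with two
# local factors f0(a0, a1) and f1(a1, a2). Local rewards are Bernoulli,
# hence bounded in [0, 1] (so R_max = 1), matching the boundedness assumption.
N_AGENTS, N_ACTIONS = 3, 2
FACTORS = [(0, 1), (1, 2)]                 # agent scope of each local factor
TRUE_MEANS = [rng.uniform(size=(2, 2)),    # hidden mean reward of f0
              rng.uniform(size=(2, 2))]    # hidden mean reward of f1

R_MAX = 1.0   # upper bound on each local reward
M = 50        # samples before a local action counts as "known" (hypothetical choice)

counts = [np.zeros((2, 2)) for _ in FACTORS]
sums = [np.zeros((2, 2)) for _ in FACTORS]

def sample_local_reward(f, local_action):
    """Draw a Bernoulli local reward for factor f at the given local action."""
    return float(rng.random() < TRUE_MEANS[f][local_action])

def optimistic_value(f, local_action):
    """Empirical mean once sampled at least M times, else the optimistic R_max."""
    c = counts[f][local_action]
    return sums[f][local_action] / c if c >= M else R_MAX

def greedy_joint_action():
    """Maximize the sum of optimistic local values over all joint actions
    (brute force here; a real coordination graph would exploit its structure)."""
    best, best_val = None, -np.inf
    for joint in itertools.product(range(N_ACTIONS), repeat=N_AGENTS):
        val = sum(optimistic_value(f, tuple(joint[i] for i in scope))
                  for f, scope in enumerate(FACTORS))
        if val > best_val:
            best, best_val = joint, val
    return best

# RMax-style loop: pull the optimistic greedy joint action, observing one
# sample of every local reward it touches, until all of its local actions
# are known; that joint action is then the recommendation.
for t in range(10_000):
    joint = greedy_joint_action()
    locals_ = [tuple(joint[i] for i in scope) for scope in FACTORS]
    if all(counts[f][la] >= M for f, la in enumerate(locals_)):
        break
    for f, la in enumerate(locals_):
        counts[f][la] += 1
        sums[f][la] += sample_local_reward(f, la)

print("recommended joint action:", joint, "after", t, "rounds")
```

Tempering the optimism, as the abstract describes for MAVMax, would amount to replacing the hard R_max fallback in optimistic_value with a softer data-dependent upper value; the sketch keeps the plain RMax rule for clarity.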
Type
Publication
Proc. of Adaptive and Learning Agents Workshop (ALA) at AAMAS