Multi-agent RMax for Multi-Agent Multi-Armed Bandits


In this paper, we provide PAC bounds for best-arm identification in multi-agent multi-armed bandits (MAMABs), via an algorithm we call multi-agent RMax (MARMax). In a MAMAB, the reward structure is expressed as a coordination graph, i.e., the total team reward is a sum over local reward functions that are conditioned on the actions of a subset of the agents. Assuming that these local rewards functions are bounded, we derive a PAC($\delta, \varepsilon$) learning algorithm that achieves an accuracy of $\varepsilon$ relative to the maximum team reward, with a number of required samples at most linear in the size of the largest reward reward function with probability $1-\delta$. Furthermore, we show that in practice MARMax performs much better than this theoretical upper bound on the sample complexity by an order of magnitude. We further improve on the empirical result by tempering the optimism used in MARMax, resulting in a new algorithm that we call MAVMax. We show empirically that MAVMax further cuts down the number of required samples by around 30% w.r.t. MARMax.

At the Adaptive Learning Agents Workshop of the 21st International Conference on Autonomous Agents and Multiagent Systems.
Raphael Avalos
Raphael Avalos
PhD Candidate

PhD candidate in Multi-Agent Reinforcement Learning.