Multi-agent rmax for multi-agent multi-armed bandits

Jan 1, 2022·

Eugenio Bargiacchi

Raphael Avalos

Timothy Verstraeten

Pieter Libin

Ann Nowé

Diederik M Roijers

· 0 min read

PDF Cite

Abstract

In this paper, we provide PAC bounds for best-arm identificationin multi-agent multi-armed bandits (MAMABs), via an algorithmwe call multi-agent RMax (MARMax). In a MAMAB, the rewardstructure is expressed as a coordination graph, i.e., the total teamreward is a sum over local reward functions that are conditionedon the actions of a subset of the agents. Assuming that these localrewards functions are bounded, we derive a PAC(𝛿, 𝜀) learningalgorithm that achieves an accuracy of 𝜀 relative to the maximumteam reward, with a number of required samples at most linear inthe size of the largest reward reward function with probability 1 −𝛿.Furthermore, we show that in practice MARMax performs muchbetter than this theoretical upper bound on the sample complexityby an order of magnitude. We further improve on the empiricalresult by tempering the optimism used in MARMax, resulting ina new algorithm that we call MAVMax. We show empirically thatMAVMax further cuts down the number of required samples byaround 30% w.r.t. MARMax.

Type

Journal article

Publication

Proc. of Adaptive and Learning Agents Workshop (ALA) at AAMAS