The uncoordinated spectrum access problem is studied using a multi-player multi-armed bandit framework. We consider a decentralized multi-player stochastic multi-armed bandit model in which the players cannot communicate with each other and can observe only their own actions and rewards. Furthermore, the environment may appear differently to different players, i.e., the reward distributions of a given arm may vary across players. Knowledge of the time horizon T is not assumed. Under these conditions, we consider two settings: zero and non-zero reward on collision (i.e., when more than one player plays the same arm). In the zero-reward-on-collision setting, we present a policy that achieves expected regret of O(log T) over a time horizon of duration T. While non-zero rewards on collisions and player-dependent reward distributions of arms have each been considered separately in prior work, to the best of our knowledge a model allowing for both has not been studied previously. For this setting, we present a policy that achieves expected regret of order O(log^{2 + \delta} T) for some 0 < \delta < 1 over a time horizon of duration T.
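To make the model concrete, the following is a minimal sketch (not taken from the paper) of the decentralized multi-player bandit environment described above, in the zero-reward-on-collision setting. It assumes Bernoulli arm rewards for illustration; the class name MultiPlayerBanditEnv and its interface are hypothetical. Each player observes only its own action and reward, and the mean of each arm may differ across players.

```python
import numpy as np


class MultiPlayerBanditEnv:
    """Illustrative simulator of the multi-player bandit model (an assumption, not the paper's code)."""

    def __init__(self, means, rng=None):
        # means[j][k] = mean reward of arm k for player j
        # (player-dependent reward distributions; Bernoulli arms assumed here)
        self.means = np.asarray(means, dtype=float)
        self.n_players, self.n_arms = self.means.shape
        self.rng = rng or np.random.default_rng()

    def step(self, arms):
        # arms[j] = arm chosen by player j in this round
        arms = np.asarray(arms)
        counts = np.bincount(arms, minlength=self.n_arms)
        rewards = np.empty(self.n_players)
        for j, k in enumerate(arms):
            if counts[k] > 1:
                rewards[j] = 0.0  # zero reward on collision
            else:
                rewards[j] = self.rng.binomial(1, self.means[j, k])
        # Each player would observe only rewards[j]; no communication between players.
        return rewards


# Example: 3 players, 5 arms, player-dependent means; players 0 and 1 collide on arm 0.
rng = np.random.default_rng(0)
env = MultiPlayerBanditEnv(rng.uniform(0.1, 0.9, size=(3, 5)), rng=rng)
print(env.step([0, 0, 2]))
```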