Samenvatting
We consider a special case of bandit problems, named batched bandits, in which an agent observes batches of responses over a certain time period. Unlike previous work, we consider a more practically relevant batch-centric scenario of batch learning. That is to say, we provide a policy-agnostic regret analysis and demonstrate upper and lower bounds for the regret of a candidate policy. Our main theoretical results show that the impact of batch learning is a multiplicative factor of batch size relative to the regret of online behavior. Primarily, we study two settings of the stochastic linear bandits: bandits with finitely and infinitely many arms. While the regret bounds are the same for both settings, the former setting results hold under milder assumptions. Also, we provide a more robust result for the 2-armed bandit problem as an important insight. Finally, we demonstrate the consistency of theoretical results by conducting empirical experiments and reflect on optimal batch size choice.
Originele taal-2 | Engels |
---|---|
Titel | Proceedings - 22nd IEEE International Conference on Data Mining, ICDM 2022 |
Redacteuren | Xingquan Zhu, Sanjay Ranka, My T. Thai, Takashi Washio, Xindong Wu |
Pagina's | 1149-1154 |
Aantal pagina's | 6 |
ISBN van elektronische versie | 9781665450997 |
DOI's | |
Status | Gepubliceerd - 2022 |