Exploration versus exploitation trade-off in infinite horizon Pareto multi-armed bandits algorithms

M.M. Drugan, B. Manderick

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer review

4 Citations (Scopus)
1 Download (Pure)

Abstract

Multi-objective multi-armed bandits (MOMAB) are multi-armed bandits (MAB) extended to reward vectors. We use the Pareto dominance relation to assess the quality of reward vectors, as opposed to scalarization functions. In this paper, we study the exploration vs exploitation trade-off in infinite-horizon MOMAB algorithms. Single-objective MABs explore the suboptimal arms and exploit a single optimal arm. MOMABs also explore the suboptimal arms, but in addition they need to exploit all optimal arms fairly. We study the exploration vs exploitation trade-off of the Pareto UCB1 algorithm. We extend UCB2, another popular infinite-horizon MAB algorithm, to reward vectors using the Pareto dominance relation. We analyse the properties of the proposed MOMAB algorithms in terms of upper regret bounds. We experimentally compare the exploration vs exploitation trade-off of the proposed MOMAB algorithms on a bi-objective Bernoulli environment coming from control theory.
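To make the arm-selection idea concrete, below is a minimal Python sketch of one Pareto-UCB1-style selection step. It is an illustration under stated assumptions, not the paper's exact algorithm: rewards are vectors, the exploration bonus is the classic UCB1 term (the paper's Pareto UCB1 uses a slightly larger, dimension-dependent bonus), and the fair exploitation of all optimal arms is realized by choosing uniformly at random among the arms whose index vectors lie on the Pareto front. All function and variable names are illustrative.

```python
import math
import random

def dominates(u, v):
    """True if reward vector u Pareto-dominates v:
    u >= v in every objective and u > v in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_ucb1_choice(means, counts, t):
    """Select an arm from the Pareto front of UCB index vectors.

    means[i]  -- empirical mean reward vector of arm i
    counts[i] -- number of times arm i has been pulled (assumed > 0)
    t         -- total number of pulls so far
    """
    # Per-objective UCB index vector for each arm (classic UCB1 bonus;
    # an assumption here, as the paper's bonus also depends on the
    # number of objectives and of Pareto-optimal arms).
    bonus = [math.sqrt(2.0 * math.log(t) / counts[i]) for i in range(len(counts))]
    idx = [[m + bonus[i] for m in means[i]] for i in range(len(means))]
    # Keep the arms whose index vector is not dominated by any other arm's.
    front = [i for i in range(len(idx))
             if not any(dominates(idx[j], idx[i]) for j in range(len(idx)) if j != i)]
    # Exploit all Pareto-optimal arms fairly: uniform choice over the front.
    return random.choice(front)
```

The uniform choice over the front is what distinguishes the multi-objective setting from single-objective UCB1, which would exploit a single optimal arm.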

Original language: English
Title: Proceedings of the International Conference on Agents and Artificial Intelligence : Lisbon, Portugal, 10-12 January 2015
Place of production: s.l.
Publisher: SciTePress Digital Library
Pages: 66-77
Number of pages: 12
Volume: 2
ISBN (print): 9789897580741
Status: Published - 2015
Externally published: Yes
Event: 7th International Conference on Agents and Artificial Intelligence (ICAART 2015) - Lisbon, Portugal
Duration: 10 Jan 2015 - 12 Jan 2015
Conference number: 7
http://www.icaart.org/?y=2015

Conference

Conference: 7th International Conference on Agents and Artificial Intelligence (ICAART 2015)
Abbreviated title: ICAART 2015
Country/Region: Portugal
City: Lisbon
Period: 10/01/15 - 12/01/15
