Annealing-pareto multi-objective multi-armed bandit algorithm

S.Q. Yahyaa, M.M. Drugan, B. Manderick

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

15 Citations (Scopus)

Abstract

In the stochastic multi-objective multi-armed bandit (MOMAB), each arm generates a vector of stochastic rewards, one per objective, instead of a single scalar reward. As a result, there is no single optimal arm but a set of optimal arms (the Pareto front) of reward vectors under the Pareto dominance relation, and there is a trade-off between finding the set of optimal arms (exploration) and selecting the optimal arms fairly or evenly (exploitation). To trade off exploration against exploitation, either the Pareto knowledge gradient (Pareto-KG for short) or the Pareto upper confidence bound (Pareto-UCB1 for short) can be used; they combine the KG policy and the UCB1 policy, respectively, with the Pareto dominance relation. In this paper, we propose Pareto Thompson sampling, which uses the Pareto dominance relation to find the Pareto front. We also propose the annealing-Pareto algorithm, which trades off exploration and exploitation by using a decaying parameter in combination with the Pareto dominance relation: the decaying parameter drives the exploration of the Pareto optimal arms, while the Pareto dominance relation is used to exploit the Pareto front. We experimentally compare Pareto-KG, Pareto-UCB1, Pareto Thompson sampling, and the annealing-Pareto algorithm on multi-objective Bernoulli distribution problems, and we conclude that annealing-Pareto is the best-performing algorithm.
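As a concrete illustration of the Pareto dominance relation and the exploration/exploitation split the abstract describes, the following is a minimal Python sketch. The function names, the uniform exploration step, and the geometric decay schedule are illustrative assumptions for exposition, not the paper's exact definitions.

```python
import random

def dominates(u, v):
    """Pareto dominance: u dominates v iff u is at least as good in every
    objective and strictly better in at least one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(means):
    """Indices of arms whose mean-reward vectors are non-dominated
    by any other arm's vector."""
    return [i for i, u in enumerate(means)
            if not any(dominates(v, u) for j, v in enumerate(means) if j != i)]

def annealing_pareto_select(means, t, decay=0.99):
    """Annealing-style selection sketch (the decay schedule is an assumption):
    with a probability that shrinks over time, explore a uniformly random arm;
    otherwise exploit a uniformly random arm from the current Pareto front."""
    eps_t = decay ** t  # decaying exploration parameter
    if random.random() < eps_t:
        return random.randrange(len(means))
    return random.choice(pareto_front(means))
```

With estimated mean vectors `[(0.2, 0.8), (0.8, 0.2), (0.5, 0.5), (0.1, 0.1)]`, the first three arms are mutually non-dominated and form the Pareto front, while the last arm is dominated by all of them; as `t` grows, the selection rule pulls Pareto-optimal arms almost exclusively.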

Original language: English
Title of host publication: IEEE SSCI 2014 - 2014 IEEE Symposium Series on Computational Intelligence - ADPRL 2014: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 9-12 December 2014, Orlando, Florida
Place of publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 1-3
ISBN (Print): 9781479945535
DOIs
Publication status: Published - 14 Jan 2015
Externally published: Yes
Event: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2014) - Orlando, United States
Duration: 9 Dec 2014 - 12 Dec 2014

Conference

Conference: 2014 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2014)
Abbreviated title: ADPRL 2014
Country/Territory: United States
City: Orlando
Period: 9/12/14 - 12/12/14
