Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi

Research output: Contribution to journal › Journal article › Academic


Abstract

In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms in the recently published Hanabi benchmark. Our research suggests a perhaps counter-intuitive finding: Proximal Policy Optimization (PPO) is outperformed by Vanilla Policy Gradient over multiple random seeds in a simplified environment of this multi-agent cooperative card game. In our analysis of this behavior, we examine Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and any game (89 turns). Our code can be found at: https://github.com/bramgrooten/DeepRL-for-Hanabi
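For readers unfamiliar with the two objectives being compared, the following is a minimal sketch (in PyTorch, not the authors' implementation from the repository above) contrasting the Vanilla Policy Gradient loss with PPO's clipped surrogate. Tensor shapes, the advantage estimates, and the placeholder data are illustrative assumptions only.

```python
import torch

def vpg_loss(log_probs, advantages):
    """Vanilla Policy Gradient: ascend E[log pi(a|s) * A] (negated for a minimizer)."""
    return -(log_probs * advantages).mean()

def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped surrogate: limits how far the new policy can move
    from the old one within a single update."""
    ratio = torch.exp(log_probs - old_log_probs)      # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # Random placeholder batch, purely for illustration.
    log_probs = torch.randn(64, requires_grad=True)
    old_log_probs = log_probs.detach() + 0.01 * torch.randn(64)
    advantages = torch.randn(64)
    print("VPG loss:", vpg_loss(log_probs, advantages).item())
    print("PPO loss:", ppo_clip_loss(log_probs, old_log_probs, advantages).item())
```

The key design difference is that VPG applies the raw policy-gradient estimator, while PPO's ratio clipping constrains each update step; the paper's finding is that this extra machinery did not translate into better performance in their simplified Hanabi setting.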
Original language: English
Article number: 2203.11656
Number of pages: 11
Journal: arXiv
Volume: 2022
DOIs
Status: Published - 22 Mar 2022

Bibliographical note

Accepted at ALA 2022 (Adaptive and Learning Agents Workshop at AAMAS 2022)

Keywords

  • cs.LG
  • cs.AI
  • cs.MA
