Is Vanilla Policy Gradient Overlooked? Analyzing Deep Reinforcement Learning for Hanabi

Bram Grooten (Corresponding author), Jelle Wemmenhove, Maurice Poot, Jim Portegies

Research output: Contribution to journal › Article › Academic


Abstract

In pursuit of enhanced multi-agent collaboration, we analyze several on-policy deep reinforcement learning algorithms in the recently published Hanabi benchmark. Our research suggests a perhaps counter-intuitive finding: Proximal Policy Optimization (PPO) is outperformed by Vanilla Policy Gradient across multiple random seeds in a simplified environment of this multi-agent cooperative card game. In our analysis of this behavior, we examine Hanabi-specific metrics and hypothesize a reason for PPO's plateau. In addition, we provide proofs for the maximum length of a perfect game (71 turns) and of any game (89 turns). Our code can be found at: https://github.com/bramgrooten/DeepRL-for-Hanabi
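For readers unfamiliar with the two algorithms the abstract contrasts, the sketch below shows the textbook loss functions of Vanilla Policy Gradient and of PPO's clipped surrogate objective. This is an illustrative sketch only, not the authors' implementation (see the linked repository for that); the tensor names (logp, returns, advantages) and the clip parameter eps=0.2 are assumptions chosen for clarity.

```python
import torch

def vpg_loss(logp, returns):
    # Vanilla Policy Gradient: maximize E[log pi(a|s) * R],
    # so we minimize the negative of that expectation.
    return -(logp * returns).mean()

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    # PPO clipped surrogate: the probability ratio between the new and old
    # policy is clipped to [1 - eps, 1 + eps], which limits how far a single
    # update can move the policy away from the data-collecting policy.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

if __name__ == "__main__":
    # Tiny example with made-up numbers for three sampled actions.
    logp = torch.tensor([-1.2, -0.8, -2.0])
    returns = torch.tensor([1.0, 0.5, -0.3])
    advantages = returns - returns.mean()
    print("VPG loss:", vpg_loss(logp, returns).item())
    print("PPO loss:", ppo_clip_loss(logp, logp.detach(), advantages).item())
```

The key difference is that VPG takes unconstrained gradient steps on the log-likelihood weighted by returns, whereas PPO's clipping keeps each update close to the previous policy; the paper's finding is that this conservatism does not always pay off in the simplified Hanabi setting.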
Original language: English
Article number: 2203.11656
Number of pages: 11
Journal: arXiv
Volume: 2022
Publication status: Published - 22 Mar 2022

Bibliographical note

Accepted at ALA 2022 (Adaptive and Learning Agents Workshop at AAMAS 2022)

