Load-balancing distributed outer joins through operator decomposition

Long Cheng (Corresponding author), Spyros Kotoulas, Qingzhi Liu, Ying Wang

Research output: Contribution to journalArticleAcademicpeer-review

3 Citations (Scopus)

Abstract

High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, large amounts of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under varying data skew and present excellent load-balancing properties, compared to current approaches.

Original languageEnglish
Pages (from-to)21-35
Number of pages15
JournalJournal of Parallel and Distributed Computing
Volume132
DOIs
Publication statusPublished - 1 Oct 2019

Funding

Part of this work is supported by the European Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 799066 . The authors also thank the reviewers for their careful reading of the manuscript and many insightful comments and suggestions, which have greatly helped in improving the work. Long Cheng is a Marie Skłodowska-Curie Fellow at University College Dublin. He received a B.E. from Harbin Institute of Technology, China in 2007, a M.Sc from Universität Duisburg–Essen, Germany in 2010 and a Ph.D. from National University of Ireland Maynooth in 2014. He has worked at organizations such as Huawei Technologies Germany, IBM Research Ireland, TU Dresden and TU Eindhoven. His research interests mainly include Parallel and distributed computing, Big data, Semantic web and Process mining. Spyros Kotoulas is a Research Scientist at IBM Research Ireland and the Manager of the Health and Person-Centered Systems research group. His research interests lie in artificial intelligence and data management for semi-structured data, parallel methods for data intensive processing, knowledge representation and reasoning, flexible data integration methods and the Internet of Things. The application domain of his research is Health and Social Care. Spyros has received his PhD from Vrije Universiteit Amsterdam. Qingzhi Liu is a Post-Doc researcher at Eindhoven University of Technology (TU/e). He received the B.S. degree in Telecommunication and M.Eng. degree in Software Engineering from Xidian University, China, in 2005 and 2008 respectively. He received the M.S. degree (with Honor) and the Ph.D. degree in Computer Science from Delft University of Technology, Netherlands, in 2011 and 2016 respectively. His research interests include Internet of Things, wireless power transfer networks, and distributed wireless network. Ying Wang is an Associate Professor at Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). He received the B.S. and M.S. degrees in Electrical Engineering from Harbin Institute of Technology, in 2007 and 2009 respectively, and the Ph.D. degree of computer science from ICT, CAS, Beijing, in 2014. His research interests includes computer architecture and VLSI design, specifically memory system, on-chip interconnects, resilient and energy-efficient architecture, and machine learning accelerators.

Keywords

  • Data skew
  • Distributed join
  • Load balancing
  • Outer join
  • Spark

Fingerprint

Dive into the research topics of 'Load-balancing distributed outer joins through operator decomposition'. Together they form a unique fingerprint.

Cite this