TY - JOUR
T1 - Improving the robustness and performance of parallel joins over distributed systems
AU - Cheng, L.
AU - Kotoulas, S.
AU - Ward, T.E.
AU - Theodoropoulos, G.
PY - 2017/11
Y1 - 2017/11
N2 - High-performance data processing systems typically utilize numerous servers with large amounts of memory. An essential operation in such environments is the parallel join, the performance of which is critical for data-intensive operations. In many real-world workloads, data skew is omnipresent. Techniques that do not cater for the possibility of data skew often suffer from performance failures and memory problems. State-of-the-art methods designed to handle data skew propose new ways to distribute computation that avoid hotspots. However, this comes at the expense of global collection of statistics, redundant computation, duplication of data or increased network communication. In this light, performance could be further improved by removing the dependency on global skew knowledge and broadcasting. In this paper, we propose a new method called PRPQ (partial redistribution & partial query), which targets efficient and robust joins on large datasets over high-performance clusters. We present the detailed implementation of our approach and compare its performance with current implementations. The experimental results demonstrate that the proposed algorithm is scalable and robust and can also outperform the state-of-the-art approach with less network communication, figures that confirm our theoretical analysis.
AB - High-performance data processing systems typically utilize numerous servers with large amounts of memory. An essential operation in such environments is the parallel join, the performance of which is critical for data-intensive operations. In many real-world workloads, data skew is omnipresent. Techniques that do not cater for the possibility of data skew often suffer from performance failures and memory problems. State-of-the-art methods designed to handle data skew propose new ways to distribute computation that avoid hotspots. However, this comes at the expense of global collection of statistics, redundant computation, duplication of data or increased network communication. In this light, performance could be further improved by removing the dependency on global skew knowledge and broadcasting. In this paper, we propose a new method called PRPQ (partial redistribution & partial query), which targets efficient and robust joins on large datasets over high-performance clusters. We present the detailed implementation of our approach and compare its performance with current implementations. The experimental results demonstrate that the proposed algorithm is scalable and robust and can also outperform the state-of-the-art approach with less network communication, figures that confirm our theoretical analysis.
KW - Data skew
KW - High performance computing
KW - Parallel joins
KW - Robust
UR - http://www.scopus.com/inward/record.url?scp=85024925710&partnerID=8YFLogxK
U2 - 10.1016/j.jpdc.2017.06.016
DO - 10.1016/j.jpdc.2017.06.016
M3 - Article
SN - 0743-7315
VL - 109
SP - 310
EP - 323
JO - Journal of Parallel and Distributed Computing
JF - Journal of Parallel and Distributed Computing
ER -