Efficient data redistribution to speedup big data analytics in large systems

  • L. Cheng
  • T. Li

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

21 Citations (Scopus)

Abstract

The performance of parallel data analytics systems becomes increasingly important with the rise of Big Data. An essential operation in such environments is the parallel join, which always incurs significant network communication cost. State-of-the-art approaches have improved on conventional implementations by minimizing network traffic or communication time. However, these approaches still face performance issues in the presence of big data and/or large-scale systems because of the heavy overhead of their data redistribution scheduling. In this paper, we propose near-join, a network-aware redistribution approach that aims to efficiently reduce both the network traffic and the communication time of join executions. In particular, near-join is lightweight and adaptable to processing large datasets over large systems. We present the details of our algorithm and its implementation. Experiments performed on a cluster of up to 400 nodes with datasets of about 100 GB demonstrate that our scheduling algorithm is much faster than the state-of-the-art methods. Moreover, our join implementation also achieves speedups over the conventional approaches.
Original language: English
Title of host publication: Proceedings of the 23rd IEEE International Conference on High Performance Computing, 19-22 December 2016, Hyderabad, India
Place of Publication: Piscataway
Publisher: Institute of Electrical and Electronics Engineers
Pages: 91-100
Number of pages: 10
ISBN (Electronic): 9781509054114
Publication status: Published - 1 Feb 2017
Event: The 23rd IEEE International Conference on High Performance Computing - HiPC, Hyderabad, India
Duration: 19 Dec 2016 to 22 Dec 2016
http://www.hipc.org/hipc2016/index.php

Conference

Conference: The 23rd IEEE International Conference on High Performance Computing
Country/Territory: India
City: Hyderabad
Period: 19/12/16 to 22/12/16

Keywords

  • data analytics
  • data locality
  • data-intensive computing
  • high performance computing
  • parallel joins
