Load-balancing distributed outer joins through operator decomposition

Long Cheng (Corresponding author), Spyros Kotoulas, Qingzhi Liu, Ying Wang

Onderzoeksoutput: Bijdrage aan tijdschriftTijdschriftartikelAcademicpeer review

Uittreksel

High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, large amounts of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under varying data skew and present excellent load-balancing properties, compared to current approaches.

TaalEngels
Pagina's21-35
Aantal pagina's15
TijdschriftJournal of Parallel and Distributed Computing
Volume132
DOI's
StatusGepubliceerd - 1 okt 2019

Vingerafdruk

Load Balancing
Resource allocation
Join
Decomposition
Data storage equipment
Decompose
Operator
Skew
Partial
Load Balance
Distributed Environment
Distributed Memory
Experimental Evaluation
High Performance

Trefwoorden

    Citeer dit

    @article{86b14287dfdf43c688c3d35a65592786,
    title = "Load-balancing distributed outer joins through operator decomposition",
    abstract = "High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, large amounts of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under varying data skew and present excellent load-balancing properties, compared to current approaches.",
    keywords = "Data skew, Distributed join, Load balancing, Outer join, Spark",
    author = "Long Cheng and Spyros Kotoulas and Qingzhi Liu and Ying Wang",
    year = "2019",
    month = "10",
    day = "1",
    doi = "10.1016/j.jpdc.2019.05.008",
    language = "English",
    volume = "132",
    pages = "21--35",
    journal = "Journal of Parallel and Distributed Computing",
    issn = "0743-7315",
    publisher = "Academic Press Inc.",

    }

    Load-balancing distributed outer joins through operator decomposition. / Cheng, Long (Corresponding author); Kotoulas, Spyros; Liu, Qingzhi; Wang, Ying.

    In: Journal of Parallel and Distributed Computing, Vol. 132, 01.10.2019, blz. 21-35.

    Onderzoeksoutput: Bijdrage aan tijdschriftTijdschriftartikelAcademicpeer review

    TY - JOUR

    T1 - Load-balancing distributed outer joins through operator decomposition

    AU - Cheng,Long

    AU - Kotoulas,Spyros

    AU - Liu,Qingzhi

    AU - Wang,Ying

    PY - 2019/10/1

    Y1 - 2019/10/1

    N2 - High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, large amounts of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under varying data skew and present excellent load-balancing properties, compared to current approaches.

    AB - High-performance data analytics largely relies on being able to efficiently execute various distributed data operators such as distributed joins. So far, large amounts of join methods have been proposed and evaluated in parallel and distributed environments. However, most of them focus on inner joins, and there is little published work providing the detailed implementations and analysis of outer joins. In this work, we present POPI (Partial Outer join & Partial Inner join), a novel method to load-balance large parallel outer joins by decomposing them into two operations: a large outer join over data that does not present significant skew in the input and an inner join over data presenting significant skew. We present the detailed implementation of our approach and show that POPI is implementable over a variety of architectures and underlying join implementations. Moreover, our experimental evaluation over a distributed memory platform also demonstrates that the proposed method is able to improve outer join performance under varying data skew and present excellent load-balancing properties, compared to current approaches.

    KW - Data skew

    KW - Distributed join

    KW - Load balancing

    KW - Outer join

    KW - Spark

    UR - http://www.scopus.com/inward/record.url?scp=85066853800&partnerID=8YFLogxK

    U2 - 10.1016/j.jpdc.2019.05.008

    DO - 10.1016/j.jpdc.2019.05.008

    M3 - Article

    VL - 132

    SP - 21

    EP - 35

    JO - Journal of Parallel and Distributed Computing

    T2 - Journal of Parallel and Distributed Computing

    JF - Journal of Parallel and Distributed Computing

    SN - 0743-7315

    ER -