Abstract
Big data analytics platforms have played a critical role in the unprecedented success of data-driven applications. However, real-time and streaming data applications, together with recent legislation such as GDPR in Europe, have imposed constraints on exchanging and analyzing data, especially personal data, across geographic regions. To address such constraints, data has to be processed and analyzed in situ, and aggregated results have to be exchanged among the different sites for further processing. The geographic distribution of the sites introduces additional network delays, which can potentially affect the performance of analytics platforms designed to operate in datacenters with low network delays. In this paper, we show that the three most popular big data analytics systems (Apache Storm, Apache Spark, and Apache Flink) fail to tolerate round-trip times of more than 30 milliseconds, even when the input data rate is low. The execution time of distributed big data analytics tasks degrades substantially beyond this threshold, and some of the systems are more sensitive than others. A closer examination of the design of these systems shows that no single system wins in all wide-area settings. However, we show that it is possible to significantly improve the performance of all these popular big data analytics systems even under transcontinental delays (where inter-node delay is more than 30 milliseconds) and to achieve performance comparable to that within a datacenter for the same load.
Original language | English |
---|---|
Article number | 9833909 |
Pages (from-to) | 4734 - 4749 |
Number of pages | 16 |
Journal | IEEE Transactions on Network and Service Management |
Volume | 19 |
Issue number | 4 |
Publication status | Published - Dec 2022 |
Keywords
- Delays
- Wide area networks
- Bandwidth
- Task analysis
- Big Data
- Distributed databases
- Data analysis
- geo-distributed systems
- big data analytics
- Wide-area analytics
- networked systems
Press/Media
- Findings from Technical University Berlin (TU Berlin) Yields New Data on Data Analytics (Delay-resistant Geo-distributed Analytics)
24/03/23
1 item of Media coverage
Press/Media: Expert Comment