International Journal of Advanced Computer Research (IJACR) ISSN (Print): 2249-7277 ISSN (Online): 2277-7970 Volume - 10 Issue - 48 May - 2020
  1. 1
    Google Scholar
Implementation and performance analysis of dynamic partitioning of graphs in Apache Spark

Geetha J, Jayalakshmi D S and Harshit N G

Abstract

In recent years data is growing continuously at an exponential rate. Even in its processed form, large data is difficult to understand and analyze. One of the best approaches to handle large data is to represent it in the graphic format. These graphs can be used for data analysis, processing and decision making. Because of the volume of data is huge, the graphs generated by them are also huge making it difficult to process. Thus, the modular approach of studying them by partitioning them is much more effective. There are several inbuilt partitioning techniques available in Apache Spark which can be extended to graph partitioning according to the user needs. Graph partitioning is an active area of research with considerable activity with an aim towards increasing the accuracy and speed of the algorithms. This research work aims to build a high- performance graph partitioning technique for Apache Spark which makes it faster, scalable and efficient. In this paper, a custom algorithm that can divide the graph into an optimal number of partitions dynamically in Apache Spark is proposed. A novel method to calculate the distance between the nodes and similarity indexes to partition the data is introduced. The optimal number of partitions is decided to use similarity indexes, which are calculated using the concept of Laplacian matrix and Eigenvalues. The proposed algorithm is implemented and its performance is compared to the existing algorithms in Apache Spark. The results indicate that the algorithm partitions graphs which have a huge number of vertices with considerable efficiency and computing cost.

Keyword

Partition, Graphs, RDD, Clustering, Spark.

Cite this article

Geetha J, JayalakshmiHarshit.Implementation and performance analysis of dynamic partitioning of graphs in Apache Spark. International Journal of Advanced Computer Research. 2020;10(48):116-127. DOI:10.19101/IJACR.2020.1048023

Refference

[1]Mokashi VS, Kulkarni DB. A review: scalable parallel graph partitioning for complex networks. In second international conference on intelligent computing and control systems 2018 (pp. 1869-71). IEEE.

[2]Hassan M, Bansal SK. Data partitioning scheme for efficient distributed RDF querying using apache spark. In international conference on semantic computing 2019 (pp. 24-31). IEEE.

[3]Gounaris A, Kougka G, Tous R, Montes CT, Torres J. Dynamic configuration of partitioning in spark applications. IEEE Transactions on Parallel and Distributed Systems. 2017; 28(7):1891-904.

[4]Bertolucci M, Carlini E, Dazzi P, Lulli A, Ricci L. Static and dynamic big data partitioning on apache spark. In PARCO 2015 (pp. 489-98).

[5]Abughofa T, Zulkernine F. Towards online graph processing with spark streaming. In international conference on big data 2017 (pp. 2787-94). IEEE.

[6]Taloba AI, Riad MR, Soliman TH. Developing an efficient spectral clustering algorithm on large scale graphs in spark. In international conference on intelligent computing and information systems 2017 (pp. 292-8). IEEE.

[7]Atashkar AH, Ghadiri N, Joodaki M. Linked data partitioning for RDF processing on Apache Spark. In international conference on web research 2017 (pp. 73-7). IEEE.

[8]Tian X, Guo Y, Zhan J, Wang L. Towards memory and computation efficient graph processing on spark. In international conference on big data 2017 (pp. 375-82). IEEE.

[9]Rajan AK, Bhaiya D. Accelerated kerninghan lin algorithm for graph partitioning. In international conference on advances in computing, communications and informatics 2017 (pp. 174-8). IEEE.

[10]Wang M, Yang W, Li H, Lin Y, Chen J. Metis-CIC: a new mesh partitioning heuristic for parallel preconditioned iterative methods in CFD. In international conference on high performance computing & simulation 2016 (pp. 188-95). IEEE.

[11]Kyong J, Jeon J, Lim SS. Improving scalability of apache spark-based scale-up server through docker container-based partitioning. In proceedings of the international conference on software and computer applications 2017 (pp. 176-80).

[12]Olukanmi PO, Twala B. K-means-sharp: modified centroid update for outlier-robust k-means clustering. In pattern recognition association of south Africa and robotics and mechatronics 2017 (pp. 14-9). IEEE.