International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (P): 2394-5443 ISSN (O): 2394-7454 Vol - 8, Issue - 82, September 2021
Factors affecting cloud data-center efficiency: a scheduling algorithm-based analysis

Arif Ahmad Shehloo, Muheet Ahmed Butt and Majid Zaman

Abstract

Cloud computing offers two massively scalable services, computing capability and data storage, which are provided by large numbers of machines and clusters. The increased use of big data has led to the adoption of a wide range of analytics engines, among which Hadoop has gained widespread acceptance as a data analytics platform. Over the past decade, Hadoop's ability to schedule tasks has become a critical aspect of system performance. Numerous researchers have proposed scheduling methods to address the complex issue of performance degradation. However, few studies to date have evaluated the effectiveness of these methods. Using the PRISMA approach to search for and select papers, we examine the design choices behind the Hadoop scheduling techniques proposed between 2008 and 2021. We present a taxonomy for succinctly categorising these scheduling techniques. Additionally, we evaluate the methodologies against a variety of performance metrics. Our search identified 82 studies relevant to this domain, all of which came from high-quality conferences, journals, symposiums, and workshops. This systematic study discusses various dynamic, constrained, and adaptive scheduling methods and their primary motivations, including makespan, data control, deadline, resource utilisation, load balancing, fairness, energy efficiency, and failure recovery. It also discusses some unresolved issues and potential future directions for extending existing studies. In doing so, the review identifies the most critical factors affecting Hadoop scheduler performance and provides a roadmap for researchers working in this field. Finally, we intend to expand on the qualitative analysis conducted thus far and offer experts additional recommendations for future cloud scheduling research.

Keywords

Big data, Cloud computing, Apache Hadoop, MapReduce, Task scheduling.

Cite this article

Shehloo AA, Butt MA, Zaman M
