International Journal of Advanced Computer Research (IJACR) ISSN (P): 2249-7277 ISSN (O): 2277-7970 Vol - 10, Issue - 47, March 2020
Traditional machine learning and big data analytics in virtual screening: a comparative study

Sahar K. Hussin, Yasser M. Omar, Salah M. Abdelmageid and Mahmoud I. Marie

Abstract

The amount of data that needs to be processed keeps growing, so high-performance computing (HPC) and big data analytics are required. In the same context, drug discovery research has reached a point where it has no alternative but to use HPC and big data processing systems to meet its goals in a reasonable time. Virtual screening (VS) is one of the costliest tasks in terms of computational requirements and is considered an intensive, heavy task. At the same time, it plays an essential role in new drug design. This research investigates machine learning and big data analytics in VS. It considers ligand-based and structure-based approaches that rank the compounds in molecular databases by their activity against a specific target protein. Machine learning algorithms, including random forests, naive Bayesian classifiers, neural networks, decision trees, support vector machines, and deep learning strategies, have been developed for both ligand-based screening and structure-based docking. This paper also reviews previous research on the use of machine learning and big data analytics frameworks in VS. It outlines the current progress in applying traditional machine learning methods and big data analytics applications to multi-node datasets. It compares the performance of machine learning approaches with that of large-scale ligand-based systems, and it explores how machine learning approaches can improve performance on various virtual screening classification problems in large repositories. Finally, various challenges and solutions for virtual screening datasets in machine learning and big data analytics are discussed.

Keywords

Drug discovery, Virtual screening, Descriptors, Machine learning and Big data analytics frameworks.

Cite this article

Hussin SK, Omar YM, Abdelmageid SM, Marie MI
