International Journal of Advanced Technology and Engineering Exploration (IJATEE) ISSN (P): 2394-5443 ISSN (O): 2394-7454 Vol - 9, Issue - 90, May 2022
ETL for disease indicators using brute force rule-based NLP algorithm and metadata exploration

Ifra Altaf, Muheet Ahmed Butt and Majid Zaman

Abstract

Because data-driven decisions rest on facts, data collection can lay a foundation for decision-making in any industry. With the decision support provided by data from digital medical records, doctors can reach a precise diagnosis and provide adequate treatment by fitting together fundamentally different disease symptoms. This data manuscript describes the preparation of a diabetes dataset from liver and lipid profile panels. The data were collected from a medical center in Srinagar, Jammu and Kashmir, in the form of unstructured data reports. The unstructured data are extracted on the basis of the metadata of the source document; the required field values of the different tests are then extracted from the intermediate file using brute force pattern-matching heuristics and integrated to populate the relational database. The database can be used for further descriptive, exploratory, and predictive data analysis, and can help in diagnosing and predicting diabetes from the liver and lipid panels. This paper presents a novel concept, predicting and detecting one disease from the markers of other related diseases, as a way to fill the theoretical research gap. The detection rate achieved by the proposed brute force rule-based natural language processing (NLP) algorithm is 98.44%.
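The extraction step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, the alias lists, the sample report text, the table layout, and the helper names `extract_fields` and `store` are all assumptions made for illustration. The sketch tries every known test name against every line of an intermediate text file, takes the first number that follows a match, and loads the values into a relational table.

```python
import re
import sqlite3

# Hypothetical test names and spelling variants (assumed, not from the paper).
FIELD_ALIASES = {
    "alt": ["ALT", "SGPT"],
    "ast": ["AST", "SGOT"],
    "hdl": ["HDL CHOLESTEROL", "HDL"],
    "triglycerides": ["TRIGLYCERIDES"],
}

NUMBER = re.compile(r"\d+(?:\.\d+)?")

# Made-up intermediate text, standing in for a report converted from PDF.
SAMPLE_REPORT = """Patient ID: 1001
SGPT (ALT)        42 U/L
SGOT (AST)        37.5 U/L
HDL Cholesterol   45 mg/dL
Triglycerides     160 mg/dL
"""


def extract_fields(report_text):
    """Brute force: test every alias on every line, keep the first number after it."""
    found = {}
    for line in report_text.splitlines():
        upper = line.upper()
        for field, aliases in FIELD_ALIASES.items():
            if field in found:
                continue  # keep the first value seen for each field
            for alias in aliases:
                # Word boundaries stop "AST" from matching inside "FASTING".
                hit = re.search(r"\b" + re.escape(alias) + r"\b", upper)
                if hit is None:
                    continue
                value = NUMBER.search(upper, hit.end())
                if value:
                    found[field] = float(value.group())
                    break
    return found


def store(conn, patient_id, fields):
    """Insert one patient's extracted values into an assumed lab_results table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lab_results ("
        "patient_id TEXT PRIMARY KEY, alt REAL, ast REAL, "
        "hdl REAL, triglycerides REAL)"
    )
    conn.execute(
        "INSERT INTO lab_results VALUES (?, ?, ?, ?, ?)",
        (
            patient_id,
            fields.get("alt"),
            fields.get("ast"),
            fields.get("hdl"),
            fields.get("triglycerides"),
        ),
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store(connection, "1001", extract_fields(SAMPLE_REPORT))
    print(connection.execute("SELECT * FROM lab_results").fetchall())
```

Fields that no rule matches are stored as NULL, which keeps the missed values visible when a detection rate is later computed over the expected fields.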

Keywords

PDF scraping, Unstructured data, Diagnostic lab reports, Heuristics, Brute force, Natural language processing, Metadata, Information extraction.

Cite this article

Altaf I, Butt MA, Zaman M
