Research Article
Open access
Published on 31 January 2024
Ren, X. (2024). Predictions of diabetes through machine learning models based on the health indicators dataset. Applied and Computational Engineering, 32, 216-222.

Predictions of diabetes through machine learning models based on the health indicators dataset

Xinyi Ren *,1
  • 1 Lancaster University, Lancashire, LA1 4YW, The United Kingdom

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2755-2721/32/20230214

Abstract

Diabetes is a chronic disease that is widespread in the United States. Patients with diabetes lose the ability to regulate blood glucose levels effectively, and the disease increases the economic burden on patients and has an enormous public health impact. The main purpose of this paper is to identify the indicators that are strongly associated with diabetes and to build a model that predicts the disease. The original data come from the Behavioral Risk Factor Surveillance System (BRFSS) of the Centers for Disease Control and Prevention (CDC). This project uses a cleaned version of the 2015 data published on Kaggle, which contains 253,680 survey responses with the target variable diabetes and 21 feature variables. The Chi-square test was applied to investigate the association between the feature indicators and diabetes, and several machine learning models were built to predict the disease. The selected model is the CatBoost classifier, which reaches 86.6% accuracy on the testing set. According to the permutation feature importance computed with the CatBoost classifier, the five most important features were general health (GenHlth), body mass index (BMI), age (Age), high blood pressure (HighBP), and high cholesterol (HighChol).
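
To illustrate the kind of association test the abstract refers to, the following is a minimal sketch of a Pearson Chi-square test of independence for one binary indicator against the binary diabetes outcome. It is not the author's code: the function name chi_square_2x2 and the counts in the usage lines are hypothetical and are not taken from the BRFSS data. The sketch assumes a 2x2 contingency table, for which the test has one degree of freedom and the p-value can be written with the complementary error function.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 contingency table.

    `table` is [[a, b], [c, d]] with rows = indicator (yes, no) and
    columns = outcome (yes, no). Returns (statistic, p_value).
    No continuity correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("a row or column total is zero")
            statistic += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical counts for illustration only (NOT from the BRFSS dataset):
# rows = HighBP (yes, no), columns = diabetes (yes, no).
hypothetical_counts = [[30, 10], [10, 30]]
statistic, p_value = chi_square_2x2(hypothetical_counts)
print(f"chi-square = {statistic:.2f}, p = {p_value:.2e}")
```

A small p-value suggests that the indicator and the outcome are not independent. Features with more than two levels, such as GenHlth or Age, would need a general r x c table and the corresponding degrees of freedom, which this sketch does not cover.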

Keywords

diabetes prediction, machine learning, health indicators, classification

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 2023 International Conference on Machine Learning and Automation

Conference website: https://2023.confmla.org/
ISBN: 978-1-83558-289-3 (Print) / 978-1-83558-290-9 (Online)
Conference date: 18 October 2023
Editor: Mustafa İSTANBULLU
Series: Applied and Computational Engineering
Volume number: Vol. 32
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.