Machine Learning Prediction Models for Colorectal Cancer Based on the Novel Ensemble Framework

Research Article
Open access

Qing Mi 1*
  • 1 SHU-UTS SILC Business School, Shanghai University, Shanghai, China
  • *Corresponding author: miqing0822@shu.edu.cn
ACE Vol.166
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-80590-177-8
ISBN (Online): 978-1-80590-178-5

Abstract

Colorectal cancer (CRC) is a highly prevalent malignancy worldwide, and early prediction is crucial for improving prognosis. This study used a multidimensional CRC dataset (n = 1000) from the Kaggle platform, which contains 14 clinical and lifestyle features. First, class imbalance was mitigated through Random Oversampling (ROM), and the features were standardized. Seven baseline machine learning models, including Gradient Boosting Decision Tree (GBDT) and eXtreme Gradient Boosting (XGBoost), were then evaluated. Based on performance metrics such as accuracy and F1 score, GBDT and XGBoost were selected as the best-performing base learners. Finally, the predicted probabilities generated by the base learners were fed as features into meta-learners such as Random Forest (RF), K-Nearest Neighbors (KNN), and Multi-Layer Perceptron (MLP) for second-level modeling. Model interpretability was provided by SHapley Additive exPlanations (SHAP) values, which quantify the marginal contribution of each feature to the prediction. Experiments showed that the RF meta-learner stacked on the GBDT and XGBoost base models performed best (accuracy of 0.9527 and AUC of 0.9923). SHAP analysis showed that Activity_Level and BMI were the core predictors, with limited contribution from gender, confirming the priority of exercise and weight management in CRC prevention. The framework demonstrated excellent robustness, maintaining its predictive advantage even when lower-performing base models, e.g., Logistic Regression (LR), were introduced. This study provides an interpretable machine learning paradigm for CRC risk stratification with potential for clinical translation.
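
To make the phrase "marginal contribution of each feature" concrete, the following is a minimal sketch of the exact Shapley value computation that SHAP values are based on. It is not the study's code and does not use the SHAP library: the function `shapley_values`, the scoring function `toy_risk`, its coefficients, and the all-zero baseline are hypothetical and chosen only for illustration. The sketch assumes that a feature absent from a coalition is replaced by its baseline value, and it enumerates every coalition, which is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one instance x (brute force over coalitions).

    Assumption: features outside a coalition take their baseline value.
    """
    n = len(x)

    def value(coalition):
        # Evaluate the model with only the coalition's features "switched on".
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight |S|! (n - |S| - 1)! / n! for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i when added to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_risk(z):
    """Hypothetical risk score over (activity_level, bmi, gender); not the paper's model."""
    activity, bmi, gender = z
    return 0.5 - 0.4 * activity + 0.2 * bmi + 0.1 * activity * bmi + 0.0 * gender


x = [1.0, 1.0, 1.0]          # instance to explain (illustrative values)
baseline = [0.0, 0.0, 0.0]   # reference input (illustrative values)
phi = shapley_values(toy_risk, x, baseline)

for name, contribution in zip(["activity_level", "bmi", "gender"], phi):
    print(f"{name}: {contribution:+.3f}")
```

In this toy setting the contributions sum to the difference between the score at `x` and the score at the baseline, and a feature the score ignores (here `gender`) receives a contribution of zero. The interaction term is split equally between the two features involved.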

Keywords:

Ensemble machine learning, colorectal cancer, SHAP interpretability

Mi, Q. (2025). Machine Learning Prediction Models for Colorectal Cancer Based on the Novel Ensemble Framework. Applied and Computational Engineering, 166, 20-30.
Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of CONF-SEML 2025 Symposium: Machine Learning Theory and Applications

ISBN: 978-1-80590-177-8 (Print) / 978-1-80590-178-5 (Online)
Editor: Hui-Rang Hou
Conference date: 18 May 2025
Series: Applied and Computational Engineering
Volume number: Vol.166
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series' published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).