
Comparison of K-Means, K-Medoids and K-Means++ algorithms based on the Calinski-Harabasz index for COVID-19 epidemic in China
- 1 Taiyuan University of Technology
* Author to whom correspondence should be addressed.
Abstract
The novel coronavirus spreads from person to person through close contact and through respiratory droplets produced by coughing or sneezing. Numerous studies have been conducted worldwide to deal with COVID-19. However, no cure for the virus has been found, and efficient data processing methods for sudden outbreaks have not yet been identified. This study compares three clustering algorithms on the same data set to determine which is the best data processing method. The data come from the Chinese Center for Disease Control and Prevention and include two attributes: confirmed cases and deaths. We selected the data from the initial stage of the outbreak until October 31, 2021, and compared the clustering results for the spread of the novel coronavirus in China produced by the K-Means, K-Medoids and K-Means++ algorithms. Comparing the Calinski-Harabasz index values from K=2 to K=10 shows that the three algorithms have almost the same clustering effect when K does not exceed 6, but that the K-Medoids clustering effect is significantly better when K is greater than 6. Therefore, of the three clustering algorithms used, K-Medoids is the best method for clustering the spread of the novel coronavirus outbreak in China. The results of this study provide ideas for future researchers choosing a cluster analysis method to process data effectively in the early stages of an epidemic.
Keywords
COVID-19, Calinski-Harabasz, K-Means, K-Medoids, K-Means++
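The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the synthetic two-attribute points stand in for the confirmed-case and death counts, which are not reproduced here, and the helper names (`cluster`, `calinski_harabasz`, `compare`, `make_blobs`) are hypothetical. K-Medoids is approximated by a simple alternating update rather than the full PAM swap procedure. The index is computed as the between-cluster dispersion divided by (K - 1), over the within-cluster dispersion divided by (n - K); higher values indicate more compact, better-separated clusters.

```python
import math
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeanspp_seeds(points, k, rng):
    """K-Means++ seeding: sample each new center with probability
    proportional to its squared distance from the nearest chosen center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers

def cluster(points, k, seeding="random", medoids=False, iters=50, seed=0):
    """Alternate between assignment and center update until labels settle.
    medoids=False gives Lloyd-style K-Means (cluster mean as center);
    medoids=True restricts each center to a member point (simplified K-Medoids)."""
    rng = random.Random(seed)
    if seeding == "kmeans++":
        centers = kmeanspp_seeds(points, k, rng)
    else:
        centers = rng.sample(points, k)
    labels = assign(points, centers)
    for _ in range(iters):
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if not members:
                new_centers.append(centers[j])  # keep the center of an empty cluster
            elif medoids:
                # medoid: the member with the smallest total distance to the others
                new_centers.append(min(
                    members,
                    key=lambda m: sum(math.sqrt(sq_dist(m, q)) for q in members)))
            else:
                new_centers.append(mean(members))
        centers = new_centers
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

def calinski_harabasz(points, labels):
    """Calinski-Harabasz index; needs at least two non-empty clusters."""
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    n, k = len(points), len(groups)
    overall = mean(points)
    between = within = 0.0
    for members in groups.values():
        centroid = mean(members)
        between += len(members) * sq_dist(centroid, overall)
        within += sum(sq_dist(p, centroid) for p in members)
    return (between / (k - 1)) / (within / (n - k))

def compare(points, ks):
    """Calinski-Harabasz index per algorithm and per K."""
    variants = {
        "K-Means": dict(seeding="random", medoids=False),
        "K-Means++": dict(seeding="kmeans++", medoids=False),
        "K-Medoids": dict(seeding="random", medoids=True),
    }
    return {name: {k: calinski_harabasz(points, cluster(points, k, **opts))
                   for k in ks}
            for name, opts in variants.items()}

def make_blobs(seed=1):
    """Synthetic stand-in data: three Gaussian blobs of 30 points each."""
    rng = random.Random(seed)
    blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in blob_centers for _ in range(30)]

if __name__ == "__main__":
    for name, scores in compare(make_blobs(), range(2, 11)).items():
        print(name, {k: round(v, 1) for k, v in scores.items()})
```

Running the script prints one row of index values per algorithm for K=2 to K=10, the same layout of comparison the study reports. The values come from synthetic data and say nothing about the study's own results.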
Cite this article
Hu, Z. (2024). Comparison of K-Means, K-Medoids and K-Means++ algorithms based on the Calinski-Harabasz index for COVID-19 epidemic in China. Applied and Computational Engineering, 49, 11-20.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 4th International Conference on Signal Processing and Machine Learning
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.