Research Article
Open access

Evaluation of dimensionality reduction and unsupervised clustering methods in breast datasets

JunFan Liu 1*, Weide He 2, YuXin Wang 3, BoYi Zhang 4
  • 1 Kanazawa University, Kanazawa, Japan
  • 2 Nanjing University of Information Science and Technology, Nanjing, China
  • 3 Hangzhou Xuejun High School, Hangzhou, China
  • 4 Wuhan Britain-China School, Wuhan, China
  • *Corresponding author: liujunfan@stu.kanazawa-u.ac.jp
Published on 31 January 2024 | https://doi.org/10.54254/2755-2721/31/20230153
ACE Vol.31
ISSN (Print): 2755-2721
ISSN (Online): 2755-273X
ISBN (Print): 978-1-83558-287-9
ISBN (Online): 978-1-83558-288-6

Abstract

This paper examines the challenges of handling high-dimensional data in large datasets, such as increased computational cost and the sparsity that arises as data point density diminishes. Dimensionality reduction techniques including Principal Component Analysis (PCA), Kernel Principal Component Analysis (KPCA), and Diffusion Maps are discussed and evaluated for their effectiveness in extracting the most informative data features, which aids in gaining a comprehensive understanding of the data. The study also examines unsupervised clustering methods such as K-means, DBSCAN, and spectral clustering; by pairing these clustering methods with the dimensionality reduction techniques, we aim to uncover potential synergies. The principles and methodology behind spectral clustering and unsupervised nonlinear diffusion learning are dissected in further detail, and several datasets are used to evaluate the techniques empirically. The final section of the paper evaluates the clustering results and discusses potential avenues for future research.
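As a concrete illustration of the pipeline described above, the sketch below pairs two of the dimensionality reduction methods with the three clustering methods and scores every combination. It is a minimal example, not the authors' implementation: it assumes the scikit-learn Wisconsin breast cancer dataset as a stand-in for the paper's breast data, the hyperparameters (component counts, kernel gamma, eps, cluster counts) are purely illustrative, and Diffusion Maps are omitted because scikit-learn does not ship an implementation.

```python
# Hypothetical evaluation sketch (not the authors' code): pair dimensionality
# reduction with clustering and score each combination. The scikit-learn
# Wisconsin breast cancer dataset is assumed as a stand-in for the paper's data.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA, KernelPCA
from sklearn.cluster import KMeans, DBSCAN, SpectralClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)  # standardize features before reduction

# Illustrative hyperparameters only (n_components, gamma, eps, n_clusters).
reducers = {
    "PCA": PCA(n_components=2, random_state=0),
    "KPCA (rbf)": KernelPCA(n_components=2, kernel="rbf", gamma=0.01),
}
clusterers = {
    "K-means": KMeans(n_clusters=2, n_init=10, random_state=0),
    "DBSCAN": DBSCAN(eps=0.5, min_samples=5),
    "Spectral": SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                                   random_state=0),
}

for r_name, reducer in reducers.items():
    Z = reducer.fit_transform(X)  # low-dimensional embedding
    for c_name, clusterer in clusterers.items():
        labels = clusterer.fit_predict(Z)
        # External validity: agreement with the benign/malignant diagnosis labels.
        ari = adjusted_rand_score(y, labels)
        # Internal validity: silhouette, defined only when more than one cluster
        # is found (DBSCAN marks noise as -1, which we exclude from the count).
        n_found = len(set(labels)) - (1 if -1 in labels else 0)
        sil = silhouette_score(Z, labels) if n_found > 1 else float("nan")
        print(f"{r_name:>10} + {c_name:<8} clusters={n_found} "
              f"ARI={ari:.3f} silhouette={sil:.3f}")
```

Comparing the adjusted Rand index and silhouette scores across the grid is one simple way to surface the synergies mentioned in the abstract, for example whether a nonlinear embedding helps a density-based method more than it helps K-means.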

Keywords:

Dimensionality Reduction, Unsupervised Clustering, Machine Learning


References

[1]. Ronald R Coifman and Stéphane Lafon. Diffusion maps. Applied and Computational Harmonic Analysis, 21(1):5–30, 2006.

[2]. Per-Erik Danielsson. Euclidean distance mapping. Computer Graphics and Image Processing, 14(3):227–248, 1980.

[3]. Karl J Friston. Functional and effective connectivity: a review. Brain Connectivity, 1(1):13–36, 2011.

[4]. Haibin Ling and Kazunori Okada. Diffusion distance for histogram comparison. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 246–253. IEEE, 2006.

[5]. Mohiuddin Ahmed, Raihan Seraj, and Syed Mohammed Shamsul Islam. The k-means algorithm: A comprehensive survey and performance evaluation. Electronics, 9(8):1295, 2020.

[6]. K Krishna and M Narasimha Murty. Genetic k-means algorithm. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 29(3):433–439, 1999.

[7]. David Arthur and Sergei Vassilvitskii. How slow is the k-means method? In Proceedings of the twenty-second annual symposium on Computational geometry, pages 144–153, 2006.

[8]. Miguel Á Carreira-Perpiñán. A review of mean-shift algorithms for clustering. arXiv preprint arXiv:1503.00687, 2015.

[9]. Michael Hahsler, Matthew Piekenbrock, and Derek Doran. dbscan: Fast density-based clustering with R. Journal of Statistical Software, 91:1–30, 2019.

[10]. Kamran Khan, Saif Ur Rehman, Kamran Aziz, Simon Fong, and Sababady Sarasvady. DBSCAN: Past, present and future. In The Fifth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2014), pages 232–238. IEEE, 2014.

[11]. Andrew Ng, Michael Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 14, 2001.

[12]. Seyoung Park and Hongyu Zhao. Spectral clustering based on learning similarity matrix. Bioinformatics, 34(12):2069–2076, 2018.

[13]. Marco Di Summa, Andrea Grosso, and Marco Locatelli. Branch and cut algorithms for detecting critical nodes in undirected graphs. Computational Optimization and Applications, 53:649–680, 2012.

[14]. A Cantoni and P Butler. Eigenvalues and eigenvectors of symmetric centrosymmetric matrices. Linear Algebra and its Applications, 13(3):275–288, 1976.


Cite this article

Liu, J.; He, W.; Wang, Y.; Zhang, B. (2024). Evaluation of dimensionality reduction and unsupervised clustering methods in breast datasets. Applied and Computational Engineering, 31, 218-228.

Data availability

The datasets used and/or analyzed during the current study are available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2023 International Conference on Machine Learning and Automation

ISBN: 978-1-83558-287-9 (Print) / 978-1-83558-288-6 (Online)
Editor: Mustafa İSTANBULLU
Conference website: https://2023.confmla.org/
Conference date: 18 October 2023
Series: Applied and Computational Engineering
Volume number: Vol. 31
ISSN: 2755-2721 (Print) / 2755-273X (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish with this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).
