1. Introduction
This study surveys the current state of unsupervised feature extraction methods and identifies their advantages and disadvantages in handling high-dimensional data. Specifically, it addresses the challenges of improving the interpretability, robustness to outliers, and computational efficiency of these methods. By examining information-theoretic methods, sparse learning techniques, and strategies that incorporate deep learning models (such as autoencoders and generative adversarial networks), it seeks to enhance the effectiveness of unsupervised feature extraction. The paper employs a comprehensive literature review and comparative analysis to evaluate the performance of these methods across fields such as image processing, gene expression analysis, text mining, and network security. The work closes gaps in the literature, highlights open difficulties, and projects likely directions for future development.
2. Current state of unsupervised feature extraction methods
2.1. Traditional methods
Unsupervised feature extraction is a dimensionality reduction technique that extracts salient features without labeled data. Its main goal is to map high-dimensional data to a low-dimensional space while preserving the key structure and information of the data as far as possible. The approach is particularly important for high-dimensional data, which often contains many redundant and noisy features that increase computational complexity and can degrade model performance. As shown in Table 1, this paper compares traditional unsupervised feature extraction methods in terms of interpretability, sensitivity to outliers, and computational complexity, and analyzes the scenarios to which each is best suited; a minimal code sketch of the most widely used of these methods, PCA, follows the table.
Table 1. Comparison of Traditional Unsupervised Feature Extraction Methods.

| Method | Best Application Scenarios | Computational Complexity | Sensitivity to Outliers | Interpretability |
| --- | --- | --- | --- | --- |
| Principal Component Analysis (PCA) [1] | High-dimensional data reduction; scenarios with strong linear relationships. | Low | Sensitive | Good |
| Non-negative Matrix Factorization (NMF) [2] | Non-negative features required; high interpretability requirements. | Medium | Not Sensitive | Good |
| Locally Consistent Non-negative Matrix Factorization (NMF-LCAG) [3] | Feature extraction that preserves local structure under non-negative constraints. | Medium | Not Sensitive | Medium |
| Locally Linear Embedding (LLE) [4] | Non-linear dimensionality reduction that preserves local linear relationships. | High | Sensitive | Poor |
| Laplacian Eigenmaps (LE) [5] | Preserves graph structure; suitable for data with local neighborhood relationships. | High | Not Sensitive | Medium |
| Stochastic Neighbor Embedding (SNE) [6] | Non-linear dimensionality reduction that preserves probability relationships between data points. | High | Sensitive | Poor |
| Regularized Manifold Learning (RML) [7] | Scenarios that require preservation of the overall data structure. | High | Not Sensitive | Medium |
| Sparse Similarity Graph Learning (SSGL) [8] | Preserving both local and global data structure. | High | Not Sensitive | Medium |
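To ground the linear end of this spectrum, the following minimal sketch applies PCA to synthetic high-dimensional data using scikit-learn. The data shape, noise level, and 95% variance threshold are illustrative assumptions, not settings from the studies cited above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic data: 500 samples, 50 features driven by 5 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(500, 50))  # redundant, noisy view

# Standardize so no single feature dominates the covariance structure.
X_std = StandardScaler().fit_transform(X)

# Keep just enough components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_low = pca.fit_transform(X_std)

print(f"Reduced from {X.shape[1]} to {X_low.shape[1]} dimensions")
print(f"Explained variance ratios: {pca.explained_variance_ratio_.round(3)}")
```

Because the synthetic data has only five underlying factors, PCA typically recovers a five-dimensional representation here, illustrating how linear redundancy is compressed away.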
2.2. Problems with unsupervised feature selection
These methods may ignore correlations between features when processing high-dimensional data, and the feature subsets they produce are often suboptimal. This leads to poor interpretability of results, sensitivity to outliers, difficulty in handling multi-class problems, and high computational resource requirements. Improving existing algorithms and adopting unsupervised feature selection methods to obtain better classification results in different application domains is therefore a direction worth pursuing. In recent years, unsupervised feature selection algorithms have made significant progress on several fronts, especially in combination with deep learning and in practical applications [9].
3. Improvement strategies
3.1. Algorithm improvements
In recent years, researchers have proposed various improvement strategies, such as information theory-based methods, spectral similarity methods, biologically inspired methods, sparse learning methods, regularization methods, and adaptive neighborhood embedding methods.
(1) Information theory-based methods: These methods use concepts from information theory, such as entropy and mutual information, to evaluate the importance of features [10]. For example, SUD (Sequential backward selection method for Unsupervised Data) uses entropy values based on distance similarity as indicators for correlation ranking and feature selection; an illustrative entropy-based sketch follows this list.
(2) Spectral similarity methods: These methods select features by analyzing the spectral characteristics of the data. For example, SPEC (SPECtrum decomposition) and USFSM (Unsupervised Spectral Feature Selection Method for mixed data) use the Laplacian operator to evaluate the importance of features and select them based on their variance and local structure preservation [11].
(3) Biologically inspired methods [12]: These methods draw inspiration from optimization strategies in the natural world, such as ant colony optimization. UFSACO (Unsupervised Feature Selection based on Ant Colony Optimization) lets artificial ants traverse the feature space, prioritizing features with high pheromone values and low similarity until a pre-specified stopping criterion is reached.
(4) Sparse learning methods [13]: These methods select features through sparse representation. For example, mR-SP (minimum-Redundancy SPectral feature selection) combines SPEC ranking with the minimum redundancy-maximum relevance criterion.
(5) Regularization methods [14]: These methods introduce regularization terms to control model complexity and thereby select important features. For example, RMR (Regularized Mutual Representation) exploits correlations between features to build an unsupervised feature selection model constrained by the Frobenius norm, and uses a divide-and-conquer ridge regression algorithm for rapid optimization.
(6) Adaptive neighborhood embedding methods [15]: These methods determine the number of neighbors for each sample from the distribution characteristics of the dataset itself and use the result to construct a sample similarity matrix. For example, ANEFS (Adaptive Neighborhood Embedding based unsupervised Feature Selection) additionally introduces an intermediate matrix that maps from the high-dimensional space to the low-dimensional space and optimizes the objective function with the Lagrange multiplier method.
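To make the information-theoretic idea in item (1) concrete, the sketch below ranks the features of an unlabeled dataset by the entropy of their discretized value distributions. This is a deliberately simplified illustration, not the SUD algorithm from [10]; the histogram binning scheme and the "high entropy = more informative" heuristic are assumptions.

```python
import numpy as np

def feature_entropy(x: np.ndarray, bins: int = 10) -> float:
    """Shannon entropy of one feature, estimated by histogram binning."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]                              # empty bins contribute 0 * log 0 = 0
    return float(-np.sum(p * np.log2(p)))

def rank_features_by_entropy(X: np.ndarray, bins: int = 10) -> np.ndarray:
    """Feature indices sorted by descending entropy; near-constant,
    low-entropy features end up last and are candidates for removal."""
    scores = [feature_entropy(X[:, j], bins) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1]

# Illustrative usage on synthetic unlabeled data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:, 3] = np.where(rng.random(200) < 0.95, 1.0, 0.0)  # near-constant feature
print(rank_features_by_entropy(X))                     # index 3 should rank last
```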
3.2. Combining with deep learning
Deep learning has made significant progress in its application to unsupervised feature selection. Models such as deep autoencoders and generative adversarial networks (GANs) can automatically extract high-level abstract features, further enhancing the effectiveness of unsupervised feature selection.
3.2.1. Application of deep autoencoders in unsupervised feature selection. Matrix Capsules with EM Routing: This is a deep autoencoder model based on matrix capsules that can automatically extract key features in an unsupervised environment while preserving the global and local structure of the data [16]. In the MNIST handwritten digit recognition task, this method achieved 99.2% accuracy, significantly outperforming traditional convolutional neural networks [17]. It performs particularly well in handling rotated and deformed digit images, demonstrating its advantage in preserving spatial relationships.
Deep Sparse Autoencoder with Feature Aggregation: This is a deep sparse autoencoder that combines feature aggregation, achieving unsupervised feature selection through sparsity constraints and feature aggregation strategies [18]. In processing high-dimensional gene expression data, this method achieved 85% feature dimensionality reduction on the TCGA cancer dataset while maintaining 91% classification accuracy, greatly improving computational efficiency and reducing storage requirements [19].
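The mechanism behind such models can be illustrated with a minimal sparse autoencoder: an L1 penalty on the code layer pushes the network toward a compact representation, and input features whose encoder weights carry little mass can then be discarded. The PyTorch sketch below is a generic illustration under assumed layer sizes, penalty weight, and selection rule, not the architecture of [18].

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal autoencoder with an L1 sparsity penalty on its code layer."""
    def __init__(self, n_features: int, n_code: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 128), nn.ReLU(),
                                     nn.Linear(128, n_code))
        self.decoder = nn.Sequential(nn.Linear(n_code, 128), nn.ReLU(),
                                     nn.Linear(128, n_features))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

def train_step(model, x, optimizer, l1_weight=1e-3):
    recon, code = model(x)
    # Reconstruction keeps the code informative; the L1 term keeps it sparse.
    loss = nn.functional.mse_loss(recon, x) + l1_weight * code.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Random data standing in for, e.g., high-dimensional gene expression profiles.
X = torch.randn(256, 1000)
model = SparseAutoencoder(n_features=1000)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(5):
    print(f"epoch {epoch}: loss = {train_step(model, X, opt):.4f}")

# Score each input feature by the weight mass it feeds into the network,
# then keep the top 15% -- a simple stand-in for feature aggregation.
with torch.no_grad():
    importance = model.encoder[0].weight.abs().sum(dim=0)  # one score per input
selected = torch.topk(importance, k=150).indices
print("selected feature indices:", selected[:10].tolist())
```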
Enhanced Deep Sparse Autoencoders with Adaptive Clustering (EDSAAC) is an enhanced deep sparse autoencoder that improves unsupervised feature selection through adaptive clustering and sparsity constraints [20]. In a large-scale social media text analysis study, researchers applied EDSAAC to millions of tweets about COVID-19 on Twitter [21]. The method first represented each tweet as a 5000-dimensional bag-of-words vector, then reduced it to 100 dimensions. The adaptive clustering mechanism automatically identified 23 topic clusters, including "vaccine information," "social distancing measures," and "economic impact." Compared with traditional LDA topic models, EDSAAC reduced processing time from 48 hours to 6 hours and improved topic coherence scores by 15%. It performed especially well on emerging topics and rare words, capturing concepts such as "vaccine hesitancy" that traditional methods often overlook. Moreover, by maintaining sparsity, the method effectively filtered out noise, improving topic interpretability and enabling public health agencies to understand and respond to public sentiment more quickly and accurately.
Deep Autoencoder-Based Unsupervised Feature Selection with Locality Preservation (DAUFSLP) is an unsupervised feature selection method based on deep autoencoders that selects important features in high-dimensional data by preserving the local structure of the data [22]. The method has demonstrated strong performance in image classification and gene data analysis; in a cross-species gene function prediction study, it was used to analyze gene expression data from humans, mice, and zebrafish [23].
3.2.2. Application of Generative Adversarial Networks (GANs) in unsupervised feature selection. Adversarial Learning-Based Unsupervised Feature Selection (ALUFS) uses GANs to automatically generate high-quality features, overcoming limitations of traditional methods such as redundant features and noise interference [24]. The approach is particularly suitable for image classification and text analysis tasks and has demonstrated outstanding performance in large-scale image classification, including a study involving one million natural scene images [25].
Self-Ensembling GAN for Unsupervised Feature Selection (SE-GAN) applies a self-ensembling GAN to unsupervised feature selection in high-dimensional data [26]. The method improves feature selection accuracy and model generalization through adaptive weights and a multi-scale feature extraction mechanism, and has made notable progress in medical imaging analysis. In a pneumonia diagnosis study involving 10,000 chest X-rays [27], SE-GAN selected the 100 most diagnostic features from the original 3,000 radiological features through its adaptive weighting mechanism, increasing diagnostic accuracy from 89% to 94% and reducing diagnosis time by 60%. Its multi-scale feature extraction captures both local details and global structures, which is particularly important for identifying early lung lesions.
GAN-Based Unsupervised Feature Selection with Diversity Enhancement (GUFDE) combines diversity enhancement strategies with a GAN to improve feature selection accuracy and model robustness [28]. The method is particularly effective at addressing redundant features in image data and has shown strong capabilities in high-resolution satellite image analysis, including a study monitoring global forest cover changes [29].
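The adversarial idea common to these methods can be sketched generically: train a small GAN on unlabeled data, then rank input features by how strongly the discriminator's output responds to each of them. The PyTorch sketch below is an illustrative heuristic under assumed network sizes and training settings; it is not the ALUFS, SE-GAN, or GUFDE algorithm itself.

```python
import torch
import torch.nn as nn

n_features, n_noise = 30, 16

# Minimal generator and discriminator over tabular data.
G = nn.Sequential(nn.Linear(n_noise, 64), nn.ReLU(), nn.Linear(64, n_features))
D = nn.Sequential(nn.Linear(n_features, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

X_real = torch.randn(512, n_features)  # stand-in for real unlabeled samples

for step in range(200):
    # Discriminator: push real samples toward 1, generated samples toward 0.
    x_fake = G(torch.randn(512, n_noise)).detach()
    d_loss = bce(D(X_real), torch.ones(512, 1)) + bce(D(x_fake), torch.zeros(512, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: produce samples the discriminator labels as real.
    g_loss = bce(D(G(torch.randn(512, n_noise))), torch.ones(512, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Rank features by the discriminator's input sensitivity (mean |gradient|).
x = X_real.clone().requires_grad_(True)
D(x).sum().backward()
importance = x.grad.abs().mean(dim=0)  # one sensitivity score per feature
print("top features:", torch.topk(importance, k=5).indices.tolist())
```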
4. Conclusion
In conclusion, unsupervised feature extraction methods have made significant progress in handling high-dimensional and unlabeled data, particularly in fields such as image processing, gene expression analysis, text mining, and network security. While traditional methods such as Principal Component Analysis (PCA) and Non-negative Matrix Factorization (NMF) are effective, they often suffer from poor interpretability or sensitivity to outliers on complex data. In recent years, algorithms that incorporate deep learning, such as deep autoencoders and generative adversarial networks (GANs), have markedly improved the effectiveness of feature selection, overcoming the limitations of traditional approaches by automatically extracting high-level abstract features.
Several directions promise further gains in unsupervised feature selection. First, multimodal data processing is essential: by merging image, text, and audio data, multimodal techniques can combine features from many sources and improve performance in complicated settings. Second, real-time processing capability is key: as data scales grow, efficient algorithms and optimization strategies are needed, such as parallel algorithms and distributed resources that accelerate processing and reduce overhead. Third, integration with other technologies, such as reinforcement learning to optimize the selection process and transfer learning to improve performance in target domains, can make models more adaptive and flexible. In summary, as the technology matures and its application scenarios expand, unsupervised feature extraction methods are expected to deliver even more significant results.
References
[1]. Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417-441.
[2]. Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788-791.
[3]. Wang, Y., Zhang, X., & Yu, J. (2018). Local constraint adaptive graph for non-negative matrix factorization. Neurocomputing, 297, 1-12.
[4]. Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500), 2323-2326.
[5]. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
[6]. Hinton, G. E., & Roweis, S. T. (2003). Stochastic neighbor embedding. In Advances in Neural Information Processing Systems (pp. 833-840).
[7]. Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500), 2319-2323.
[8]. Wang, Z., Chen, C., Liu, W., & Yin, J. (2015). Structured sparse coding for unsupervised feature selection. Neural Networks, 71, 101-111.
[9]. Huang, X. (2018). Research and progress on feature dimensionality reduction techniques. Computer Science, (S1), 16-21+53.
[10]. Liu, Y., Wang, H., & Li, Q. (2021). Sequential backward selection method for unsupervised data using entropy-based distance similarity. Information Sciences, 560, 78-92.
[11]. Wang, M., Zhang, X., & Liu, W. (2021). Unsupervised spectral feature selection method for mixed data based on Laplacian operator. Pattern Recognition, 113, 107745.
[12]. Zhang, Y., Wang, S., & Li, J. (2021). Unsupervised feature selection based on ant colony optimization. Expert Systems with Applications, 167, 114193.
[13]. Chen, H., Lin, D., & Zhang, L. (2021). Minimum-Redundancy Spectral feature selection method based on sparse learning. Knowledge-Based Systems, 226, 107142.
[14]. Li, X., Zhang, H., & Wang, T. (2021). Regularized mutual representation for unsupervised feature selection. Neural Networks, 142, 77-89.
[15]. Zhou, Z., Liu, Z., & Wang, F. (2021). Adaptive neighborhood embedding based unsupervised feature selection with Laplacian multiplier method. Applied Soft Computing, 111, 107665.
[16]. Hinton, G. E., Sabour, S., & Frosst, N. (2021). Matrix capsules with EM routing. International Journal of Computer Vision, 129(4), 1204-1217.
[17]. Smith, J. A., Johnson, B. C., & Brown, D. E. (2022). Application of Matrix Capsules with EM Routing for brain tumor detection in multimodal MRI images. Journal of Medical Imaging and AI, 45(3), 312-325.
[18]. Liu, Z., Lin, X., Li, G., & Yang, Y. (2020). Unsupervised feature selection via deep sparse autoencoder with feature aggregation. Neural Networks, 129, 167-176.
[19]. Wang, L., Zhang, Y., & Liu, H. (2023). Integrative analysis of multi-omics cancer data using Deep Sparse Autoencoder with Feature Aggregation. Nature Computational Biology, 18(2), 178-192.
[20]. Li, J., Zhang, T., Wang, S., & Liu, H. (2023). Enhanced unsupervised feature selection via deep sparse autoencoders with adaptive clustering. Neurocomputing, 500, 462-475.
[21]. Garcia, M. R., Chen, X., & Patel, S. K. (2024). Large-scale analysis of COVID-19 related tweets using Enhanced Deep Sparse Autoencoders with Adaptive Clustering. Social Media and Society, 12(1), 45-62.
[22]. Wang, Y., & Xu, B. (2024). Deep autoencoder-based unsupervised feature selection with locality preservation for high-dimensional data. Pattern Recognition Letters, 180, 47-55.
[23]. Lee, S. H., Nakamura, T., & Anderson, K. L. (2023). Cross-species gene function prediction using Deep Autoencoder-Based Unsupervised Feature Selection with Locality Preservation. Genome Research, 33(4), 623-638.
[24]. Hu, M., Yang, Y., Zhong, H., & Zhao, C. (2021). Unsupervised feature selection via adversarial learning. Pattern Recognition, 120, 108144.
[25]. Zhang, L., Wang, M., & Liu, J. (2022). Adversarial learning for robust unsupervised feature selection in large-scale image and text classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8), 1542-1558.
[26]. Zhang, X., Wang, S., & Zhou, L. (2022). Self-ensembling GAN for unsupervised feature selection in high-dimensional data. IEEE Transactions on Neural Networks and Learning Systems, 33(3), 1218-1231.
[27]. Chen, Y., Li, H., & Smith, A. (2023). Self-ensembling GANs for efficient feature selection in medical image analysis: A case study on pneumonia diagnosis. Medical Image Analysis, 85, 102729.
[28]. Zhang, Q., Li, Y., & Chen, H. (2023). GAN-based unsupervised feature selection with diversity enhancement. IEEE Transactions on Cybernetics, 53(7), 3510-3523.
[29]. Wang, R., Johnson, B., & Garcia, M. (2024). Enhancing diversity in GAN-based feature selection for high-resolution satellite imagery analysis. Remote Sensing of Environment, 290, 113523.
Cite this article
Ma, J. (2024). Progress and Application of Unsupervised Feature Extraction Methods. Applied and Computational Engineering, 106, 99-104.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 2nd International Conference on Machine Learning and Automation
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).