
Application and analysis of text similarity in text clustering in the Chinese context
1 Shaanxi University of Science & Technology
* Author to whom correspondence should be addressed.
Abstract
With the development of the Internet, information is shared more widely, and the amount of information each user is exposed to keeps increasing. How to find the information people want among so much material is an important question. The vast majority of these resources are textual. The most intuitive manifestation of the problem arises with search engines: when a user enters a piece of text to find relevant websites, a poor algorithm returns unsatisfactory results. Therefore, this paper studies the application of text similarity in text clustering in the Chinese context. First, the basic concept of text similarity is introduced, and text clustering is explained from three aspects: definition, application, and general processing workflow. Second, drawing on existing materials, some mainstream clustering algorithms are summarized. Then, building on the above, the similarity calculation methods used in text clustering are analyzed. Finally, these methods are compared and analyzed according to experimental results obtained in a Python environment.
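As a rough illustration of what "text similarity" can mean for Chinese text, the sketch below compares short sentences by the cosine similarity of their character-bigram counts. This is a hypothetical example, not the method evaluated in the paper: the choice of character bigrams (which avoids the need for a word segmenter), the function names `char_ngrams` and `cosine_similarity`, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + n]) for i in range(len(chars) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    # Counter returns 0 for missing keys, so unshared n-grams contribute nothing.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Illustrative sentences (assumed, not taken from the paper's data).
docs = ["我喜欢看电影", "我喜欢看电视", "今天天气很好"]
vectors = [char_ngrams(d) for d in docs]

sim_01 = cosine_similarity(vectors[0], vectors[1])  # share four of five bigrams
sim_02 = cosine_similarity(vectors[0], vectors[2])  # share no bigrams
print(f"{sim_01:.2f} {sim_02:.2f}")
```

A clustering algorithm would then group texts whose pairwise similarity is high. In practice, word-level features from a Chinese word segmenter with TF-IDF weighting are a common alternative to raw character bigrams.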
Keywords
text similarity, text clustering, artificial intelligence
Cite this article
Fan, W. (2023). Application and analysis of text similarity in text clustering in the Chinese context. Applied and Computational Engineering, 21, 71-77.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
Disclaimer/Publisher's Note
The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
About volume
Volume title: Proceedings of the 5th International Conference on Computing and Data Science
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish in this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (see Open access policy for details).