
AI-generated text detection and classification based on BERT deep learning algorithm
- 1 Beijing Yuandian Technology Inc.
- 2 Independent researcher, Beijing, 102200, China
- 3 Nio Inc., Shanghai, 201805, China
* Author to whom correspondence should be addressed.
Abstract
With the rapid development and wide application of deep learning, detecting AI-generated text plays an increasingly important role in many fields. In this study, we develop an AI-generated text detection model based on BERT. In the data preprocessing stage, the text is converted to lowercase, split into words, stripped of stop words, stemmed, and cleaned of digits and redundant spaces to ensure data quality. The dataset is divided into a training set and a test set in a 60%/40% ratio. During training, the accuracy increases steadily from an initial 94.78% to 99.72%, while the loss decreases from 0.261 to 0.021 and gradually converges, indicating that the BERT model detects AI-generated text with high accuracy and that its predictions increasingly match the true labels. Comparing the training and test sets, the average loss is 0.0565 on the training set and slightly higher, 0.0917, on the test set. The average accuracy is 98.1% on the training set and 97.71% on the test set; this small gap indicates that the model generalises well. In conclusion, the proposed BERT-based detection model shows high accuracy and stability in our experiments, providing an effective solution for related fields.
In the future, the model's performance can be further optimised and its application in a wider range of fields explored, to promote the development and application of AI technology in text detection.
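The preprocessing steps and the 60%/40% split described in the abstract can be illustrated with a minimal sketch. The abstract does not name the tools, stop-word list, or stemmer that were used, so everything below is an assumption made for illustration: `STOP_WORDS` is a tiny hypothetical list, `naive_stem` is a crude stand-in for a real stemmer (such as Porter's), and `preprocess` and `split_60_40` are hypothetical helper names. Dropping punctuation during word splitting is also an assumption.

```python
import random
import re

# Hypothetical, deliberately tiny stop-word list (the source does not specify one).
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were",
              "of", "to", "in", "and", "or", "it", "this", "that"}


def naive_stem(token: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemmer, not the authors' method."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> str:
    """Apply the steps listed in the abstract, in the order they are listed."""
    # 1. Lowercase, then 2. split into word tokens (punctuation is dropped: assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    # 3. Remove stop words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 4. Stem each remaining token.
    tokens = [naive_stem(t) for t in tokens]
    # 5. Remove digits, discarding tokens that become empty.
    tokens = [re.sub(r"\d", "", t) for t in tokens]
    tokens = [t for t in tokens if t]
    # 6. Joining with single spaces eliminates redundant whitespace.
    return " ".join(tokens)


def split_60_40(samples, seed=0):
    """Shuffle reproducibly and split into 60% training and 40% test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = (len(shuffled) * 60) // 100  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    print(preprocess("The 3 models are   detecting GPT4 texts!"))
    train, test = split_60_40(range(10))
    print(len(train), len(test))
```

In practice the cleaned strings would then be passed to a BERT tokenizer and classifier. Note that aggressive steps such as stemming and stop-word removal change the input a pretrained BERT model sees, so whether they help is an empirical question this sketch does not settle.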
Keywords
AI-Generated Text Detection, BERT, Average accuracy
Cite this article
Wang, H.; Li, J.; Li, Z. (2024). AI-generated text detection and classification based on BERT deep learning algorithm. Theoretical and Natural Science, 39, 311-316.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
About volume
Volume title: Proceedings of the 2nd International Conference on Mathematical Physics and Computational Simulation
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.