Research Article
Open access
Published on 26 July 2024
He, S. (2024). Addressing data imbalance in neural network spam detection with insights from SMS spam collection. Theoretical and Natural Science, 39, 195-201.

Addressing data imbalance in neural network spam detection with insights from SMS spam collection

Siyi He *,1
  • 1 Xiamen University

* Author to whom correspondence should be addressed.

https://doi.org/10.54254/2753-8818/39/20240636

Abstract

Spam detection remains a persistent challenge in cybersecurity. Traditional methods that rely on human scrutiny or rule-based algorithms are proving inadequate against the constantly evolving tactics of spammers. Machine learning is a promising alternative: it draws on large datasets to identify patterns and traits in spam messages quickly and objectively. By uncovering subtle correlations among message elements, it improves the precision and efficacy of spam detection systems and offers a dependable, economical way to combat spam. This study investigates how different strategies for addressing data imbalance affect the performance of neural network-based spam detection. Using the SMS Spam Collection Dataset, four methods for mitigating data imbalance are evaluated against an untreated baseline. Notably, despite its inherent imbalance, the untreated baseline achieves the highest overall performance. Stratified sampling is the most effective technique for accurately identifying spam, whereas SMOTE is best at preserving legitimate messages (ham) while still filtering out spam. These results improve the understanding of how data imbalance treatments affect spam detection and offer useful guidance for future studies and real-world applications.
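As a rough illustration of the two techniques the abstract names, the sketch below shows a stratified train/test split and a SMOTE-style oversampler in plain Python. It is not the study's pipeline: the feature vectors, class sizes, function names (`stratified_split`, `smote`), and parameters (`test_fraction`, `k`, `seed`) are all assumptions made for the example, and the neural network itself is omitted. In practice a library such as scikit-learn or imbalanced-learn would more likely be used.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` inside the minority class, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy, made-up feature vectors (assumption): [message length, digit count].
ham = [[float(20 + i), float(i % 3)] for i in range(40)]
spam = [[float(120 + 2 * i), float(8 + i)] for i in range(8)]
features = ham + spam
labels = [0] * len(ham) + [1] * len(spam)  # 0 = ham, 1 = spam

train_idx, test_idx = stratified_split(labels, test_fraction=0.25)
train_ham = [features[i] for i in train_idx if labels[i] == 0]
train_spam = [features[i] for i in train_idx if labels[i] == 1]

# Oversample spam in the training part only, until both classes are the same size.
synthetic_spam = smote(train_spam, n_new=len(train_ham) - len(train_spam))
balanced_spam = train_spam + synthetic_spam
```

Oversampling is applied after the split here so that no synthetic point derived from a test message leaks into training; whether the study ordered the steps this way is not stated in the abstract.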

Keywords

Spam detection, Data imbalance, Neural network, Machine learning

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 2nd International Conference on Mathematical Physics and Computational Simulation

Conference website: https://www.confmpcs.org/
ISBN: 978-1-83558-463-7 (Print) / 978-1-83558-464-4 (Online)
Conference date: 9 August 2024
Editor: Anil Fernando
Series: Theoretical and Natural Science
Volume number: Vol. 39
ISSN: 2753-8818 (Print) / 2753-8826 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.