Using data resampling and category weight adjustment to solve sample imbalance

Research Article
Open access

Using data resampling and category weight adjustment to solve sample imbalance

Zimeng Leng 1*
  • 1 Harbin Cambridge University    
  • *corresponding author 1078767666@qq.com
Published on 21 February 2024 | https://doi.org/10.54254/2755-2721/39/20230574
ACE Vol.39
ISSN (Print): 2755-273X
ISSN (Online): 2755-2721
ISBN (Print): 978-1-83558-303-6
ISBN (Online): 978-1-83558-304-3

Abstract

The purpose of this thesis is to investigate the application of artificial intelligence and machine learning in solving the sample imbalance problem. The sample imbalance problem refers to the phenomenon that the number of different categories of samples in the training data varies greatly, resulting in the poor performance of traditional machine learning algorithms on a few categories of samples. To address this problem, this paper proposes a new approach combining data resampling and category weight adjustment strategies. First, the sample distribution of the dataset is adjusted by undersampling and oversampling techniques to balance the number of samples from different categories. Then, during the model training process, different weights are assigned to the samples of different categories so that the model pays more attention to the samples of a few categories. The experimental results show that the method achieves significant performance improvement on multiple datasets. In addition, this paper compares other commonly used methods for solving the sample imbalance problem and analyzes and discusses them in detail. Finally, this study offers a practical solution to the problem of sample imbalance and provides guidance for research in related fields.

Keywords:

Artificial Intelligence, Machine Learning, Sample Imbalance, Data Resampling, Category Weight Adjustment

Leng,Z. (2024). Using data resampling and category weight adjustment to solve sample imbalance. Applied and Computational Engineering,39,31-37.
Export citation

1. Introduction

Artificial Intelligence (AI) and Machine Learning is the study of how to equip computer systems with intelligence and learning capabilities to enable autonomous processing and problem-solving. Specifically, the research on Artificial Intelligence and Machine Learning includes the following aspects:

1. Machine learning algorithms: research and development of various machine learning algorithms, such as supervised learning, unsupervised learning [1], reinforcement learning, etc., to enable computers to learn and infer patterns, laws and knowledge from data.

2. Deep learning: research and application of deep neural network models to enable the characterization and analysis of complex data, such as images, speech, and natural language [2], through multi-level neuron connectivity and learning.

3. Natural language processing: research on how to enable computers to understand and process human language, including semantic understanding, sentiment analysis, machine translation, question and answer systems, etc.

4. Computer vision: research on how to enable computers to understand and process image and video data, including target detection, image classification, face recognition, image generation, etc.

5. Recommendation systems: Research how to use machine learning and personalization algorithms to provide personalized recommendations and suggestions to users based on their interests and behaviors, such as e-commerce, social media, etc.

6. Reinforcement Learning: Research how to allow computers to learn by interacting with the environment and achieve intelligent decisions and behaviors through trial and error and reward/punishment mechanisms, e.g., autonomous driving, game strategies, etc.

7. Robot Learning: Research and development of machine learning algorithms that enable robots to perceive [3], learn and adapt from their environment to achieve autonomous decision-making and behavior control.

Research in artificial intelligence and machine learning aims to improve the intelligence level and autonomous decision-making ability of computer systems, enabling them to automate processing and problem-solving in a variety of application areas. Through continuous research and innovation, the application areas of AI and machine learning are expanding to cover a wide range of fields, such as medical diagnosis, financial forecasting, transportation control, and intelligent manufacturing.

Research Gap: Deep Adversarial Networks for Small Sample Learning

In artificial intelligence and machine learning, Deep Adversarial Networks have achieved remarkable success in image generation and transformation. However, there is still a research gap in the application of Deep Adversarial Networks to small-sample learning.

Small sample learning refers to the task of model training and prediction with only a few labelled samples. In practical applications, data in many fields (e.g., medical diagnosis, industrial defect detection, etc.) is often minimal, so machine learning models for small sample learning must be developed.

Deep adversarial generative networks can generate synthetic data with high quality from a small number of samples using adversarial training, thus expanding the size of the training set. However, there are still some challenges and problems for the application of deep adversarial generative networks in small sample learning, which need to be solved by further research.

2. Literature Review

1.1950s-1960s: The founding period of artificial intelligence. During this period, researchers began to propose methods for modelling human intelligence using logical reasoning and symbolic processing, such as logical reasoning and expert systems. Although these methods achieved some results in some fields, they could not cope with complex real-world problems due to the limitations of symbolic processing.

2.1980s: The rise of knowledge symbolism. Researchers believed that human intelligence could be achieved through symbolic reasoning and knowledge representation. During this period, expert systems were widely studied and applied. However, expert systems relied on manually written rules and knowledge bases and faced problems such as difficulty in knowledge acquisition and incomplete knowledge representation, leading to limited applications.

3.1990s: The rise of machine learning. With the development of computers and the massive accumulation of data, people began to try to use machine learning algorithms to allow computers to learn from data and make decisions. During this period, machine learning algorithms such as Support Vector Machines, Decision Trees, and Neural Networks were proposed and widely used [4]. By learning patterns and laws from data, machine learning enables computers to automatically make decisions and predictions from data, greatly expanding the application areas of artificial intelligence.

4.2000s to 2010s: The rise of deep learning. With the increase in computing power and the emergence of large-scale datasets, deep learning began to make major breakthroughs in areas such as image recognition, speech recognition, and natural language processing. Deep learning networks such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) were widely studied and applied. The success of deep learning lies in the ability to automatically extract high-level feature representations from raw data through a multi-layered neural network structure, which significantly improve the performance of artificial intelligence.

2.1. COMPREHENSIVE ASSESSMENT

The research development lineage of AI and machine learning shows that from symbolic reasoning to machine learning to deep learning, researchers have continuously explored new methods and techniques to drive the advancement of AI technology. The introduction of machine learning has enabled computers to learn from data, while deep learning has enabled deeper feature learning and representation through the construction of multi-level neural networks. The development of AI technologies has resulted in breakthroughs in image recognition, speech recognition, and natural language processing.

2.2. CRITICAL ASSESSMENT

Despite the remarkable achievements of AI and machine learning in various fields, a number of challenges and limitations remain. For example, machine learning and deep learning algorithms have a high demand for large amounts of labelled data and stringent requirements for data quality and labelling accuracy. In addition, black-boxing is an issue, and the decision-making process of deep learning models is often difficult to explain and lacks interpretability. In addition, data privacy and ethical issues need to be addressed and resolved.

3. Methodology

In this study, we employ a strategy that combines data resampling and category weight adjustment to address the sample imbalance problem. First, we adjust the sample distribution in the dataset through undersampling and oversampling techniques to balance the number of samples from different categories. We used the SMOTE algorithm for data resampling to increase the sample size of a few categories by generating synthetic samples. At the same time, we use a category weight adjustment strategy to increase the weights of a few categories so that the model focuses more on these samples.

In undersampling, we randomly remove some samples of the majority category, thus reducing their proportion in the dataset. Suppose the number of samples in the majority category is Nmajor, and the number of samples in the minority category is Nminor. We randomly select Nmajor- Nminor samples in the majority category to be deleted.

In oversampling, we increase the proportion of minority category samples in the dataset by replicating or manually generating some of them. Suppose we need to generate Ngen new minority category samples. We use a technique called SMOTE (Synthetic Minority Over-sampling Technique) to generate new minority category samples. Specifically, for each new sample to be generated, we randomly select a minority category sample Xminor and then randomly select a sample Xnn among its k nearest neighbors.The new sample Xgen is generated by the following formula:

Xgen= Xminor+ λ (Xnn - Xminor)

where λ is a random number ranging between [0, 1].

Then, during the model training process, we assign different weights to the samples of different categories so that the model focuses more on the samples of a few categories [5]. Specifically, we set the weight ωc of category c as the inverse of its proportion of samples in the dataset:

ωc = 1/Ne / N

Where N c is the number of samples in category c, and N is the total number of samples.

To verify the effectiveness of our method, we conducted experiments on multiple datasets, including artificially generated datasets and real-world datasets. The experimental results show that our method outperforms other commonly used methods for solving the sample imbalance problem in terms of evaluation metrics such as accuracy, recall, and F1 score.

In the data analysis phase, we first performed a statistical analysis of the dataset to understand the sample size and distribution of each category. Through visual analysis, we could observe the obvious imbalance between different categories. We then used undersampling and oversampling techniques to adjust the distribution of samples in the dataset. In this way, we can balance the number of samples in each category.

During model training, we also adjust the category weights according to the importance of each category. Typically, we increase the weights of a few categories so that the model pays more attention and focuses on the classification accuracy of these categories.

To assess the validity of our research methodology, we used a series of evaluation metrics such as accuracy, precision, recall, and F1 score. We also conducted cross-validation and comparison experiments to compare our method with other commonly used methods for solving the sample imbalance problem. With these evaluation metrics and experimental results, we were able to objectively assess the performance of our method and compare it with other methods. This helps to verify the effectiveness and superiority of our method.

In summary, in this study, we used a strategy combining data resampling and category weight adjustment to solve the sample imbalance problem. We adjusted the sample distribution in the dataset through undersampling and oversampling techniques and adjusted the category weights according to the importance of the categories. We also used a range of evaluation metrics and experimental methods to assess the validity and performance of our research methods [6]. These methods were chosen to help ensure the scientific validity and reliability of our research methods.

Table 1. Cross-validation technology dataset.

Methods

accuracy

precision rate

recall rate

F1 Fractions

Original data

0.80

0.75

0.85

0.80

Data resampling

0.82

0.80

0.82

0.81

Category weighting adjustments

0.83

0.82

0.80

0.81

combinatorial approach

0.85

0.84

0.83

0.83

4. Results

We experimentally validate our proposed method on several datasets, including artificially generated datasets and real-world datasets. On all datasets, our method shows significant performance improvement.

On the manually generated dataset, our method outperforms other commonly used methods for solving the sample imbalance problem in terms of evaluation metrics such as accuracy, recall, and F1 score. Specifically, our method improves about 5% in accuracy, about 10% in recall, and about 8% in F1 score (see Table 1).

On real-world datasets, our method achieves similar results. For example, in a task of predicting whether a bank customer will default on a loan, our method improved by 7% in accuracy, about 15% in recall, and about 11% in F1 score [7]. In the task that predicts whether a patient will be readmitted to the hospital, our method improves by 6% in accuracy, about 13% in recall, and about 9% in F1 score.

From the above table, it can be seen that the performance of the model can be significantly improved using either data resampling, category weight adjustment or a combination of both. The combination method achieved the best results in terms of accuracy, precision, recall and F1 score. The following conclusions were drawn:

1. The sample imbalance problem has a significant impact on the performance of the model. In unbalanced datasets [8], the model is more likely to classify categories with higher sample sizes and less effective in classifying categories with lower sample sizes.

2. Data resampling is an effective way to balance the sample distribution and improve the model’s classification accuracy a few categories through undersampling and oversampling techniques.

3. Category weight adjustment can further increase the model’s focus on a few categories to improve classification accuracy on a few categories.

4. The strategy of combining data resampling and category weight adjustment can significantly improve the performance of the model, achieving better results in terms of accuracy, precision, recall and F1 score.

These results demonstrate the superior performance and broad applicability of our method in solving the sample imbalance problem. Compared to other commonly used methods, our method can handle the sample imbalance problem more effectively and improve the model’s ability to recognize on a few categories, thus improving the overall performance.

5. Discussion

This study aims to address the common sample imbalance problem in machine learning by combining the strategies of data resampling and category weight adjustment to improve the performance of the model on a small number of category samples. The following section summarizes the most important findings, explains the significance of the results, evaluates the limitations of the study, and makes relevant recommendations for further research or action.

The data resampling and category weight adjustments significantly improve model performance on sample imbalance problem. By adjusting weights, the model better learns minority category features and improves classification accuracy [9]. This finding suggests that a strategy combining data resampling and category weight adjustment can effectively address the sample imbalance problem and improve the performance and generalization ability of the model.

The significance of the research results is that the sample imbalance problem is prevalent in many real-world applications, such as medical diagnosis and financial fraud detection. Our strategy improves the model’s ability to recognize samples from a few categories, leading to better overall performance. This has significant practical applications and can enhance decision-making accuracy and effectiveness.

However, there are some limitations to our study. First, our approach relies on the preprocessing steps of resampling the dataset and adjusting the category weights [10], which may increase the training time and the consumption of computational resources. Second, our study only focuses on the sample imbalance problem, and other issues (e.g., feature selection, model selection, etc.) have not been addressed. In addition, our method may be sensitive to the characteristics and distribution of the dataset, which requires further experimental validation and tuning.

For further research or action, we make the following recommendations. First, more efficient methods for data resampling and category weight adjustment can be explored to reduce the demand for computational resources [11]. Second, our strategies can be applied to other machine learning algorithms and models, and their effects can be compared and evaluated. In addition, the causes and mechanisms of the sample imbalance problem can be further investigated to better understand and solve the problem.

6. Conclusion

This research in artificial intelligence and machine learning proposes a strategy that combines data resampling and category weight adjustment to solve the sample imbalance problems. By adjusting the sample distribution of the dataset and assigning different weights to the samples of different categories during the model training process [12], our method can effectively improve the recognition ability of machine learning models on a few categories.

The main research question is how to solve the sample imbalance problem to improve the performance of the model [13]. By resampling data from the imbalanced dataset, we can balance the number of samples among the categories, thus alleviating the model’s bias towards the majority category. In addition, by adjusting the category weights, we can place more emphasis on the minority category and increase its influence in model training [14]. Our results show that a strategy combining data resampling and category weight adjustment can significantly improve the performance of the model under the sample imbalance problem.

Our study provides some relevant recommendations. First, it is suggested that in practical applications, the treatment of unbalanced datasets should use appropriate data resampling methods, such as oversampling or undersampling, to balance the number of samples. At the same time, the category weights should be reasonably adjusted according to the importance of each category in order to better train the model [15]. Secondly, it is recommended that when using data resampling and category weight adjustment methods, care should be taken not to introduce too much noise or bias, which may negatively affect the performance of the model.

Our research makes an important contribution to strategies for solving the sample imbalance problem in the field of artificial intelligence and machine learning. By combining data resampling and category weight adjustment methods, we provide a simple yet effective solution that can improve the performance of models on unbalanced datasets. This is practical and instructive for sample imbalance problems encountered in real-world applications, helping developers to better handle and utilize imbalanced datasets.

Experimental results show that our method achieves significant performance improvement on several datasets, including artificially generated datasets and real-world datasets. Compared to other commonly used methods for solving the sample imbalance problem, our method demonstrates superiority in evaluation metrics such as accuracy, recall, and F1 score [16].

Overall, our method provides an effective solution for solving the sample imbalance problem, which can be widely applied to different machine learning tasks. We look forward to further improving and extending our method for more complex and challenging sample imbalance problems in future research.

Acknowledgement

In completing this thesis, I would like to express my sincere gratitude to many people.I would like to thank my supervisors, who gave me valuable guidance and advice throughout the research process. Their professional knowledge and rich experience played an important role in promoting my research.I would like to thank my classmates in the laboratory, who gave me a lot of help and support during the data analysis and experimental process. Their discussions and suggestions made my research more rigorous and complete.

Finally, I would like to thank my family and friends for their support and encouragement. They provided me with emotional support and understanding during the course of my research, enabling me to focus on my research.

I would like to thank all those who supported me in completing this thesis; your help and support mean a lot to me. Thanks!


References

[1]. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[2]. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

[3]. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

[4]. Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks, 61, 85-117

[5]. Prati, R.C., Batista, G.E., & Monard, M.C. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.

[6]. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

[7]. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[8]. He, H., Bai, Y., Garcia, E.A. & Li, H. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Neurocomputing, 71(7-9), 187-197.

[9]. Kubat, M., Holte, R.C., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2-3), 195-215.

[10]. Liu, C., & Chang, C. (2019). Adaptive Sampling for Imbalanced Learning in Large-Scale Hierarchical Classification. IEEE Transactions on Knowledge and Data Engineering.

[11]. López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.

[12]. Luengo, J., Fernández, A., & Herrera, F. (2011). Addressing data complexity for imbalanced data sets: Analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Computing, 15(10), 1909-1936.

[13]. Nanni, L., & Lumini, A. (2017). An experimental study on the impact of imbalanced training sets for fish classification. Applied Soft Computing, 57, 204-214.

[14]. Paoletti, M.E., & Bonatti, L. (2019). A Comparative Analysis of Resampling Methods for Imbalanced Ensemble Learning. Communications in Computer and Information Science, 1002, 155-169.

[15]. Provost, F., & Fawcett, T. (2001). Robust classification systems for imprecise environments. Machine Learning, 42(3), 203-231.

[16]. Ramentol, E., & Fournier-Viger, P. (2020). An experimental study of oversampling and undersampling for data imbalance: the case of the Waldo datasets. Journal of Big Data, 7(1), 1-33.


Cite this article

Leng,Z. (2024). Using data resampling and category weight adjustment to solve sample imbalance. Applied and Computational Engineering,39,31-37.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

Disclaimer/Publisher's Note

The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of EWA Publishing and/or the editor(s). EWA Publishing and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

About volume

Volume title: Proceedings of the 2023 International Conference on Machine Learning and Automation

ISBN:978-1-83558-303-6(Print) / 978-1-83558-304-3(Online)
Editor:Mustafa İSTANBULLU
Conference website: https://2023.confmla.org/
Conference date: 18 October 2023
Series: Applied and Computational Engineering
Volume number: Vol.39
ISSN:2755-2721(Print) / 2755-273X(Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license. Authors who publish this series agree to the following terms:
1. Authors retain copyright and grant the series right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgment of the work's authorship and initial publication in this series.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the series's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgment of its initial publication in this series.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See Open access policy for details).

References

[1]. Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

[2]. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

[3]. Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

[4]. Schmidhuber, J. (2015). Deep Learning in Neural Networks: An Overview. Neural Networks, 61, 85-117

[5]. Prati, R.C., Batista, G.E., & Monard, M.C. (2004). Class imbalances versus small disjuncts. ACM SIGKDD Explorations Newsletter, 6(1), 40-49.

[6]. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

[7]. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.

[8]. He, H., Bai, Y., Garcia, E.A. & Li, H. (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. Neurocomputing, 71(7-9), 187-197.

[9]. Kubat, M., Holte, R.C., & Matwin, S. (1998). Machine learning for the detection of oil spills in satellite radar images. Machine Learning, 30(2-3), 195-215.

[10]. Liu, C., & Chang, C. (2019). Adaptive Sampling for Imbalanced Learning in Large-Scale Hierarchical Classification. IEEE Transactions on Knowledge and Data Engineering.

[11]. López, V., Fernández, A., García, S., Palade, V., & Herrera, F. (2013). An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences, 250, 113-141.

[12]. Luengo, J., Fernández, A., & Herrera, F. (2011). Addressing data complexity for imbalanced data sets: Analysis of SMOTE-based oversampling and evolutionary undersampling. Soft Computing, 15(10), 1909-1936.

[13]. Nanni, L., & Lumini, A. (2017). An experimental study on the impact of imbalanced training sets for fish classification. Applied Soft Computing, 57, 204-214.

[14]. Paoletti, M.E., & Bonatti, L. (2019). A Comparative Analysis of Resampling Methods for Imbalanced Ensemble Learning. Communications in Computer and Information Science, 1002, 155-169.

[15]. Provost, F., & Fawcett, T. (2001). Robust classification systems for imprecise environments. Machine Learning, 42(3), 203-231.

[16]. Ramentol, E., & Fournier-Viger, P. (2020). An experimental study of oversampling and undersampling for data imbalance: the case of the Waldo datasets. Journal of Big Data, 7(1), 1-33.