
Multi-Label Sampling based on Label Imbalance Rate and Neighborhood Distribution
1 University of Reading
* Author to whom correspondence should be addressed.
Abstract
Existing multi-label classification algorithms often assume that the label distribution in the training set is balanced, but practical datasets frequently exhibit significant label imbalance, which degrades the learning and generalization performance of classifiers. To address label imbalance in multi-label classification, this paper proposes a new synthetic oversampling algorithm, Multi-Label Synthetic Oversampling based on Label Imbalance Rate and Neighborhood Distribution (MLSIN). The algorithm synthesizes new samples by considering both the imbalance rate of labels and the distribution of samples in their neighborhood, aiming to improve classifier performance on minority labels. The paper first introduces the evaluation metrics for multi-label classification. It then defines and computes the degree of label imbalance, describes the calculation of imbalance weights, proposes a sample type correction penalty strategy, and details how the algorithm selects base and auxiliary samples. Finally, it validates the proposed method on public datasets and summarizes the experimental results.
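The two ingredients named in the abstract, a per-label imbalance rate and synthesis from a base and an auxiliary sample, can be illustrated with a minimal sketch. It is not the MLSIN algorithm itself. It assumes the imbalance rate is the widely used IRLbl measure (count of the most frequent label divided by the count of the given label) and that a synthetic sample is a SMOTE-style linear interpolation between the two samples. The imbalance weights, the neighborhood analysis, and the sample type correction penalty are not specified in the abstract and are omitted. All function names are hypothetical.

```python
import random
from typing import List, Sequence


def label_imbalance_ratios(Y: Sequence[Sequence[int]]) -> List[float]:
    """Assumed measure (IRLbl): max label count divided by each label's count."""
    n_labels = len(Y[0])
    counts = [sum(row[j] for row in Y) for j in range(n_labels)]
    max_count = max(counts)
    # A label that never occurs is given an infinite ratio.
    return [max_count / c if c > 0 else float("inf") for c in counts]


def mean_imbalance_ratio(ratios: Sequence[float]) -> float:
    """Mean of the finite per-label ratios (often called MeanIR)."""
    finite = [r for r in ratios if r != float("inf")]
    return sum(finite) / len(finite)


def interpolate(base: Sequence[float], auxiliary: Sequence[float],
                rng: random.Random) -> List[float]:
    """SMOTE-style synthesis: a random point on the segment base -> auxiliary."""
    gap = rng.random()  # uniform in [0, 1)
    return [b + gap * (a - b) for b, a in zip(base, auxiliary)]


# Toy label matrix: 4 samples, 3 labels, with label counts 4, 2 and 1.
Y = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
ratios = label_imbalance_ratios(Y)
mean_ir = mean_imbalance_ratio(ratios)
# Labels whose ratio exceeds the mean are treated as minority labels here.
minority_labels = [j for j, r in enumerate(ratios) if r > mean_ir]

# Synthesize one feature vector between a base and an auxiliary sample.
rng = random.Random(0)
synthetic = interpolate([0.0, 0.0], [1.0, 2.0], rng)
```

In this toy example only the third label exceeds the mean ratio, so only samples carrying it would be candidates for oversampling. How the label set of the synthetic sample is assigned is a separate design choice that this sketch leaves open.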
Keywords
Multi-label classification, Class imbalance, Imbalance rate, Heuristic sampling
Cite this article
Zhang, Z. (2024). Multi-Label Sampling based on Label Imbalance Rate and Neighborhood Distribution. Applied and Computational Engineering, 57, 104-111.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.
About volume
Volume title: Proceedings of the 6th International Conference on Computing and Data Science
© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.