1. Introduction
Credit assessment originated in the United States in the early 1900s. As early as 1902, John Moody, the founder of Moody's, began to rate railroad securities, using empirical methods to classify their credit ratings [1]. Credit rating methods based on historical experience were subsequently adopted widely across the financial industry and became an important way to predict the default risk of large financial entities.
Since the last century, three basic types of models have been used in industry and academia for personal credit assessment: the expert scoring model (ES-Model) [2], the statistical model (S-Model) [3], and the artificial intelligence model (AI-Model) [4].
Owing to their advantages in data-processing efficiency and accuracy, machine learning methods have played an increasingly important role in the credit assessment industry, shifting the field from traditional experience-driven approaches to data-driven ones. In practice, as data volumes and application requirements continue to grow [5], a single machine learning method can no longer meet the demands of engineering problems. Fusion algorithms that combine multiple machine learning techniques with feature engineering have therefore become a new research focus. These fusion models, which draw on the strengths of their components, provide new ideas for solving credit assessment problems.
Communication operators have unique advantages in assessing the credit of individual users. Ubiquitous mobile payment lets operators trace the flow of users' wealth, widely deployed telecommunications base stations let them track users' movements, and the personal information continuously generated on the mobile Internet allows them to observe users from almost every angle. Compared with banks, which only capture users' financial behavior, operators have an advantage in data volume [6], but they still face difficulties in making rational use of that data.
Research on operator user credit assessment with the LightGBM algorithm should incorporate recent findings and methodologies. Zhu et al. emphasize enhancing credit default prediction by incorporating various ensemble methods to improve accuracy and performance [7]. In another study, Zhu et al. present a method that combines neural networks with the Synthetic Minority Over-sampling Technique (SMOTE), effectively improving credit card fraud detection [8]. These approaches offer valuable insights and applicable methodologies for operator user credit assessment.
In practice, because scenarios and data differ, a single model cannot achieve ideal results in all settings. Model fusion theory was therefore proposed: it aims to integrate the advantages of different models through ensemble learning [9] to form an ensemble model with lower variance, lower bias, and better overall performance.
2. Related Work
2.1. Random forest
Random forest is an ensemble algorithm that uses a collection of CART decision trees for classification and regression [10]. Because each tree is trained on randomly drawn samples, the overfitting typical of a single decision tree is greatly reduced. Random forest uses the Bagging homogeneous ensemble method to reduce variance while maintaining low bias.
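As an illustration, the following is a minimal sketch of a random forest regressor trained on synthetic data with scikit-learn; the dataset and parameter values are assumptions, not the setup used in this paper.

```python
# Minimal sketch of a bagged tree ensemble (random forest) for score regression.
# The synthetic data and parameter values are illustrative assumptions.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=5000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Each tree sees a bootstrap sample of the rows; averaging the trees reduces variance.
rf = RandomForestRegressor(n_estimators=300, max_depth=None, n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, rf.predict(X_test)))
```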
2.2. GBDT algorithm
Like random forest, GBDT is a tree-based ensemble algorithm [11]. The difference is that it uses Boosting: a sequence of decision trees is trained in order, and each new tree fits the residual of the current model so as to reduce its error. Because GBDT is trained sequentially, it is generally considered slower and less scalable than random forest, which can train its trees in parallel; on the other hand, GBDT usually uses shallower trees than random forest, so each individual tree trains faster. Increasing the number of trees in GBDT increases the risk of overfitting (GBDT uses more trees to reduce bias), whereas increasing the number of trees in random forest reduces it (random forest uses more trees to reduce variance).
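To make the residual-fitting idea concrete, the sketch below implements a bare-bones boosting loop with shallow scikit-learn decision trees; the learning rate, tree depth, and synthetic data are illustrative assumptions rather than the GBDT implementation referenced here.

```python
# Minimal sketch of the boosting idea: each tree fits the residual of the
# current ensemble's predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)

learning_rate, n_trees = 0.1, 100
pred = np.full_like(y, y.mean(), dtype=float)  # start from the mean prediction
trees = []
for _ in range(n_trees):
    residual = y - pred                         # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, residual)
    pred += learning_rate * tree.predict(X)     # shrink each tree's contribution
    trees.append(tree)

print("training MSE:", float(np.mean((y - pred) ** 2)))
```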
2.3. XGBoost algorithm
XGBoost (eXtreme Gradient Boosting) is one of the best GBDT implementations available today [12]. Its parallel tree construction makes model building much faster than in other tree-based ensemble algorithms. In 2015, 17 of the 29 winning Kaggle solutions used XGBoost, as did all of the top 10 solutions in the 2015 KDD Cup. XGBoost follows the general principle of gradient boosting, combining weak learners into a strong learner. Although boosted trees are constructed sequentially, learning gradually from the data to improve predictions in later iterations, XGBoost parallelizes the work at the feature level. Because most of the time in decision tree learning is spent finding the best split for each feature, XGBoost pre-sorts the data (the Pre-Sorted Method), stores the sorted data in block units, and reuses these blocks in subsequent iterations. These blocks allow feature-gain calculations to run on multiple CPU threads in parallel. XGBoost is also cache-aware, reduces overfitting through model-complexity control and built-in regularization, handles sparse data efficiently, and supports out-of-core computation that uses disk space (rather than only memory) for large datasets, thereby maximizing system computing capacity and producing better predictive performance.
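For illustration, the sketch below trains an XGBoost regressor with its regularization and subsampling knobs exposed; the synthetic data and parameter values are assumptions, not this paper's configuration.

```python
# Minimal sketch of an XGBoost regressor with explicit regularization settings.
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    reg_lambda=1.0,   # built-in L2 regularization on leaf weights
    subsample=0.8,    # row subsampling to limit model complexity
    n_jobs=-1,
    random_state=0,
)
model.fit(X_train, y_train)
print("test R2:", model.score(X_test, y_test))
```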
2.4. LightGBM algorithm
LightGBM is a lightweight gradient boosting framework similar to XGBoost. It was released on October 17, 2016 as part of Microsoft's Distributed Machine Learning Toolkit (DMTK) [13]. It is designed to be fast and distributed, offering faster training, lower memory usage, GPU and parallel learning support, and the ability to process large datasets. Benchmarks and experiments on several public datasets have shown LightGBM to be faster and more accurate than XGBoost. LightGBM has several advantages over XGBoost. It uses histograms to bucket continuous features into discrete bins, which gives it several performance advantages over XGBoost's default pre-sort-based tree learning: lower memory usage, a lower cost of computing the gain of each split, and lower communication cost in parallel learning. LightGBM also computes a node's histogram by subtracting the histogram of its sibling from that of its parent, so histograms can be reused (only one child histogram needs to be built per split), which yields further speedups. Existing benchmarks show that LightGBM is 11 to 15 times faster than XGBoost on some tasks. In addition, LightGBM grows trees leaf-wise (Leaf Wise), which usually converges faster, reaches lower loss, and is generally more accurate than XGBoost's level-wise growth. As an engineering implementation of GBDT, LightGBM also offers a high degree of freedom in parameter setting, which helps in building better-performing models. The parameters listed in Table 1 can usually be set; a hedged example configuration follows the table.
Table 1: LightGBM parameter table
| Parameter | Description |
| Max Depth | Set this parameter to prevent the tree from growing too deep, thereby reducing the risk of overfitting. |
| Num Leaves | Controls the complexity of the tree model. Setting it to a larger value can improve accuracy but may increase the risk of overfitting. |
| Min Data in Leaf | Setting this parameter to a larger value can prevent the tree from growing too deep. |
| Max Bin | Controls the number of discrete bins. Smaller values can control overfitting and speed up training, while larger values can improve accuracy. |
| Feature Fraction | Enables random feature subsampling. Reasonably setting this can speed up training and prevent overfitting. |
| Bagging Fraction | Specifies the fraction of data to be used for each iteration. Reasonably setting this can speed up training and prevent overfitting. |
| Num Iteration | Sets the number of boosting iterations. Setting this parameter affects the training speed. |
| Objective | Set this parameter to specify the type of task the model attempts to perform. |
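As an illustration of these settings, the sketch below configures the Table 1 parameters through LightGBM's scikit-learn interface; the synthetic data and all parameter values are assumptions, not the tuned configuration used later in this paper.

```python
# Hedged example of setting the Table 1 parameters via LightGBM's scikit-learn API.
import lightgbm as lgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=5000, n_features=20, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = lgb.LGBMRegressor(
    objective="regression",   # Objective: regression on the credit score
    n_estimators=500,         # Num Iteration
    max_depth=7,              # Max Depth
    num_leaves=63,            # Num Leaves
    min_child_samples=20,     # Min Data in Leaf
    max_bin=255,              # Max Bin
    colsample_bytree=0.8,     # Feature Fraction
    subsample=0.8,            # Bagging Fraction
    subsample_freq=1,         # perform bagging at every iteration
    random_state=0,
)
model.fit(X_train, y_train)
print("test R2:", model.score(X_test, y_test))
```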
3. Data Processing and Feature Selection
The operator user credit assessment analyzes user behavior data and historical records on the operator side to assess user credit status. Feature engineering is a key link in this process, which aims to mine and extract valuable features so as to more accurately portray user credit status.
Before embarking on model building, it is essential to develop a deep understanding of the basic attributes and characteristics of the data, because this lays a solid foundation for the subsequent data preprocessing, feature engineering, and modeling work. By analyzing the distribution, outliers, missing values, and correlations of the data, more efficient data processing strategies can be formulated. This helps to accurately identify and extract the key features that have a significant impact on model performance, and thus to build a more reliable and better-performing credit assessment model. The operator user data fields are shown in Table 2, and a brief data-inspection sketch follows the table.
Table 2: Feature field description
| Number | Feature | Feature Description |
| 1 | id | User ID |
| 2 | age | User Age |
| 3 | net\_age\_till\_now | User's Network Age (months) |
| 4 | top\_up\_month\_diff | Months Since Last Top-Up |
| 5 | top\_up\_amount | Amount of Last Top-Up (CNY) |
| 6 | recent\_6month\_avg\_use | Avg Spending on Calls (Last 6 Months, CNY) |
| 7 | total\_account\_fee | Total Bill for the Current Month (CNY) |
| 8 | curr\_month\_balance | Current Month Account Balance (CNY) |
| 9 | connect\_num | Number of Contacts This Month |
| 10 | recent\_3month\_shopping\_count | Avg Monthly Shopping Visits (Last 3 Months) |
| 11 | online\_shopping\_count | Online Shopping App Uses This Month |
| 12 | express\_count | Express Delivery App Uses This Month |
| 13 | finance\_app\_count | Financial Management App Uses This Month |
| 14 | video\_app\_count | Video Streaming App Uses This Month |
| 15 | flight\_count | Air Travel App Uses This Month |
| 16 | train\_count | Train App Uses This Month |
| 17 | tour\_app\_count | Travel Info App Uses This Month |
| 18 | cost\_sensitivity | Sensitivity to Phone Bill Costs |
| 19 | score | Credit Score (Prediction Target) |
| 20 | true\_name\_flag | Passed Real-Name Verification |
| 21 | uni\_student\_flag | University Student Status |
| 22 | blk\_list\_flag | Blacklist Status |
| 23 | 4g\_unhealth\_flag | 4G Unhealthy Customer Status |
| 24 | curr\_overdue\_flag | Current Overdue Payment Status |
| 25 | freq\_shopping\_flag | Frequent Shopping Mall Visitor |
| 26 | wanda\_flag | Visited Fuzhou Cangshan Wanda This Month |
| 27 | sam\_flag | Visited Fuzhou Sam's Club This Month |
| 28 | movie\_flag | Watched a Movie This Month |
| 29 | tour\_flag | Visited Tourist Attractions This Month |
| 30 | sport\_flag | Consumed at Sports Venues This Month |
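As a reference, the following sketch shows the kind of quick inspection described above, carried out with pandas; the CSV path is a hypothetical placeholder and the column names follow Table 2.

```python
# Quick data inspection sketch: distributions, missing values, and correlation
# with the target. The CSV path is a hypothetical placeholder.
import pandas as pd

df = pd.read_csv("operator_user_credit.csv")

print(df[["age", "net_age_till_now", "top_up_amount", "score"]].describe())
print(df.isna().mean().sort_values(ascending=False).head(10))   # missing-value ratios
print(df.corr(numeric_only=True)["score"].sort_values(ascending=False).head(10))
```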
Feature selection is the process of selecting a subset of features from the original feature set. An ideal feature subset should have as low a dimensionality as possible while remaining as statistically meaningful as possible. To improve the statistical meaning of the feature subsets and reduce the dimensionality of the feature set, the feature set was manually divided into four subsets, which are also the four dimensions of the operator user portrait in this paper.
Feature segmentation does not have to be performed manually. Random sampling methods such as Bagging may achieve better results when building the model, but purely random methods ignore the structure already labeled in the data, and the resulting base models lose their statistical meaning. To better observe the correlation between the different types of features and the credit score, manual segmentation is chosen here; a sketch of this split follows Table 3.
Table 3: Feature subsets
| Number | Feature subset |
| 1 | Consumer Capacity |
| 2 | Location Trajectory |
| 3 | Application Behavior Preference |
| 4 | Other |
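The following is a minimal sketch of this manual split; the CSV path is a hypothetical placeholder and the exact column groupings are assumptions inferred from the field descriptions in Table 2.

```python
# Sketch of the manual split into the four feature subsets of Table 3.
import pandas as pd

df = pd.read_csv("operator_user_credit.csv")  # hypothetical file name

subset_columns = {
    "consumer_capacity": [
        "top_up_month_diff", "top_up_amount", "recent_6month_avg_use",
        "total_account_fee", "curr_month_balance", "cost_sensitivity",
    ],
    "location_trajectory": [
        "freq_shopping_flag", "wanda_flag", "sam_flag",
        "movie_flag", "tour_flag", "sport_flag",
    ],
    "app_behavior": [
        "online_shopping_count", "express_count", "finance_app_count",
        "video_app_count", "flight_count", "train_count", "tour_app_count",
    ],
    "other": [
        "age", "net_age_till_now", "connect_num", "true_name_flag",
        "uni_student_flag", "blk_list_flag", "4g_unhealth_flag", "curr_overdue_flag",
    ],
}

# Each subset keeps the prediction target so a base model can be trained on it alone.
subsets = {name: df[cols + ["score"]] for name, cols in subset_columns.items()}
```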
4. Results and Analysis
The following compares the simulation results of the machine learning models. The main evaluation indicators are MAE (mean absolute error), MAPE (mean absolute percentage error), MSE (mean squared error), RMSE (root mean squared error), and \( R^2 \) (coefficient of determination).
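For reference, the sketch below shows how these indicators can be computed with scikit-learn (assuming a recent version that provides mean_absolute_percentage_error); the y_true and y_pred arrays are placeholders, not results from this paper.

```python
# Sketch of the evaluation metrics used in Tables 4-8.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_absolute_percentage_error,
    mean_squared_error, r2_score,
)

def report(y_true, y_pred):
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "R2": r2_score(y_true, y_pred),
    }

print(report(np.array([600, 650, 700]), np.array([610, 640, 690])))
```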
4.1. Consumer Capacity
The consumer capacity dataset is mainly used to measure the user's consumption-related characteristics in the mobile communication business. The base models for dataset 1 are constructed using the linear regression, decision tree, random forest, and LightGBM algorithms, respectively. The experimental results are shown in Table 4 and Figure 1.
Table 4: Comparison of models of consumer capacity
| Dataset | Method | MAE | MAPE | MSE | RMSE | \( R^2 \) |
| Consumer Capacity | Linear Regression (LR) | 28.0529 | 0.4692 | 1306.8418 | 36.1503 | 0.2793 |
| Consumer Capacity | Decision Tree (DT) | 24.0616 | 0.4013 | 1000.3104 | 31.6277 | 0.4483 |
| Consumer Capacity | Random Forest (RF) | 23.7165 | 0.3957 | 953.7934 | 30.8835 | 0.4740 |
| Consumer Capacity | LightGBM (LGB) | 22.1867 | 0.3701 | 935.7114 | 30.5894 | 0.6534 |
In the model prediction of consumer capacity, LightGBM's indicators are far better than those of the LR, DT, and RF base models, confirming its good accuracy.
4.2. Location Trajectory
The location trajectory dataset is mainly used to measure the user's daily activity location trajectory related characteristics. The basic models of dataset 2 are constructed using linear regression, decision tree, random forest and LightGBM algorithms, respectively. The experimental results are shown in Table 5 and Figure 2.
Table 5: Comparison of models of location trajectory
| Dataset | Method | MAE | MAPE | MSE | RMSE | \( R^2 \) |
| Location Trajectory | Linear Regression (LR) | 31.6287 | 0.5296 | 1619.6993 | 40.2455 | 0.1067 |
| Location Trajectory | Decision Tree (DT) | 31.0083 | 0.5191 | 1570.8188 | 39.6336 | 0.1337 |
| Location Trajectory | Random Forest (RF) | 30.9696 | 0.5186 | 1565.0511 | 39.5607 | 0.1369 |
| Location Trajectory | LightGBM (LGB) | 30.7300 | 0.5144 | 1547.7564 | 39.3415 | 0.1464 |
In the model prediction of location trajectory, the performance of all four algorithms is unsatisfactory, which may be due to the low correlation between location trajectory and user credit, but LightGBM's indicators are still slightly better than those of the LR, DT, and RF base models.
4.3. Application Behavior Preference
Application preference is mainly used to reflect the user's application usage, and the basic models of dataset 3 are constructed using linear regression, decision tree, random forest and LightGBM algorithms, respectively. The experimental results are shown in Table 6 and Figure 3.
Table 6: Comparison of models of application behavior preference
| Dataset | Method | MAE | MAPE | MSE | RMSE | \( R^2 \) |
| Application Behavior Preference | Linear Regression (LR) | 33.3207 | 0.5579 | 1776.5497 | 42.1491 | 0.0202 |
| Application Behavior Preference | Decision Tree (DT) | 28.9453 | 0.4833 | 1400.0899 | 37.4178 | 0.2279 |
| Application Behavior Preference | Random Forest (RF) | 28.6198 | 0.4779 | 1353.8066 | 36.7941 | 0.2534 |
| Application Behavior Preference | LightGBM (LGB) | 23.1702 | 0.3881 | 973.7953 | 31.2057 | 0.4630 |
In the model prediction of application behavior preference, the base model built with linear regression performs very poorly, indicating that this dataset is highly nonlinear, while LightGBM's indicators are far better than those of the LR, DT, and RF base models, demonstrating its ability to handle nonlinear data.
4.4. Other
The Other dataset contains a large number of features produced by the feature construction and feature extraction processes, which are highly correlated with each other. The base models for dataset 4 (Other) are constructed using the linear regression, decision tree, random forest, and LightGBM algorithms, respectively. The experimental results are shown in Table 7 and Figure 4.
Table 7: Comparison of basic models of others
| Dataset | Method | MAE | MAPE | MSE | RMSE | \( R^2 \) |
| Other | Linear Regression (LR) | 24.6979 | 0.4126 | 1015.5399 | 31.8675 | 0.4399 |
| Other | Decision Tree (DT) | 23.7093 | 0.3926 | 965.4682 | 31.0720 | 0.4671 |
| Other | Random Forest (RF) | 22.2056 | 0.3786 | 942.3304 | 30.6974 | 0.6436 |
| Other | LightGBM (LGB) | 17.5637 | 0.2911 | 539.5605 | 23.2284 | 0.7024 |
In the model prediction on the Other dataset, LightGBM again leads on every indicator, reflecting the effect of LightGBM's Exclusive Feature Bundling (EFB) algorithm on highly correlated features. This chapter compared LightGBM, random forest, decision tree, and linear regression for constructing credit assessment base models, evaluated their performance across the four datasets, and selected the best base models for the fusion model established in the next subsection.
4.5. Constructing a LightGBM fusion model
The fusion model is an ensemble learning model that integrates multiple base models; by drawing on the strengths of the different base models, it achieves superior performance. The preceding subsections constructed credit evaluation base models for consumer capacity, location trajectory, and application behavior preference, and this series of models serves as the basis for the fusion model.

The fusion model is built as follows. After feature engineering, the dataset is split into four subsets: consumer capacity, location trajectory, application behavior preference, and other. LightGBM is applied to each subset, yielding four corresponding base models for user credit evaluation. To obtain a more refined credit assessment fusion model, three representative ensemble algorithms are introduced as secondary learners: Voting, Blending, and Stacking. For each of these algorithms, a meta-model (secondary learner) is trained on the outputs of the four base models to form the final credit evaluation model, and the resulting models are compared. This further enriches the research results in credit assessment and provides operators with a more comprehensive and precise basis for credit evaluation. The experimental results are shown in Table 8.
Table 8: Comparison of ensemble model of full dataset
| Dataset | Method | MAE | MAPE | MSE | RMSE | \( R^2 \) |
| Dataset 4 Only | LightGBM | 17.5637 | 0.2911 | 539.5605 | 23.2284 | 0.7024 |
| Full Dataset | LightGBM | 17.4223 | 0.2889 | 522.6124 | 22.8607 | 0.7118 |
| Full Dataset | LightGBM + Voting | 17.0166 | 0.2828 | 514.3879 | 22.6801 | 0.7163 |
| Full Dataset | LightGBM + Blending | 14.3725 | 0.2381 | 340.6276 | 18.4561 | 0.8157 |
| Full Dataset | LightGBM + Stacking | 13.1022 | 0.2166 | 311.8294 | 17.6587 | 0.8280 |
The experimental results show that models built with ensemble learning techniques generally outperform those built without them. Even the simplest LightGBM-Voting ensemble outperforms the global LightGBM model. Among the ensemble methods, Voting lags significantly behind Stacking and Blending, indicating that training a secondary learner that learns weights over the base models is a crucial part of the ensemble; although it takes more modeling time, Stacking is slightly better than Blending on every metric. The LightGBM-Stacking ensemble is therefore considered the most advantageous for credit evaluation models; a hedged sketch of such a stacking fusion is given below.
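As an illustration of this fusion scheme, the following sketch stacks four per-subset LightGBM base models under a simple meta-learner; the CSV path, column groupings, fold count, and the choice of Ridge as the secondary learner are assumptions, not the exact pipeline used in this paper.

```python
# Hedged sketch of a Stacking fusion: four LightGBM base models, one per feature
# subset, and a meta-learner trained on their out-of-fold predictions.
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.metrics import r2_score

df = pd.read_csv("operator_user_credit.csv")          # hypothetical file name
subsets = {                                            # illustrative column groups
    "consumer_capacity": ["top_up_amount", "recent_6month_avg_use", "total_account_fee"],
    "location_trajectory": ["wanda_flag", "sam_flag", "tour_flag", "sport_flag"],
    "app_behavior": ["online_shopping_count", "finance_app_count", "video_app_count"],
    "other": ["age", "net_age_till_now", "blk_list_flag", "curr_overdue_flag"],
}
y = df["score"]
train_idx, test_idx = train_test_split(df.index, test_size=0.2, random_state=0)

oof_train, preds_test = [], []
for name, cols in subsets.items():
    base = lgb.LGBMRegressor(n_estimators=500, num_leaves=63, random_state=0)
    # Out-of-fold predictions keep the meta-learner from overfitting to the base models.
    oof = cross_val_predict(base, df.loc[train_idx, cols], y.loc[train_idx], cv=5)
    oof_train.append(oof)
    base.fit(df.loc[train_idx, cols], y.loc[train_idx])
    preds_test.append(base.predict(df.loc[test_idx, cols]))

meta = Ridge(alpha=1.0).fit(np.column_stack(oof_train), y.loc[train_idx])  # secondary learner
pred = meta.predict(np.column_stack(preds_test))
print("stacked R2:", r2_score(y.loc[test_idx], pred))
```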
5. Conclusion
5.1. Summary
The construction of a credit assessment model for the operator's users is an important direction for the operator's informatization and digital transformation, and an important way for the operator to secure expected benefits. This paper uses feature engineering, multiple machine learning techniques, and multiple ensemble learning techniques to build a fused model for operator user credit assessment and conducts several comparisons, covering the following three aspects:
- (1) Application research of feature engineering: To solve the feature extraction problem posed by desensitized data on individual users' mobile Internet usage, this paper uses feature engineering to build features from the usable parts of the data and divides the original dataset into multiple subsets. This retains the statistical meaning of the data, improves its usability, and significantly improves modeling time and accuracy compared with the traditional scheme.
- (2) Research on supervised learning modeling: At present, most domestic analysis and research on credit assessment still adopts first-generation supervised learning models that are not combined with ensemble learning. To address the problem of modeling massive data features, this paper builds multiple models on different types of features and selects the best ones, yielding better accuracy and a model that better matches the actual situation and offers greater reference value.
- (3) Research on different ensemble algorithms: Currently, most traditional applications that do not segment and integrate multi-dimensional datasets adopt only one type of ensemble method, or none at all, which can limit model performance on complex data. To address the challenge of integrating massive data features, this paper builds on the Boosting ensemble inside the LightGBM algorithm and further applies Stacking/Blending integration from model fusion theory to establish the final model. Comparison with the Averaging/Voting ensemble and with a global LightGBM model that uses only Boosting shows that the established fusion model performs better.
References
[1]. Meindert Fennema. International networks of banks and industry, volume 2. Springer Science & Business Media, 2012.
[2]. Işık Biçer, Deniz Sevis, and Taner Bilgiç. Bayesian credit scoring model with integration of expert knowledge and customer data. In 24th Mini EURO Conference "Continuous Optimization and Information Technologies in the Financial Sector" (MEC EurOPT 2010), pages 324–329, 2010.
[3]. Sadanori Konishi. Statistical model evaluation and information criteria. In Multivariate Analysis, Design of Experiments, and Survey Sampling, pages 393–424. CRC Press, 1999.
[4]. Murugesan Punniyamoorthy and P Sridevi. Identification of a standard ai based technique for credit risk analysis. Benchmarking: An International Journal, 23(5):1381–1390, 2016.
[5]. Erik Hofmann. Big data and supply chain decisions: the impact of volume, variety and velocity properties on the bullwhip effect. International Journal of Production Research, 55(17):5108–5126, 2017.
[6]. Ning Chen, Bernardete Ribeiro, and An Chen. Financial credit risk assessment: a recent review. Artificial Intelligence Review, 45:1–23, 2016.
[7]. Mengran Zhu, Ye Zhang, Yulu Gong, Kaijuan Xing, Xu Yan, and Jintong Song. Ensemble methodology: Innovations in credit default prediction using lightgbm, xgboost, and localensemble. arXiv preprint arXiv:2402.17979, 2024.
[8]. Mengran Zhu, Ye Zhang, Yulu Gong, Changxin Xu, and Yafei Xiang. Enhancing credit card fraud detection: A neural network and smote integrated approach. Journal of Theory and Practice of Engineering Science, 4(02):23–30, 2024.
[9]. Omer Sagi and Lior Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4):e1249, 2018.
[10]. Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
[11]. Si Si, Huan Zhang, S Sathiya Keerthi, Dhruv Mahajan, Inderjit S Dhillon, and Cho-Jui Hsieh. Gradient boosted decision trees for high dimensional sparse output. In International conference on machine learning, pages 3182–3190. PMLR, 2017.
[12]. Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, pages 785–794, 2016.
[13]. Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.