Predict credit risk with XGBoost

Wenhao Wang; Xiyi Zuo; Dantong Han

doi:10.54254/2755-2721/74/20240462

1.Introduction

With the advent of the Internet era, credit loans have brought great convenience to banks in major financial systems, but they also carry unavoidable risks. A sharp increase in problem loans over a short period was at the heart of past crises in most financial institutions [1]. Since the global economic crisis of 2007 to 2008, most banks in the financial industry have focused on credit risk [2]. Financial companies have begun to assess risk more accurately in order to use resources more efficiently [3], and the financial sector relies on algorithms and risk-based financial models to forecast and limit credit risk. In most consumer lending, model output is used to decide whether to grant a loan [1]. This article uses machine learning to predict, from the characteristics of a borrower, whether that borrower poses a credit loan risk to the bank. We address this problem by building an XGBoost model. The model achieves an area under the receiver operating characteristic (ROC) curve of 0.87 and a Kolmogorov-Smirnov (KS) statistic of 0.56, which leaves room for improvement. The rest of the paper is organized as follows. Section 2 reviews the literature. Section 3 describes the methodology. Section 4 presents the experimental data and its analysis. Sections 5 and 6 cover the application and testing of the model, and the last section summarizes the paper.

2.Literature Review

Several scholars have used machine learning to predict the risk of default on credit loans. Ali et al. predict the risk of late repayment on a credit loan by applying a variety of machine learning techniques, including variations of artificial neural networks, ensemble classifiers, and decision trees [1]. They find that the Artificial Neural Network-Multilayer Perceptron (ANN-MLP) is a useful tool for managing this risk and offers banks more benefits than the traditional binary logistic regression technique. Xu et al., in contrast, use borrower characteristics such as asset status, industry development, profitability, and working capital turnover, and compare random forest, extreme gradient boosting tree, gradient boosting model, and neural network methods. Through feature importance analysis they conclude that the random forest model identifies defaulters well [4]. Other scholars hold different views on the choice of method. Galindo and Tamayo report that decision trees give more accurate predictions than k-nearest neighbors and neural networks [3]. Ensemble techniques may hold better prospects: each model has its own risks and challenges, and ensemble techniques perform better than single classifiers [2].

3.Methodology

3.1.XGBoost Model establishment and application

3.1.1.XGBoost Model establishment

XGBoost (Extreme Gradient Boosting) is a powerful gradient boosting algorithm that builds a strong model by iteratively training multiple weak learners (usually decision trees). At each iteration, the new learner corrects the errors of the model from the previous stage.

Figure 1. Model building

3.1.2.Objective function

The core of XGBoost is an objective function that is minimized during training. The objective function consists of two parts, a loss function and a regularization term:

* Objective function = loss function + regularization term

\( Obj(ϕ)=L(ϕ)+Ω(ϕ) \)

where

\( ϕ=\lbrace {ω_{i}}|i=1,2,3,…,d\rbrace \)

* Loss function: used to measure the difference between the model’s predicted value and the actual label.

\( L=\sum _{i=1}^{n}l({y_{i}},\hat{{y_{i}}}) \)

* Regularization term: used to prevent model overfitting. XGBoost uses L1 (Lasso) and L2 (Ridge) regularization terms, which help control the complexity of the trees. At each iteration, a regularization term is added to the objective function to prevent the tree from growing too deep or too complex.

L1 (Lasso regularization)

\( Ω(ω)=λ{||ω||_{1}} \)

L2 (Ridge regularization)

\( Ω(ω)=λ{||ω||_{2}^{2}} \)
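The decomposition above can be illustrated with a small numerical sketch. It assumes the log loss introduced in Section 3.1.3 as the loss function and treats the weights ω as a plain list of numbers. The function names and the toy values are illustrative assumptions, not the authors' implementation.

```python
import math

def log_loss_sum(y_true, p_pred):
    """Loss term L: sum of per-sample log losses l(y_i, y_hat_i)."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, p_pred)
    )

def l1_penalty(weights, lam):
    """L1 (Lasso) term: lambda times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 (Ridge) term: lambda times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def objective(y_true, p_pred, weights, lam_l1=0.0, lam_l2=0.0):
    """Obj = loss function + regularization term."""
    return (
        log_loss_sum(y_true, p_pred)
        + l1_penalty(weights, lam_l1)
        + l2_penalty(weights, lam_l2)
    )

# Toy values chosen only for illustration.
y = [1, 0, 1]
p = [0.9, 0.2, 0.6]
w = [0.5, -1.0, 2.0]

plain = objective(y, p, w)
penalised = objective(y, p, w, lam_l1=0.5, lam_l2=0.01)
print(penalised - plain)  # the amount added by the two penalties
```

With the same predictions, larger weights raise the objective, which is how the regularization term discourages overly complex models.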

3.1.3.Iterations of Gradient Boosting

XGBoost uses the gradient boosting method to update the model by calculating the gradient of the loss function (the gradient represents the error of the current model for each sample, or the derivative of the loss function relative to the parameters of the model). For classification tasks (log loss):

Loss function:

\( L(y,p)=-[y\log{(p)}+(1-y)\log{(1-p)}] \)

The gradient of the loss function on the model output:

\( \frac{∂L}{∂F(x)}=p-y \)

Here y is the true label, F(x) is the raw output (log-odds) of the current model, and \( p=1/(1+{e^{-F(x)}}) \) is the model’s probability estimate.
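The gradient formula can be checked numerically. The sketch below assumes that the probability is obtained from the model output through the sigmoid function, compares p - y with a finite-difference derivative of the log loss, and then takes one small step against the gradient. The step size of 0.1 and the sample values are arbitrary assumptions.

```python
import math

def sigmoid(f):
    """Map a raw model output F(x) to a probability p."""
    return 1.0 / (1.0 + math.exp(-f))

def log_loss(y, f):
    """Log loss for one sample, written in terms of the raw output f."""
    p = sigmoid(f)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradient(y, f):
    """Analytic gradient of the log loss with respect to f: p - y."""
    return sigmoid(f) - y

def numeric_gradient(y, f, eps=1e-6):
    """Central finite-difference approximation of the same derivative."""
    return (log_loss(y, f + eps) - log_loss(y, f - eps)) / (2 * eps)

samples = [(1, 0.3), (0, -1.2), (1, 2.0)]
max_gap = max(abs(gradient(y, f) - numeric_gradient(y, f)) for y, f in samples)

# One step against the gradient lowers the loss for this sample.
y0, f0 = 1, 0.3
f1 = f0 - 0.1 * gradient(y0, f0)
loss_before, loss_after = log_loss(y0, f0), log_loss(y0, f1)
print(max_gap, loss_before, loss_after)
```

Each boosting iteration fits a new tree to these per-sample gradients, so samples that are predicted badly (large |p - y|) receive the largest corrections.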

4.Experimental results and their analysis

This section begins with an exploratory analysis of the Give Me Some Credit dataset from Kaggle, followed by data cleansing and variable selection. The data is then partitioned into training, validation, and test sets. To address class imbalance within the training set, the positive samples are randomly divided into five equal portions, and each portion is combined with the negative samples, resulting in five new training sets. This approach aims to reduce the severity of class imbalance within each training subset.
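The splitting step can be sketched as follows. The sketch follows the wording above, in which the positive samples are the ones divided into five portions. The helper name, the toy sample counts, and the random seed are illustrative assumptions.

```python
import random

def split_and_combine(positives, negatives, n_parts=5, seed=0):
    """Split `positives` into n_parts random portions and pair each with all `negatives`."""
    rng = random.Random(seed)
    shuffled = positives[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    portions = [shuffled[i::n_parts] for i in range(n_parts)]
    return [portion + negatives for portion in portions]

# Toy data: 100 samples of the class being split, 20 of the other class.
positives = [("pos", i) for i in range(100)]
negatives = [("neg", i) for i in range(20)]

subsets = split_and_combine(positives, negatives)
for subset in subsets:
    n_pos = sum(1 for label, _ in subset if label == "pos")
    print(len(subset), n_pos, len(subset) - n_pos)
```

In this toy setting the class ratio moves from 5:1 in the full data to 1:1 in each subset, and every sample of the split class still appears in exactly one subset.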

4.1.Experimental Data

This paper evaluates the model with a series of experimental indicators. The experimental dataset comes from the Give Me Some Credit project on Kaggle and includes 11 indicators. The column names of the dataset and their meanings are shown in Table 1.

Table 1. Column names of the dataset and their meanings

Column name	Meaning
SeriousDlqin2yrs	Whether overdue or not
RevolvingUtilizationOfUnsecuredLines	Total balance on credit cards and personal credit lines divided by the sum of credit limits
age	Age of the borrower
NumberOfTime30-59DaysPastDueNotWorse	The number of times borrowers were 30 to 59 days past due in the last two years
NumberOfTime60-89DaysPastDueNotWorse	The number of times borrowers were 60 to 89 days past due in the last two years
NumberOfTimes90DaysLate	The number of times borrowers were 90 days or more past due in the last two years
DebtRatio	Debt ratio
MonthlyIncome	Monthly income
NumberOfOpenCreditLinesAndLoans	Number of outstanding loans (installment payments such as car loans or mortgages) and lines of credit (such as credit cards)
NumberRealEstateLoansOrLines	Number of mortgages and real estate loans
NumberOfDependents	Number of dependents in the family (spouse, children, etc.)

Table 1 lists a total of 11 indicators. "SeriousDlqin2yrs" is the target variable, indicating whether a borrower has experienced a loan delinquency. The remaining 10 indicators are the independent variables used in this project to predict it. The dataset comprises 150,000 rows of data.

4.2.Dataset dimension partitioning

A first look at the training dataset shows 150,000 rows of data. A preliminary review found 609 duplicate rows. After the duplicates are deleted, 149,391 rows remain.
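A minimal sketch of the duplicate removal is shown below, using plain tuples in place of dataset rows. The rows are made up for illustration. With a tabular library the same step is usually a single call such as pandas `DataFrame.drop_duplicates`, although the paper does not say which tool was used.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    return list(dict.fromkeys(rows))

# Made-up rows: (SeriousDlqin2yrs, age, DebtRatio)
rows = [
    (0, 45, 0.80),
    (1, 30, 0.20),
    (0, 45, 0.80),   # exact duplicate of the first row
    (0, 61, 0.10),
]

unique_rows = drop_duplicate_rows(rows)
removed = len(rows) - len(unique_rows)
print(removed, len(unique_rows))

# The counts reported in the text are consistent: 150,000 - 609 = 149,391.
remaining_in_paper = 150_000 - 609
```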

Apart from “SeriousDlqin2yrs”, which indicates whether the borrower is overdue, the remaining 10 indicators can be divided into four dimensions: personal information, repaying capability, credit standing, and debt situation. Figure 2 shows the detailed categories.

Figure 2. Dataset dimension partitioning

4.3.Data visualization

Using Python, the dataset was visualized with histograms, box plots, and a heat map. The results are shown below.

4.3.1.Personal information

Figure 3. Histogram of overdue status and number of dependents

Figure 3 shows no obvious correlation between the number of dependents and overdue status.

Figure 4. Normal distribution of sample age

As can be seen from Figure 4, the sample age approximately follows a normal distribution.

Figure 5. Histogram of the number of people who are overdue or not under each age group

From Figure 5 above, it is evident that the older the individuals are, the greater the number of non-delinquent cases. Those aged 60 and above constitute the largest demographic of non-delinquent individuals, whereas the age group between 40 and 50 years exhibits the highest number of delinquent cases. Additionally, when examining the delinquency rate, it becomes apparent that the probability of experiencing delinquency decreases as individuals grow older.

4.3.2.Repaying capability

Figure 6. Histogram of overdue status and income status

As shown in Figure 6, the income of individuals with delinquencies falls mainly in the range of 2500 yuan to 3725 yuan, while the income of those without delinquencies is concentrated in the range of 2500 yuan to 6250 yuan.

4.3.3.Credit standing

4.3.3.1.Historical performance records

Figure 7. Histogram of the number of overdue times between 30 and 59 days

Figure 8. Histogram of the number of overdue times between 60 and 89 days

Figure 9. Histogram of the number of overdue times 90 days late

Figures 7 to 9 show the historical performance records. In the overdue group, the proportion of borrowers with past overdue occurrences is greater than in the non-overdue group.

4.3.3.2.Revolving utilization of unsecured lines

Figure 10. Histogram of the ratio of overdue to total balance

According to Figure 10, in the overdue group, the higher the total balance, the larger the proportion of borrowers.

4.3.4.Debt situation

Figure 11. Histogram of overdue status and open credit lines and loans

Figure 11 shows no obvious correlation between overdue status and the number of outstanding loans and lines of credit.

Figure 12. Histogram of overdue status and debt ratio

From Figure 12, it can be observed that within the debt ratio range of 0.5 to 1.75, the overdue group's proportion of individuals is slightly greater compared to the non-overdue group.

Figure 13. Histogram of overdue status and number of real estate lines or loans

From Figure 13, it can be seen that the proportion of people in the overdue group who have 1 or 2 mortgages is smaller than in the non-overdue group. That is, borrowers with mortgages and real estate as security are less likely to become overdue than borrowers without them.

In summary, a preliminary reading suggests that age (personal information), repaying capability, the number of mortgage loans (debt situation), and credit standing can help predict whether the borrower will become overdue in the future.

4.4.Data cleaning

4.4.1.Missing value handling

Missing values are found in the monthly income and number of dependents columns. The missing values of the number of dependents are filled with the mode of that column, while rows with missing monthly income are deleted.
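The two rules can be sketched on a few made-up records, with `None` standing for a missing value. The helper name is hypothetical. Computing the mode over all observed values before any rows are dropped is an assumption, since the order of the two steps is not stated.

```python
from statistics import mode

def handle_missing(rows):
    """Fill missing NumberOfDependents with the mode; drop rows lacking MonthlyIncome."""
    observed = [
        r["NumberOfDependents"] for r in rows
        if r["NumberOfDependents"] is not None
    ]
    fill_value = mode(observed)  # assumption: mode taken before dropping rows
    cleaned = []
    for r in rows:
        if r["MonthlyIncome"] is None:
            continue                 # delete rows with missing income
        r = dict(r)                  # copy so the input is not modified
        if r["NumberOfDependents"] is None:
            r["NumberOfDependents"] = fill_value
        cleaned.append(r)
    return cleaned

# Made-up records for illustration only.
rows = [
    {"MonthlyIncome": 5000, "NumberOfDependents": 0},
    {"MonthlyIncome": None, "NumberOfDependents": 2},
    {"MonthlyIncome": 3200, "NumberOfDependents": None},
    {"MonthlyIncome": 4100, "NumberOfDependents": 0},
]

cleaned = handle_missing(rows)
print(cleaned)
```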

4.4.2.Outlier processing

After the missing values are processed, the box plots in Figure 14 show that the maximum values of the revolving utilization (RevolvingUtilizationOfUnsecuredLines), debt ratio, and monthly income lie far from the mean, well beyond the mean plus 3 standard deviations. Because these outliers are so large, the mean appears very close to the X-axis at the scale of the plot. Some values of age, the number of open credit lines and loans, and the number of dependents exceed the upper limit of the box plot, but the differences are not large. The counts of 30-59 days, 60-89 days, and 90 days overdue contain obvious outliers. The handling of these outliers is summarized in Table 2.

Figure 14. Box plot

Table 2. Outlier processing results

Variable	Outlier processing result
Age	Delete the upper boundary of 96 and the lower boundary of 0
Debt ratio	Filter values beyond the extreme outlier boundary (Q3+3*IQR) of the box plot
RevolvingUtilizationOfUnsecuredLines	Filter values beyond the extreme outlier boundary (Q3+3*IQR) of the box plot
Monthly income	Delete values greater than 500 thousand
NumberOfTime30-59DaysPastDueNotWorse	Delete the value of 98
NumberOfTime60-89DaysPastDueNotWorse	Delete the value of 98
NumberOfTimes90DaysLate	Delete the value of 98
NumberOfOpenCreditLinesAndLoans	Delete values greater than 20
NumberRealEstateLoansOrLines	Delete values greater than 20
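The Q3+3*IQR rule from Table 2 can be sketched with the standard library. The quartile convention used here is the default of `statistics.quantiles`, which may differ slightly from the one used in the paper, and the sample values are made up.

```python
from statistics import quantiles

def extreme_upper_fence(values):
    """Upper boundary for extreme outliers: Q3 + 3 * IQR."""
    q1, _, q3 = quantiles(values, n=4)   # quartiles, default 'exclusive' method
    return q3 + 3 * (q3 - q1)

def drop_extreme_outliers(values):
    """Keep only the values at or below the extreme upper fence."""
    fence = extreme_upper_fence(values)
    return [v for v in values if v <= fence]

def drop_value(values, sentinel=98):
    """Fixed-value rule from Table 2, e.g. deleting the value of 98."""
    return [v for v in values if v != sentinel]

# Made-up debt ratios with one extreme value.
debt_ratio = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 0.5, 0.6, 3000.0]
filtered = drop_extreme_outliers(debt_ratio)

# Made-up past-due counts containing the value 98.
past_due = [0, 1, 0, 98, 2]
past_due_clean = drop_value(past_due)
print(filtered, past_due_clean)
```

Using a multiplier of 3 rather than the usual 1.5 removes only the most extreme points, which suits heavily skewed variables such as the debt ratio.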

4.5.Variable selection

The Pearson correlation coefficient is used to check for linear correlation between variables. If the coefficient between two variables is greater than 0.6, the redundant variable is deleted.

According to Figure 15, the coefficients between all pairs of variables are less than 0.6, so no variable has to be deleted.

Figure 15. Correlation Coefficient Heat Map
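The screening rule can be sketched as a pairwise check against the 0.6 threshold. Comparing the absolute value of the coefficient is an assumption, and the toy columns, including the deliberately redundant `age_in_months`, are hypothetical and not part of the dataset.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def highly_correlated_pairs(columns, threshold=0.6):
    """Return the column pairs whose |r| exceeds the threshold."""
    return [
        (a, b)
        for a, b in combinations(columns, 2)
        if abs(pearson(columns[a], columns[b])) > threshold
    ]

# Toy columns for illustration; age_in_months is a hypothetical redundant feature.
columns = {
    "age": [25, 35, 45, 55, 65],
    "DebtRatio": [0.4, 0.1, 0.5, 0.2, 0.3],
    "age_in_months": [300, 420, 540, 660, 780],
}

flagged = highly_correlated_pairs(columns)
print(flagged)
```

An empty result corresponds to the situation reported for Figure 15, where no variable needs to be removed.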

5.XGBoost Model application [5]

5.1.The reason for selecting the XGBoost model

We need a model that classifies borrowers from their features, and many methods are available. To choose among them, we compute the mean cross-validation score (cv_mean) of each candidate method in Python. XGBoost has the highest cv_mean in Table 3, so we choose it for classification.

Table 3. CV-Means of different methods

Model	cv_mean
KNN	0.5936
Decision Tree	0.6101
Random Forest	0.8381
XGBoost	0.8575
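A cv_mean value of this kind can be produced as sketched below, assuming it denotes the average score over cross-validation folds. The scoring metric and the number of folds behind Table 3 are not stated, so the sketch uses accuracy, five folds, and a toy one-feature threshold model as stand-ins. In practice a library routine such as scikit-learn's `cross_val_score` would typically be used.

```python
def k_fold_indices(n, k):
    """Assign sample indices to k folds in round-robin fashion."""
    return [list(range(i, n, k)) for i in range(k)]

def fit_threshold(xs, ys):
    """Toy model: threshold halfway between the two class means."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2

def predict(threshold, x):
    """Predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0

def cv_mean(xs, ys, k=5):
    """Mean held-out accuracy across k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit_threshold(train_x, train_y)
        correct = sum(predict(model, xs[i]) == ys[i] for i in fold)
        scores.append(correct / len(fold))
    return sum(scores) / len(scores)

# Toy, perfectly separable data for illustration.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]

score = cv_mean(xs, ys)
print(score)
```

Because every sample is scored only by a model that never saw it during fitting, the averaged score is a fairer basis for comparing methods than the training score.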

5.2.Model optimization and parameter adjustment

(1) learning_rate: A large learning rate gives greater weight to the contribution of each tree in the ensemble, which speeds up training but may lead to overfitting or instability. A lower learning rate suppresses the contribution of each tree, making the learning process slower but more robust. This regularizing effect of the learning rate is particularly useful for complex and noisy datasets.

(2) max_depth: The maximum depth (max_depth) controls the maximum number of levels a decision tree may reach during training. Deeper trees can capture more complex interactions between features. However, deeper trees also have a higher risk of overfitting because they can memorize noise or irrelevant patterns in the training data. To control this complexity, we can limit max_depth, resulting in shallower, simpler trees that capture more general patterns. The chosen max_depth value provides a good balance between complexity and generalization.

(3) alpha, lambda: alpha (L1) and lambda (L2) are two regularization parameters that help prevent overfitting.

Unlike other regularization parameters, they can shrink the weights of unimportant features toward 0 (alpha in particular can set them to exactly 0), resulting in a model with fewer features and thus reduced complexity. The effect of alpha and lambda may depend on other parameters such as max_depth, subsample, and colsample_bytree. Higher alpha or lambda values may require adjusting other parameters to compensate for the increased regularization.
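The trade-off described for learning_rate can be illustrated with a deliberately simplified sketch. It assumes a squared-error loss and a weak learner that fits the current residual exactly, which is not how XGBoost trees behave on real data. It shows only that a smaller rate needs more boosting rounds to close the same gap.

```python
def boost_towards(target, learning_rate, n_rounds):
    """Repeatedly add a shrunken correction toward `target`; return the predictions."""
    prediction = 0.0
    history = []
    for _ in range(n_rounds):
        residual = target - prediction          # what the next learner would fit
        prediction += learning_rate * residual  # shrink its contribution
        history.append(prediction)
    return history

fast = boost_towards(target=1.0, learning_rate=0.5, n_rounds=10)
slow = boost_towards(target=1.0, learning_rate=0.01, n_rounds=10)
print(fast[-1], slow[-1])

def rounds_needed(learning_rate, tolerance=0.01):
    """Rounds until the remaining gap (1 - learning_rate)**n falls below tolerance."""
    n, gap = 0, 1.0
    while gap >= tolerance:
        gap *= 1 - learning_rate
        n += 1
    return n

print(rounds_needed(0.5), rounds_needed(0.01))
```

In this idealized setting a rate of 0.01 needs several hundred rounds to get within 1% of the target, which is why a small learning rate is normally paired with a large number of trees.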

Final parameter result:

Table 4. Parameter values

parameter	Value	parameter	Value
learning_rate	0.01	reg_lambda	0.01
max_depth	6	reg_alpha	0.5
num_leaves	16	n_estimators	575
min_child_samples	22	min_child_weight	0.0001
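The values in Table 4 could be passed to the scikit-learn style XGBoost wrapper roughly as sketched below. As far as we can tell from the XGBoost documentation, num_leaves and min_child_samples are LightGBM parameter names rather than XGBoost ones, so the sketch sets them aside. That split, the helper name, and the `binary:logistic` objective are assumptions, not details given in the paper.

```python
# Values copied from Table 4.
table_4 = {
    "learning_rate": 0.01,
    "max_depth": 6,
    "num_leaves": 16,
    "min_child_samples": 22,
    "reg_lambda": 0.01,
    "reg_alpha": 0.5,
    "n_estimators": 575,
    "min_child_weight": 0.0001,
}

# Assumption: these two names are not XGBoost parameters and are kept separate.
set_aside = {"num_leaves", "min_child_samples"}
xgb_params = {name: value for name, value in table_4.items() if name not in set_aside}

def build_classifier(params):
    """Create an untrained XGBoost classifier (requires the xgboost package)."""
    from xgboost import XGBClassifier  # imported lazily; third-party dependency
    return XGBClassifier(objective="binary:logistic", **params)

print(sorted(xgb_params))
```

Calling `build_classifier(xgb_params)` returns a model that can then be trained with `fit` on the training subsets described in Section 4.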

6.Model testing

6.1.ROC curves

Coefficient: 0.87

The ROC curve coefficient (the area under the ROC curve) is one of the important indicators of the classification accuracy of the model. Its value ranges between 0 and 1; the closer it is to 1, the stronger the classification ability of the model. The ROC curve coefficient of 0.87 shows that the model has a good ability to distinguish between positive and negative examples, maintaining a high true positive rate and a low false positive rate across different thresholds. This result indicates that the model can provide useful predictions in this classification task.

Figure 16. Receiver operating characteristic
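The ROC curve coefficient can be computed without plotting the curve, assuming it denotes the area under the ROC curve. The sketch below uses the pairwise interpretation: the share of positive and negative pairs in which the positive example receives the higher score, with ties counted as one half. The labels and scores are toy values.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, so for a dataset of this size a sorting-based routine such as scikit-learn's `roc_auc_score` would be the practical choice.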

6.2.KS curves

Coefficient: 0.56

The KS curve coefficient is one of the important indicators of the discrimination of the model. Its value ranges between 0 and 1; the closer it is to 1, the better the discrimination of the model. As Figure 17 shows, the KS curve coefficient of 0.56 indicates that the model distinguishes between positive and negative examples fairly well, but not very strongly. The model can therefore separate positive and negative examples to a certain extent, but there is room for improvement.

Figure 17. K-S curve
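The KS coefficient is commonly computed as the largest gap between the true positive rate and the false positive rate over all score thresholds. The sketch below assumes that definition and uses toy labels and scores.

```python
def ks_statistic(labels, scores):
    """Maximum |TPR - FPR| over all candidate thresholds."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = 0.0
    for threshold in sorted(set(scores)):
        tpr = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold) / n_pos
        fpr = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold) / n_neg
        best = max(best, abs(tpr - fpr))
    return best

# Toy example for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

ks = ks_statistic(labels, scores)
print(ks)  # 0.5, reached at thresholds 0.35 and 0.8
```

The threshold at which the gap peaks is often used as a starting point for choosing the cut-off score in a lending decision.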

7.Conclusion

7.1.Summary

This paper analyzes credit risk in financial institutions, employing machine learning techniques to assess credit risk based on borrower characteristics. In the cv_mean comparison in Table 3, the Extreme Gradient Boosting (XGBoost) model outperforms the other methods compared, namely KNN, Decision Tree, and Random Forest. After pre-processing the "Give Me Some Credit" dataset from Kaggle, which involved data cleansing, variable selection, partitioning, and addressing class imbalance, the XGBoost model was applied, resulting in a ROC curve coefficient of 0.87 and a KS curve coefficient of 0.56. These results indicate the model's proficiency in distinguishing between positive and negative samples, albeit with room for improvement.

7.2.Future Works

Future work could enhance this research from several perspectives: Firstly, exploring more feature engineering techniques and data pre-processing methods could improve model performance. Secondly, given the potential of different models to excel in specific scenarios, employing model ensemble techniques to combine the strengths of various machine learning models could further increase predictive accuracy. Additionally, to enhance the model's generalizability, testing and validating the model on a broader and more diverse dataset is recommended. Through the exploration and implementation of these approaches, the predictive capability of the model regarding credit risk can be further strengthened, providing more reliable decision support for financial institutions.

Acknowledgments

Wenhao Wang, Xiyi Zuo, and Dantong Han contributed equally to this work and should be regarded as co-first authors.

References

[1]. Ali, SEA, Rizvi, SSH, Lai, F, Ali, RF & Jan, AA (2021), Predicting Delinquency on Mortgage Loans: An exhaustive parametric comparison of machine learning techniques, International Journal of Industrial Engineering and Management [online] 12(1):pp.1–13. Available at: https://doi.org/10.24867/ijiem-2021-1-272 [Accessed 10 October 2023].

[2]. Bhatore, S, Mohan, S & Reddy, YR (2020) Machine learning techniques for credit risk evaluation: a systematic literature review, Journal of Banking and Financial Technology, [online] 4(1):pp.111–138. Available at: https://doi.org/10.1007/s42786-020-00020-3 [Accessed 10 October 2023].

[3]. Galindo, JF & Tamayo, P (2000), Credit Risk Assessment Using Statistical and Machine Learning: Basic Methodology and Risk Modeling Applications, Computational Economics [online] 15(1/2):pp.107–143, Available at: https://doi.org/10.1023/a:1008699112516 [Accessed 10 October 2023].

[4]. Xu, J., Lu, Z. and Xie, Y. (2021) Loan default prediction of Chinese P2P market: A machine learning methodology. Scientific Reports [online] 11(1) Available at: https://dx.doi.org/10.1038/s41598-021-98361-6 [Accessed 10 October 2023].

[5]. Zhang, F., Jing, Y., Guo, Y. Q., & Gu, H. (2020). Multi-source heterogeneous and XBOOST vehicle sales forecasting model. In Advances in intelligent systems and computing [online] pp. 340–347, Available at: https://doi.org/10.1007/978-3-030-62746-1_50 [Accessed 20 October 2023]

Cite this article

Wang,W.;Zuo,X.;Han,D. (2024). Predict credit risk with XGBoost. Applied and Computational Engineering,74,164-177.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 2nd International Conference on Software Engineering and Machine Learning

ISBN：978-1-83558-507-8(Print) / 978-1-83558-508-5(Online)

Editor：Stavros Shiaeles

Conference website: https://www.confseml.org/

Conference date: 15 May 2024

Series: Applied and Computational Engineering

Volume number: Vol.74

ISSN：2755-2721(Print) / 2755-273X(Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.