1. Introduction
In recent years, as research on factor investing has matured, an increasing number of active investors have shifted their focus to the factor market [1]. With the growing number of newly listed stocks in the A-share market, the time and effort required to research and select individual A-share stocks have also increased. More funds are now concentrated on selecting baskets of stocks within specific categories, which has heightened the importance of factor exploration.
This paper studies narrow-based indices covering 29 primary industries in China. By constructing a factor model to assess the trend of these indices, it aims to assist investors in allocating to and investing in the corresponding ETFs. Because determining the directional trend of industry indices is a classification problem, a large amount of discrete data must be handled efficiently. To address this, we introduce machine learning models capable of handling large-scale data effectively. By constructing an industry index prediction model for sector allocation, we aim to outperform the market in terms of alpha returns.
2. Literature Review
The concept of factors originates from the Capital Asset Pricing Model (CAPM) proposed by William Sharpe [2], which decomposes stock returns into market portfolio returns (beta) and excess returns (alpha). Building upon this, the Arbitrage Pricing Theory (APT) extends the stock return pricing model by analyzing the contribution of different risk factors to returns [3]. These factors can be classified into two categories: macro factors, including economic growth, financial data, and commodity prices, and fundamental factors, primarily encompassing valuation indicators. This paper examines the predictive capabilities of linear and nonlinear machine learning models on macro and fundamental factors separately, aiming to analyze the similarities and differences in price forecasting. It provides new insights into the field of industry asset allocation.
2.1. Logistic Regression in Linear Models
The logistic regression model is commonly used for binary classification problems within the family of linear models, and its formula is given in equation (1).
\( P(y=1\mid x)=\frac{e^{x\vec{\beta}}}{1+e^{x\vec{\beta}}} \) (1)
This formula indicates that when the estimated probability \( P(y=1\mid x) \) is greater than 1/2, the predicted value is 1; otherwise, the predicted value is 0.
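The decision rule in equation (1) can be illustrated with a minimal sketch; the feature vector x and coefficient vector β below are hypothetical placeholders rather than fitted values from this study.

```python
import numpy as np

def predict_direction(x: np.ndarray, beta: np.ndarray) -> int:
    """Equation (1): return 1 when P(y=1|x) > 1/2, else 0."""
    p = np.exp(x @ beta) / (1.0 + np.exp(x @ beta))  # logistic probability
    return int(p > 0.5)

# Hypothetical example with two standardized macro factors
x = np.array([0.8, -0.3])
beta = np.array([1.2, 0.5])
print(predict_direction(x, beta))  # prints 1 because the estimated probability exceeds 0.5
```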
To select the most effective factors, a penalty term is usually added so that only the most predictive feature variables are retained [4]. High-dimensional data generally makes model prediction more difficult, and adding more features does not necessarily improve predictive accuracy. To reduce this difficulty and improve accuracy, regularization techniques are commonly employed. Common regularization methods include Lasso and Ridge. Lasso reduces the dimensionality of the model by selecting a smaller number of features and discarding unimportant ones, which mitigates overfitting and enhances the model's generalization ability (equation (2)). Ridge, on the other hand, compresses the weights of the model's factors to reduce model complexity (equation (3)).
\( L(\vec{\beta})=\|y-X\vec{\beta}\|^{2}+\lambda\|\vec{\beta}\|_{1} \) (2)
\( L(\vec{\beta})=\|y-X\vec{\beta}\|^{2}+\lambda\|\vec{\beta}\|_{2}^{2} \) (3)
Here, \( \|\vec{\beta}\|_{1} \) denotes the 1-norm of \( \vec{\beta} \) and \( \|\vec{\beta}\|_{2}^{2} \) its squared 2-norm. The parameter \( \lambda \) is the penalty factor: the larger \( \lambda \) is, the more the coefficients in \( \vec{\beta} \) are shrunk toward 0.
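As a minimal sketch of how the L1 and L2 penalties in equations (2) and (3) are applied to a logistic classifier in practice, the snippet below uses scikit-learn on randomly generated placeholder data; the factor matrix, labels, and penalty strength C (the inverse of \( \lambda \)) are illustrative assumptions, not the settings used in this study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: 500 days of 105 standardized macro factors and 0/1 index directions
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 105))
y = (rng.random(500) > 0.5).astype(int)

# Lasso-style (L1) penalty, equation (2): shrinks uninformative factor weights to exactly zero,
# effectively performing factor selection (C is the inverse of the penalty factor lambda)
lasso_clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero coefficients:", int(np.sum(lasso_clf.coef_ != 0)))

# Ridge-style (L2) penalty, equation (3): keeps all factors but compresses their weights
ridge_clf = LogisticRegression(penalty="l2", C=0.1).fit(X, y)
```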
2.2. AdaBoost in Nonlinear Models
The AdaBoost model is another binary classification supervised learning method, built on an exponential loss function. It uses a forward stagewise additive algorithm for learning and achieves higher prediction accuracy than weak classifiers that are only slightly better than random guessing [5]. At each iteration, the sample weights are updated according to the classification error: the weights of misclassified samples are increased and those of correctly classified samples are reduced, so that successive weak classifiers concentrate on the harder samples. Through iterative updating and a weighted combination, the weak classifiers \( {G_{w}}(x) \) are assembled into the final model. The model's main components are the maximum depth of the underlying decision trees, the number of iterations, and the combination weights of the weak classifiers (equation (4)).
\( f(x)=\sum_{w=1}^{W}\left(\ln\frac{1}{\alpha_{w}}\right){G_{w}}(x) \) (4)
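A minimal sketch of this configuration using scikit-learn's AdaBoostClassifier is shown below; the base-tree depth, number of iterations, and learning rate are hypothetical choices for illustration, and the placeholder data from the previous sketch is reused.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Placeholder data, as in the previous sketch
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 105))
y = (rng.random(500) > 0.5).astype(int)

# Weak classifiers: shallow decision trees of a chosen maximum depth,
# combined over W boosting iterations as in equation (4)
ada_clf = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # base tree ("estimator" in scikit-learn >= 1.2)
    n_estimators=200,                               # number of iterations W
    learning_rate=0.5,
)
ada_clf.fit(X, y)
```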
3. Research Methods
3.1. Data Collection
In data collection, factors are divided into four major categories based on the mainstream macro concepts: valuation, financial environment, market sentiment, and economic fundamentals. Valuation factors include the PB, PE, and PS values of major broad-based indexes, sourced from Wind Information (Wind). Financial environment factors include indicators that reflect domestic and international market stability, such as exchange rates and important global indexes like the S&P 500 and NASDAQ. Market sentiment factors reflect changes in market participants' sentiment and social fund demand, such as margin financing and securities lending balances and trading volumes of major exchanges. Economic fundamentals factors include CPI and PPI, which mainly reflect the current macroeconomic conditions and economic cycles, such as data representing inflation and real estate transaction volumes that reflect economic momentum.
3.2. Data Processing
A total of 105 factors were collected across the four macro dimensions. To analyze the trend of the indexes, the data underwent a uniform differencing process. The data frequency includes daily and monthly series spanning January 2017 to December 2022, a total of 6 years. To increase the amount of data available for analysis, the frequency was unified to daily, with monthly data mapped onto daily observations. The data also underwent preprocessing steps such as lagging, outlier removal, and standardization. Lagging aligns each factor observation with the subsequent change in the underlying index, so that the features are genuinely predictive rather than contemporaneous. Outliers beyond three standard deviations from the mean were replaced, preventing extreme values from biasing or distorting the model and thereby increasing its robustness. Standardization unifies the scale of the variables and facilitates cross-sectional analysis. The information coefficient (IC) was also calculated to gauge the magnitude and direction of each factor's impact on future returns. The sign of each differenced factor was then adjusted according to the sign of its IC, so that the data generated opening signals in the correct direction.
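The sequence of steps above can be summarized in a short pandas sketch; the function below is an illustrative reconstruction under assumed daily series, and the lag length, clipping rule, and rank-based IC are assumptions rather than the exact implementation used in the study.

```python
import pandas as pd

def preprocess_factor(factor: pd.Series, next_returns: pd.Series, lag: int = 1):
    """Difference, lag, clip 3-sigma outliers, standardize, and sign-adjust one factor."""
    f = factor.diff()             # uniform differencing to work with factor changes
    f = f.shift(lag)              # lag so today's signal only uses past factor information

    mu, sigma = f.mean(), f.std()
    f = f.clip(lower=mu - 3 * sigma, upper=mu + 3 * sigma)  # replace 3-sigma outliers

    f = (f - f.mean()) / f.std()  # standardize to a common scale

    ic = f.corr(next_returns, method="spearman")  # information coefficient vs. future returns
    if ic < 0:
        f = -f                    # flip the factor so its opening signal follows the IC direction
    return f, ic
```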
3.3. Testing Procedure
The construction of the model includes feature selection, data preprocessing, model parameter selection, and model training to predict the trend of industry indices on the next trading day. During feature selection, an initial screening is performed using the Information Coefficient (IC) test to measure the strength of the correlation between macro factors and industry indices, and factors with low correlation to price changes are removed. In data preprocessing, the three-standard-deviation clipping described above is used to prevent extreme values from degrading the effectiveness of the model.
In this study, both the linear and non-linear models described above are employed, and they are trained and tested repeatedly using cross-validation to find the optimal model parameters. The backtesting framework is based on tradable industry indices and allows both long and short positions. Index closing prices are used as the buy and sell prices from which the strategy's net value is computed. Backtest results are evaluated on metrics such as model accuracy, the Sharpe ratio of the net value curve, the Calmar ratio, and the maximum drawdown.
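For reference, the evaluation metrics named above can be computed from a daily strategy return series as in the sketch below; the annualization constant and the Calmar definition (annualized return divided by the absolute maximum drawdown) are common conventions assumed here, not details stated in the paper.

```python
import numpy as np
import pandas as pd

def backtest_metrics(daily_returns: pd.Series, trading_days: int = 252) -> dict:
    """Annualized Sharpe ratio, maximum drawdown, and Calmar ratio of a daily return series."""
    net_value = (1 + daily_returns).cumprod()                       # net value curve
    sharpe = daily_returns.mean() / daily_returns.std() * np.sqrt(trading_days)

    running_max = net_value.cummax()
    max_drawdown = ((net_value - running_max) / running_max).min()  # most negative dip, <= 0

    years = len(daily_returns) / trading_days
    annual_return = net_value.iloc[-1] ** (1 / years) - 1
    calmar = annual_return / abs(max_drawdown)                      # assumes a non-zero drawdown
    return {"sharpe": sharpe, "max_drawdown": max_drawdown, "calmar": calmar}
```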
4. Results and Analysis
By calculating the Information Coefficient (IC), we can assess the predictive ability of individual factors in the logistic model. Valuation factors show a negative correlation with the rate of change of the various industry indices, and a similar pattern holds for market sentiment factors. In contrast, financial environment and economic fundamentals factors show slightly lower IC values than valuation and market sentiment factors, but most of them have positive ICs. To increase the probability of predicting the correct opening direction for individual factors in the classification model, the signs of the factor changes were adjusted according to the tested IC direction. For the logistic model, the Defense and Military and Building Materials sectors showed the highest cumulative returns in the training set, at 98.81% and 89.41% respectively. The sectors with the highest Sharpe ratios were Transportation and Conglomerates, at approximately 1.129 and 1.123 respectively. In the testing set, Transportation and Conglomerates also showed relatively small maximum drawdowns, indicating that the logistic model generalizes better for these industry indices than for others (Table 1).
In the AdaBoost model, the industry indices with the highest cumulative returns, Electronics and Non-ferrous Metals, outperformed the logistic model, with returns of 119.958% and 123.715% respectively. However, the overall Sharpe ratio of the AdaBoost model was lower than that of the logistic model, suggesting that its risk control in predicting industry indices is inferior to the logistic model's (Table 2).
5. Conclusion
Starting from machine learning models, the experiments above tested logistic regression and AdaBoost, a common linear model and a common non-linear model respectively, on industry indices. The results show that, with limited data, the linear model delivers more stable predictions than the non-linear model. Specifically, when the four major macro factor categories were used as inputs, both models performed relatively well for the Transportation industry index. In later stages of this work, the models will be further optimized to explore the capabilities of machine learning in handling large-scale and non-linear data, providing new investment insights for asset allocation across asset classes.
References
[1]. Zhang, Y., & Wang, L. (2022). "Trends in Factor Investing and A-Share Market Dynamics," Journal of Financial Markets Research.
[2]. Sharpe, W.F. (1964). "Capital Asset Prices: A Theory of Market Equilibrium under Conditions of Risk," Journal of Finance.
[3]. Ross, S.A. (1976). "The Arbitrage Theory of Capital Asset Pricing," Journal of Economic Theory.
[4]. Nguyen, H., & Lee, J. (2019). "Factor Selection in Equity Markets: Using Logistic Regression and Regularization," Journal of Quantitative Finance.
[5]. Smith, A., & Zhao, X. (2020). "Enhancing Financial Prediction Models with AdaBoost Algorithm," Finance and Data Science.
Data availability
The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.