Analyzing Factors of Stock Price Index and Using Simple Linear Regression Techniques

Hongjin Xiang

doi:10.54254/2754-1169/20/20230206

1. Introduction

As important indicators of a country's or region's economic and financial activity, stock price indexes are of great research interest. The relationship between macroeconomic factors, monetary policy and asset prices has long been a focus of financial economics, and the drivers of stock price fluctuations are one direction this paper tries to analyze and understand. For investors, analyzing macroeconomic indicators is an important part of decision making. This paper is divided into two parts. The first part is an exploratory data analysis of macro factors and of the possibility that some of them affect the HS300 index [1]. The second part is an attempt to apply simple linear regression to the case of an individual stock and the HS300. In addition, this paper classifies the factors into macro environment indicators [2], such as GDP, CPI and industrial indexes, and market valuation indicators, such as market average PE and PS. The HS300 is used as a benchmark to describe the correlation between macro factors and HS300 volatility.

Brasoveanu L O points out that, as financial and securities markets develop, more and more people pay attention to stock price movements and identify the factors behind them by analyzing the impact of the macro and micro economic environment on stock prices [3]. Evangelia Papapetrou argues that these factors include not only the company's current business situation but also trends in the stock's industry and in the macro economy as a whole [4]. In addition, Li, H. analyzes different listed robotics companies and finds that the share of R&D expenses, the growth of GWM and the growth of marketing expenses in the robotics industry influence stock prices to different degrees [5]. Zhou, X explores the impact of terms of trade, oil prices, interest rates, money supply and industrial production indices on stock price movements through a study of stock markets and firms in the US, Japan and China [6]. Bochen Li takes two emerging Internet companies, Alibaba and Jingdong, as the subjects of his study, uses linear regression to analyze their stock movements, and then examines the factors influencing stock prices in this industry to give investors guidance for reducing expected losses [7].

2. Data Profile

Because I have a traditional business background and limited knowledge of mathematical statistics, this paper uses the following basic graphs to depict the data.

Data distribution graph: shows the distribution of the data, mainly to assess normality, skewness and kurtosis.

Scatter plot: shows the relationship between pairs of variables and between each variable and the HS300, mainly to look for linear relationships and to consider how non-linear relationships could be transformed.

Heat map: shows the correlations among variables, to find macro factors that are strongly correlated with the HS300.

The dataset is obtained from ricequant's stock financial data, Wind's EDB macro database, and relevant financial websites. Because sectoral data are published with a lag, daily data for some indicators are taken with a one-day lag and monthly data with a two-month lag. Quarterly data are available only for GDP and the unemployment rate, and are taken with a one-quarter lag. In this paper, daily data are analyzed with a one-day lag, so the previous day's data are used to forecast; monthly data are analyzed with a one-month lag, so the previous month-end data are used to forecast.
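The lagging described above can be sketched as follows. This is only an illustration under assumptions: the paper does not show its code, the values are made up, and the column name turnover_ratio_lag1 is hypothetical.

```python
import pandas as pd

# Hypothetical daily data; the numbers are invented for illustration only.
dates = pd.date_range("2017-01-02", periods=5, freq="B")
df = pd.DataFrame(
    {
        "close": [3300.0, 3310.0, 3295.0, 3320.0, 3335.0],
        "turnover_ratio": [1.1, 1.3, 0.9, 1.2, 1.0],
    },
    index=dates,
)

# Lag the factor by one trading day, so that each row pairs today's close
# with the factor value that was already known at the end of the previous day.
df["turnover_ratio_lag1"] = df["turnover_ratio"].shift(1)

# The first row has no previous observation, so it is dropped.
lagged = df.dropna(subset=["turnover_ratio_lag1"])
print(lagged)
```

For monthly series the same shift(1) would be applied to month-end observations before they are merged with the index data.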

The variables used in this paper are described in Table 1.

Table 1: Data dictionary.

Column Name	Column Meaning	Explanation	Attributes
circulating_market_cap	Market capitalization in circulation (billion yuan)	Outstanding market value refers to the total value of shares outstanding at a given time derived by multiplying the number of shares tradable at that time by the prevailing share price.	Circulation
turnover_ratio (%)		It refers to the frequency of stock changing hands in the market within a certain period of time, and is one of the indicators of the strength of stock liquidity.	Circulation
pe_ratio	PE/TTM	Market price per share is a multiple of earnings per share, reflecting the price investors are willing to pay for each dollar of net profit, and is used to estimate the reward and risk of investing in a stock	Valuation
pb_ratio	PB	Ratio of share price per share to net assets per share	Valuation
ps_ratio	PS/ TTM	The smaller the price-to-sales ratio, the higher the value of the investment is usually considered.	Valuation
ppi	Production material price index		Price
qm	Extractive industry price index		Price
rmi	Raw Material Industrial Price Index		Price
pi	Process Industry Price Index		Price
cg	Price index of living materials		Price
GDP_PRIMARY	Gross Domestic Product - Primary Sector		Price
GDP_SECONDARY	Gross Domestic Product-Secondary Industry		Price
GDP_TERTIARY	Gross Domestic Product-Tertiary Sector		Price
GDPYOY	GDP year-on-year		Price Growth
GDP_PRIMARYYOY	GDP year-on-year - Primary sector		Price Growth
GDP_SECONDARYYOY	GDP year-on-year - Secondary sector		Price Growth
EXPECTATIONIDX	Consumer expectations index		Price
SATISFACTIONIDX	Consumer Satisfaction Index		Price
CONFIDENCEIDX	Consumer Confidence Index		Price
BOOMIDX	Business Prosperity Index		Price
CONFIDENCEIDX	Entrepreneur Confidence Index		Price
PMI	Manufacturing Purchasing Managers' Index		Price
HOUSEFUND	Personal Housing Provident Fund Deposit - Current Year Contribution		Price
LOAN6MONTH	Short-term loans - up to and including six months		Price
HOUSEFUND5YEAR	Personal Housing Provident Fund loans - up to and including five years		Price
INDIVIDUALHOUSE6MONTH	Commercial banks' captive personal housing loans - up to and including six months		Price

3. Exploratory Data Analysis (EDA)

First, I think of EDA not just as a set of techniques, functions or graphs, but as an attitude or philosophy about how to do data analysis. Rather than imposing the usual assumptions about which model the data follow, EDA allows the data themselves to reveal their underlying structure and model more directly [8]. It is a philosophy about how we dissect datasets, what we look for, how we look, and how we interpret their intrinsic connections. It is true that EDA makes extensive use of the collection of techniques we call "statistical graphics", but it is not identical to statistical graphics (see Table 1).

Most EDA techniques are inherently graphical, with some quantitative techniques. The reason for this heavy reliance on graphs is that EDA's primary role is open-minded exploration: graphs, combined with the natural pattern recognition capabilities we all possess, give analysts an unparalleled ability to entice the data to reveal their structural secrets and to gain new, often unexpected, insight into the data.

Thus, the first step of data analysis is to find missing, distorted or misrepresented data; the data are then cleaned and filled so that they can be used reliably. Finally, the real relationships among the variables have to be found so that a suitable model can be selected. The following steps are the data analysis done in this project.

3.1. View Missing Values

The data processing starts with the missing data, which are listed below in Table 2.

Table 2: Dictionary of missing values.

Missing Data	Quantity
dates	0
open	61
close	61
PB	1
leading_idx	34
retail_sin	241
fixed_assets_investment	259
ppi	112
rmi	112
pi	112
cg	112
food	133
clothing	112
roeu	112
dcg	112
retail_sin_rate	366
fixed_assets_investment_rate	259
CPI_rate	34
GDP_rate	51
dtype:	int64
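A count of this kind can be produced in a few lines. The sketch below assumes pandas and uses a small made-up frame rather than the paper's dataset; only the column names are borrowed from Table 2.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame; values are invented for illustration only.
df = pd.DataFrame(
    {
        "dates": pd.date_range("2005-01-03", periods=6, freq="B"),
        "close": [np.nan, np.nan, 1000.0, 1003.5, 998.2, 1001.0],
        "ppi": [101.2, np.nan, np.nan, np.nan, 100.8, 100.9],
    }
)

# isnull() marks each missing cell; sum() counts them column by column.
missing_counts = df.isnull().sum().sort_values(ascending=False)
print(missing_counts)
```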

Figure 1: Stock factor categories.

Figure 1 shows that, out of a total of 3150 observations, the fixed asset investment data ('fixed_assets_investment') and retail sales data ('retail_sin') have the most missing values. The industrial ex-factory price indexes ('ppi', 'food', 'clothing') are also missing for about four months, while other series such as pmi and import are missing for less than two months. By studying the macro database I found that the 'fixed_assets_investment' series does not include any January data, and that the 'retail_sin' series does not include data for January and April after 2011; these gaps account for the bulk of the missing values. The industrial ex-factory index is occasionally missing one or two months of data over the last 10 years. In addition, the ex-factory index database contains many NaN entries, which should be handled during data cleaning. Finally, the missing pmi, import and similar data arise because the figures for the current and previous month had not yet been entered, and the missing open and close values arise because CSI 300 opening and closing prices are unavailable for January-April 2005.

For the missing opening and closing prices in January-April 2005, this paper removes the affected rows from the data sample; for 'fixed_assets_investment', 'retail_sin' and the other series in the industrial ex-factory PPI database, missing values are replaced with the previous month's data; finally, November and December 2017 are excluded from the sample to deal with the missing pmi, import and similar data.
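The treatment just described can be sketched as below. This is a minimal illustration assuming pandas; the monthly frame and its values are invented, and the real dataset would have many more columns.

```python
import numpy as np
import pandas as pd

# Hypothetical month-end data; values are invented for illustration only.
index = pd.to_datetime(
    ["2005-01-31", "2005-02-28", "2005-03-31",
     "2005-04-30", "2005-05-31", "2005-06-30"]
)
df = pd.DataFrame(
    {
        "close": [np.nan, np.nan, 950.0, 930.0, 880.0, 860.0],
        "fixed_assets_investment": [np.nan, 24.5, 25.3, np.nan, 26.4, 27.1],
    },
    index=index,
)

# Macro series: replace a missing month with the previous month's value.
df["fixed_assets_investment"] = df["fixed_assets_investment"].ffill()

# Index prices: rows without a closing price are removed from the sample.
clean = df.dropna(subset=["close"])
print(clean)
```

Forward filling only uses information that was already available at the time, which keeps the filled series consistent with the lagged forecasting setup.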

3.2. Distribution of Data Values

To understand the overall behavior of the data, the distribution of the HS300 price data is analyzed first; the distribution chart is as follows.

Figure 2: HS300 data distribution.

For the price distribution in Figure 2, the skewness and kurtosis are calculated as

skewness: close 0.175869

kurtosis: close 0.341954

The skewness of 0.175869 indicates that prices are slightly skewed to the right; the kurtosis of 0.341954 indicates that the kurtosis is not high. Because a normal distribution has skewness 0 and kurtosis 3, and the kurtosis of the HS300 price differs greatly from that benchmark, it can be judged that the HS300 price does not follow a normal distribution. The empirical distribution of the index is not centered on the red mean line; the highs are skewed high and the lows are skewed low. Next, the distribution of the daily change of the HS300 is analyzed, as shown below.
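The paper does not state how the skewness and kurtosis were computed. The following is a minimal sketch, assuming pandas and using a simulated series in place of the HS300 closing prices. One point worth checking when reading such output is the kurtosis convention: pandas' Series.kurt() reports excess kurtosis (Fisher's definition), for which a normal distribution scores 0, whereas the benchmark of 3 quoted above belongs to the Pearson definition.

```python
import numpy as np
import pandas as pd

# Simulated stand-in for the closing price series (not the paper's data).
rng = np.random.default_rng(0)
close = pd.Series(rng.normal(loc=3000.0, scale=500.0, size=3000), name="close")

skewness = close.skew()             # 0 for a symmetric distribution
excess_kurtosis = close.kurt()      # Fisher definition: normal distribution -> 0
pearson_kurtosis = excess_kurtosis + 3.0  # Pearson definition: normal -> 3

print(f"skewness: {skewness:.6f}")
print(f"excess kurtosis: {excess_kurtosis:.6f}")
print(f"Pearson kurtosis: {pearson_kurtosis:.6f}")
```

Whichever convention is used, the reported value should be compared with the matching normal benchmark (0 or 3).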

Figure 3: HS300 daily price change distribution.

Figure 3 shows that the skewness and kurtosis of the daily price change (Change_rate) are -0.278169 and 4.178902. The skewness of -0.278169 means that the distribution is slightly skewed to the left, with its center of gravity to the right; the mean is around 0; the kurtosis of 4.178902 is close to that of the normal distribution. The daily change is therefore unlikely to have a linear correlation with the stock price.

3.3. Single Factor Analysis

Figure 4: Single-factor distribution.

Figure 4 shows the distributions of the individual factors, which are analyzed one by one below.

PB: right-skewed, with the center of gravity on the left and slightly high kurtosis, so it can be log-transformed to make it conform to an unbounded Johnson distribution, and it can also be standardized.

PCF: almost no skewness but very high kurtosis, so it is not suitable for standardization. Its kurtosis can be reduced, but because the usual log transform does not work with negative values, the series can first be shifted by its minimum value and then log-transformed.

PE: positive skewness of small magnitude; the kurtosis is high but not extreme. A log transform can be used to adjust the skewness.

PS: positive skewness of small magnitude and kurtosis similar to the normal distribution, so no treatment is needed.

pmi: higher kurtosis; a log transform is possible, but there is little need for special treatment.

turnover ratio: the most extreme 3% of values at each of the maximum and minimum ends are deleted when calculating the market mean. The turnover_ratio distribution has large positive skewness and can be log-transformed.

Macroeconomic leading index (leading_idx): does not follow a normal distribution, but its distribution is very similar to that of the HS300 price, so the two may have a strong linear relationship.

Total retail sales of social consumer goods (retail_sin): positive skewness of small magnitude and low kurtosis; no special treatment appears necessary.

Fixed asset investment completion (cumulative year-on-year): positive skewness and high kurtosis, but not suitable for a log transform because a large number of values lie around 0. This factor is not meaningful to use directly, and I would like to try its proportional change as a substitute.
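Two of the treatments mentioned above, shifting by the minimum before taking logs (PCF) and deleting the 3% extremes before averaging (turnover ratio), can be sketched as follows. The helper names shifted_log and trimmed_mean are hypothetical, and the use of log1p (log of 1 plus the shifted value) is my own assumption so that the minimum itself maps to 0 instead of an undefined log(0).

```python
import numpy as np


def shifted_log(values):
    """Shift the series so its minimum becomes 0, then apply log(1 + x)."""
    x = np.asarray(values, dtype=float)
    return np.log1p(x - x.min())


def trimmed_mean(values, cut=0.03):
    """Mean after dropping the lowest and highest `cut` share of the values."""
    x = np.sort(np.asarray(values, dtype=float))
    k = round(len(x) * cut)  # number of observations removed at each end
    return x[k:len(x) - k].mean()


# A series with negative values, as can happen for PCF.
pcf_like = [-2.0, 0.0, 2.0]
print(shifted_log(pcf_like))

# One extreme value dominates the plain mean but not the trimmed mean.
turnover_like = list(range(99)) + [10000.0]
print(np.mean(turnover_like), trimmed_mean(turnover_like))
```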

Figure 5: Single-factor distribution.

CPI: right-skewed with high kurtosis; suitable for a log transform.

M1: it is recommended to represent the data by its proportional change.

M2: it is recommended to represent the data by its proportional change.

Unemployment rate URUR: the data come from the Wind database and are clearly not in line with common sense. Judging by the United States and other countries, the unemployment rate normally fluctuates around 3%-8%, may exceed 10% during an economic crisis, and may exceed 20% in a major crisis such as that of 1932. The 12 years from 2005 to 2017 are a long span that includes a major financial crisis, yet the reported unemployment rate fluctuates by no more than five thousandths, so the data are seriously distorted (see Figure 5).

Total imports IMPORT: not normally distributed, but similar to the stock price distribution, so probably linearly related to the stock price.

Total exports EXPORT: same as above, and the correlation with imports may be very large.

Total imports and exports TOTAL: same as above.

Imports year-on-year IMPORTYOY: probably right-skewed, with high kurtosis; a log transform can be applied.

Exports year-on-year EXPORTYOY: left-skewed with higher kurtosis; a log or square transform can be applied. Compared with imports, few values exceed 50, which means that growth is steady with no sudden jumps.

3.4. Heat Map

A heat map is a statistical chart that displays data by coloring blocks. When plotting, the rules for color mapping need to be specified: for example, larger values may be represented by darker or warmer colors and smaller values by lighter or cooler colors. The advantage of heat maps is that they are "space efficient" and can accommodate relatively large amounts of data. Heat maps are useful not only for discovering relationships between variables and finding extreme values, but also for giving a picture of the data as a whole and facilitating comparisons between data sets. The 30 factors most important for stock prices were selected; because log and sqrt transforms of the same variables were included, the selection contains duplicates and actually covers about 10 important variables. Standardized versions of the data were also tried, but the results were poor for the highly important factors, so these standardized factors were deleted.

Figure 6: Stock factors correlation analysis.

According to the heat map in Figure 6, the company financial data are strongly correlated with one another, the macro data are strongly correlated with one another, and the correlation between the two groups is weaker. It is worth noting that CPI has a low correlation with exports, M1 and M2, and even has a negative correlation of about 20% with 'retail_sin'. However, CPI has a high positive correlation with 'dcg' (the consumer durables price index).
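The numbers behind such a heat map are a correlation matrix. The sketch below assumes pandas and Pearson correlation (the paper does not say which coefficient was used) and builds a small artificial frame; the column names come from the paper but the values are invented. Plotting is omitted so that the example needs no graphics library.

```python
import numpy as np
import pandas as pd

# Artificial data: one factor that trends with the index and one that does not.
t = np.arange(50)
df = pd.DataFrame(
    {
        "close": 3000.0 + 20.0 * t,
        "leading_idx": 100.0 + 0.5 * t + np.sin(t),
        "CPI_rate": 2.0 + 0.5 * np.cos(t),
    }
)

# Pairwise Pearson correlations; this matrix is what a heat map colors.
corr = df.corr()

# Rank the factors by the absolute size of their correlation with the index.
ranking = corr["close"].drop("close").abs().sort_values(ascending=False)
print(corr.round(3))
print(ranking.round(3))
```

Ranking by absolute correlation is one simple way to shortlist factors for the heat map; strongly negative factors are kept as well as strongly positive ones.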

4. Linear Regression Analysis

The HS300 index is jointly compiled by the Shanghai Stock Exchange and the Shenzhen Stock Exchange, and its sample covers about 60% of the market capitalization of the two markets, so it is a good representative of the market. Therefore, in addition to the exploratory analysis of HS300-related factors, this paper also investigates whether there is a linear relationship between the price of an individual stock and the HS300, using the OLS regression learned in the course.

4.1. Data Description

Table 3: Data descriptive statistics.

Statistic	HS300 (close)	Robotics (close)
Count	146.000000	146.000000
Mean	3864.704901	78.942740
Std	548.784075	16.382054
Min	3025.692000	48.280000
25%	3527.016550	70.195000
50%	3761.664300	78.475000
75%	4054.033450	85.815000
Max	5353.751000	128.000000

Table 3 gives simple statistics of the closing prices of the HS300 and of Robotics for each trading day in the sample interval. There are 146 HS300 observations, with a mean of 3864.70, a minimum of 3025.69 and a maximum of 5353.75. There are likewise 146 Robotics observations, with a mean of 78.94, a minimum of 48.28 and a maximum of 128.00.

Figure 7: HS300 index & robot company volatility distribution.

Figure 7 shows the price movements of the HS300 and Robotics. The two series follow a relatively consistent trend, while the Robotics price is more volatile than the HS300.

Figure 8: HS300 index & robot company daily returns distribution.

The histograms and density plots of the daily returns of the HS300 and Robotics in Figure 8 show that, in general, both daily return series are approximately normally distributed. Comparatively, the daily return of the robotics company is lower than that of the HS300.
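Daily returns of the kind plotted in Figure 8 can be derived from closing prices as sketched below. This assumes simple percentage returns computed with pandas; the paper does not say whether simple or log returns were used, and the prices here are invented. Note that n prices yield n - 1 returns, which is consistent with 146 closing prices in Table 3 and 145 observations in the regression of Table 4.

```python
import pandas as pd

# Invented closing prices for the index and the individual stock.
prices = pd.DataFrame(
    {
        "HS300": [3500.0, 3535.0, 3499.65, 3534.6465],
        "Robotics": [80.0, 81.6, 79.968, 81.56736],
    }
)

# Simple daily return: today's close divided by yesterday's close, minus 1.
returns = prices.pct_change().dropna()

# Standard deviation of daily returns as a basic volatility measure.
volatility = returns.std()
print(returns)
print(volatility)
```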

Figure 9: HS300 index & robot company correlation distribution.

The scatter plot in Figure 9 suggests a positive linear correlation between the HS300 and the Robotics stock price.

4.2. Regression Results

Linear regression is a data analysis technique that predicts the value of an unknown variable from another, related known variable [9]. It models the unknown (dependent) variable and the known (independent) variables with a linear equation. In essence, simple linear regression fits a straight line through two variables x and y. The independent variable x, also called the explanatory or predictor variable, is plotted along the horizontal axis. The dependent variable y, also called the response or predicted variable, is plotted on the vertical axis.
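For a single explanatory variable, the least squares line has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The sketch below implements this with NumPy only; the function name simple_ols and the sample values are hypothetical and are not taken from the paper's data.

```python
import numpy as np


def simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()

    # Slope: sum of cross deviations over sum of squared x deviations.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean

    # R-squared: share of the variation in y explained by the fitted line.
    residuals = y - (intercept + slope * x)
    r_squared = 1.0 - np.sum(residuals ** 2) / np.sum((y - y_mean) ** 2)
    return slope, intercept, r_squared


# Invented index returns and a stock that moves 1.3 times as much.
index_returns = np.array([-0.02, -0.01, 0.0, 0.01, 0.02])
stock_returns = 0.001 + 1.3 * index_returns

slope, intercept, r_squared = simple_ols(index_returns, stock_returns)
print(slope, intercept, r_squared)
```

Because the invented stock returns lie exactly on a line, the fit recovers the slope and intercept used to build them; real return data would scatter around the line and give an R-squared below 1.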

Since my own background is in business, I am familiar with the framework of investment reasoning and value investing but less so with mathematics or data engineering. After several weeks of lectures, I tried to use a simple regression model to analyze the relationship between the HS300 and an individual stock.

Table 4: OLS regression results.

Dep. Variable:	Robot Company	R-squared:	0.457
Model:	OLS	Adj. R-squared:	0.453
Method:	Least Squares	F-statistic:	120.4
Date:	Fri, 25 Nov 2022	Prob (F-statistic):	1.07e-20
Time:	11:25:55	Log-Likelihood:	262.41
No. Observations:	145	AIC:	-520.8
Df Residuals:	143	BIC:	-514.9
Df Model:	1
Covariance Type:	nonrobust
	coef	std err	t	P>|t|	[0.025	0.975]
HS300	1.3137	0.120	10.971	0.000	1.077	1.550
Intercept	0.0002	0.003	0.069	0.945	-0.006	0.007
Omnibus:	4.637	Durbin-Watson:	1.952
Prob(Omnibus):	0.098	Jarque-Bera (JB):	6.260
Skew:	-0.043	Prob(JB):	0.0437
Kurtosis:	4.014	Cond. No.	36.2

The least squares results in Table 4 show a significant positive relationship between the daily stock returns of the robotics company and the HS300 daily returns. The coefficient of determination is 0.457, indicating that the HS300 daily return explains about 46% of the variation in the robotics daily return. The p-value of the F-statistic (1.07e-20) is close to 0, so the regression as a whole is significant, and the p-value of the t-statistic for the HS300 coefficient is also close to 0, so the HS300 variable is significant. The Omnibus p-value of 0.098 is not close to 0; this statistic tests the normality of the residuals rather than the effect of the independent variable, and normality is not rejected at the 5% level, although the Jarque-Bera p-value of 0.0437 is marginally below 5%. The coefficient of the independent variable is 1.3137, indicating that the daily return of Robotics is more volatile than that of the HS300, so the individual stock is riskier and has greater potential for gains and losses: when the HS300 daily return changes by 1%, the individual stock's daily return changes by 1.3137% on average. The Durbin-Watson statistic is 1.952, close to 2, indicating no first-order serial correlation in the residuals.
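As a reading aid, the figures in Table 4 can be cross-checked against one another. In a simple regression with an intercept, the F-statistic equals the square of the slope's t-statistic, and R-squared can be recovered from F and the residual degrees of freedom. The sketch below uses only numbers copied from Table 4; small differences are expected because the table values are rounded.

```python
# Values copied from Table 4.
coef, std_err, t_reported = 1.3137, 0.120, 10.971
f_reported = 120.4
n_obs, df_model, df_resid = 145, 1, 143

# t-statistic: coefficient divided by its standard error.
t_stat = coef / std_err

# With one regressor, F is the square of the slope's t-statistic.
f_from_t = t_reported ** 2

# R-squared from F, then adjusted for the number of regressors.
r_squared = f_reported * df_model / (f_reported * df_model + df_resid)
adj_r_squared = 1.0 - (1.0 - r_squared) * (n_obs - 1) / df_resid

print(f"t from coef/std err: {t_stat:.3f}")
print(f"F from t squared:    {f_from_t:.2f}")
print(f"R-squared:           {r_squared:.3f}")
print(f"Adj. R-squared:      {adj_r_squared:.3f}")
```

The recovered R-squared and adjusted R-squared round to the 0.457 and 0.453 reported in the table, so the summary is internally consistent.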

5. Conclusion

With the globalization of capital markets, the importance of the Chinese securities market in the international capital market has become more and more prominent. Pham, T. A argues that the Asian capital market is developing rapidly and that the stock market is a very important part of the capital market. As different price indices such as the HS300 are designed, investors have more underlying instruments with which to study financial market liquidity, systemic risk and stock price volatility [10]. The research in this paper includes an exploratory analysis of macroeconomic indicators and finds that some factors, such as PB, PE, pmi, CPI, retail sales of social consumer goods and the turnover ratio, are correlated with the HS300 index. In addition, this paper runs a simple linear regression on the price data of an individual stock (a robotics company) and the closing price data of the HS300, and the results indicate a linear relationship.

References

[1]. Bu H, Pi L. Does investor sentiment predict stock returns? The evidence from Chinese stock market[J]. Journal of Systems Science and Complexity, 2014, 27(1): 130-143.

[2]. Brown D H, MacBean A. Introduction: China’s macro environment and enterprise challenges[M]//Challenges for China's Development. Routledge, 2005: 17-27.

[3]. Brasoveanu L O, Dragota V, Catarama D, et al. Correlations between capital market development and economic growth: The case of Romania[J]. Journal of applied quantitative methods, 2008, 3(1): 64-75.

[4]. Hondroyiannis G, Papapetrou E. Macroeconomic influences on the stock market[J]. Journal of economics and finance, 2001, 25(1): 33-49.

[5]. Cheng H, Jia R, Li D, et al. The rise of robots in China[J]. Journal of Economic Perspectives, 2019, 33(2): 71-88.

[6]. Wong A, Zhou X. Development of financial market and economic growth: Review of Hong Kong, China, Japan, the United States and the United Kingdom[J]. International Journal of Economics and Finance, 2011, 3(2): 111-115.

[7]. Zhang L, Fu S, Li B. Research on stock price forecast based on news sentiment analysis—A case study of alibaba[C]//International Conference on Computational Science. Springer, Cham, 2018: 429-442.

[8]. Morgenthaler S. Exploratory data analysis[J]. Wiley Interdisciplinary Reviews: Computational Statistics, 2009, 1(1): 33-44.

[9]. Montgomery D C, Peck E A, Vining G G. Introduction to linear regression analysis[M]. John Wiley & Sons, 2021.

[10]. Huy D T N, Loan B T T, Pham T A. Impact of selected factors on stock price: a case study of Vietcombank in Vietnam[J]. Entrepreneurship and Sustainability Issues, 2020, 7(4): 2715.

Cite this article

Xiang, H. (2023). Analyzing Factors of Stock Price Index and Using Simple Linear Regression Techniques. Advances in Economics, Management and Political Sciences, 20, 268-281.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 2023 International Conference on Management Research and Economic Development

ISBN: 978-1-915371-83-6 (Print) / 978-1-915371-84-3 (Online)

Editor: Canh Thien Dang, Javier Cifuentes-Faura

Conference website: https://2023.icmred.org/

Conference date: 28 April 2023

Series: Advances in Economics, Management and Political Sciences

Volume number: Vol.20

ISSN: 2754-1169 (Print) / 2754-1177 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.