Influential Factors on California Regional Housing Price Analysed by Multiple Linear Regression

Yanqian Cao

doi:10.54254/2754-1169/77/20241802

1. Introduction

Housing prices are among the most widely discussed social topics, since demand for residential property rose sharply over the last century as a result of rapid population growth and economic development. One notable feature of the housing market is that prices of residential properties can vary considerably even within the same city or province. Identifying the factors behind these differences is important because the results can guide future city planning and residential construction to better meet citizens’ demands.

This paper focuses on the relationship between regional median housing prices and characteristics of the region, such as population and median income. A multiple linear regression (MLR) model is constructed to predict market prices and to identify the factors that most strongly affect residential housing prices in different districts of California.

Previous research using a spatial-temporal model supports the hypothesis that real prices of residential properties are positively correlated with real incomes in a region [1-2]. The age of an apartment has also been found to be negatively correlated with its market price, in both global and local regression models [3]. Population density is positively correlated with housing prices, as found by regional regression across 285 Chinese cities [4]. In addition to the factors verified by these studies, the number of households and the total number of rooms are included as regressors in this paper, since differences in family size can lead to different valuations of a property.

R is used to construct several MLR models with different sets of regressors and to compare them, in order to select the one with the greatest explanatory power and the smallest variance inflation factor (VIF). The selection process is described in Section 3 on model development. The final model contains the most significant regressors and their coefficients, which show how a one-unit change in each factor affects the regional housing price.
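
For reference, the quantities named in this paragraph can be written out as follows. This is the standard textbook form rather than notation taken from the paper: the symbols $n$ (number of districts), $p$ (number of regressors), RSS and TSS (residual and total sums of squares), and $R_j^2$ (the $R^2$ from regressing the $j$-th regressor on the others) are assumptions introduced here.

```latex
\begin{align*}
y_i &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} + \varepsilon_i,
      \qquad i = 1, \dots, n \\
R^2_{\mathrm{adj}} &= 1 - \frac{\mathrm{RSS}/(n - p - 1)}{\mathrm{TSS}/(n - 1)} \\
\mathrm{VIF}_j &= \frac{1}{1 - R_j^2}
\end{align*}
```

A VIF of 1 means the regressor is uncorrelated with the others, and larger values indicate stronger collinearity.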

2. Exploratory Data Analysis

2.1. Data Source

The dataset used in this paper was downloaded from Kaggle and contains 20433 entries on the California housing market, with 10 variables. After all rows with null values were omitted, 500 rows were selected from the dataset and divided at random into training data and test data to develop the MLR model.
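
The sampling step described above can be sketched as follows. The paper's R code is not shown, so this is an illustrative Python version using only the standard library; the function name `sample_and_split`, the 60/40 split, and the seed are assumptions, since the paper does not state the split ratio.

```python
import random


def sample_and_split(rows, n_sample=500, train_fraction=0.6, seed=0):
    """Drop incomplete rows, draw n_sample rows at random, then split them.

    train_fraction and seed are illustrative assumptions; the paper does
    not state the split ratio it used.
    """
    complete = [r for r in rows if all(v is not None for v in r.values())]
    rng = random.Random(seed)
    # sample() returns rows in random order, so slicing gives a random split
    chosen = rng.sample(complete, n_sample)
    n_train = round(n_sample * train_fraction)
    return chosen[:n_train], chosen[n_train:]


# Tiny synthetic stand-in for the housing table (not the real data)
rows = [{"id": i, "total_bedrooms": None if i % 10 == 0 else i}
        for i in range(1000)]
train, test = sample_and_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the split reproducible, which matters when the training and test summaries are later compared.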

2.2. Description of Data

Table 1: Comparison between Train and Test Dataset for Numerical Categories

	Minimum		Maximum		Mean		Median		1st Quartile		3rd Quartile
	Train	Test	Train	Test	Train	Test	Train	Test	Train	Test	Train	Test
Median Housing Value	42500	44500	500001	500001	209145	204742	182500	182350	130925	128900	268825	256400
Population	9.0	195.0	9623.0	10450	1562.9	1421.1	1196.0	1107.0	778.2	776.8	1909.2	1747.5
Median Income	0.536	1.055	12.590	10.579	3.923	4.006	3.772	3.870	2.576	2.633	4.848	4.842
Total no. of Rooms	11	32	22128	21897	2899	2742	2231	2032	1494	1446	3326	3383
Total no. of Bedrooms	7.0	71.0	3513.0	3522.0	583.4	550.4	460.5	411.5	296.8	284.0	1909.2	1747.5
Median Housing Age	4.00	2.00	52.00	52.00	28.12	27.81	29.00	28.00	18.00	17.00	36.00	36.00
Households	3.0	7.0	3285.0	2873.0	526.1	542.8	408.0	424.0	282.0	278.2	627.5	691.5

Figure 1: Histograms and Boxplots for Important Data

Summary statistics of the numerical variables, including the minimum, maximum, median, and mean, are given in Table 1. The means and quartiles of the training and test datasets are very close. This is expected because both were drawn at random from the same original dataset, so that a model built on the training dataset can be expected to generalize to the test dataset.
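
As a minimal sketch of how the six statistics in Table 1 can be computed for one variable, the Python snippet below uses the standard `statistics` module. The helper name `summarize` and the input numbers are made up for illustration; the `"inclusive"` quantile method is chosen because it interpolates in the same way as R's default quantile definition.

```python
import statistics


def summarize(values):
    """Return the six statistics reported in Table 1 for one variable."""
    # method="inclusive" interpolates between order statistics, which
    # matches R's default quantile definition (type 7)
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
    }


# Made-up numbers standing in for one column of the training data
example = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(example)
```

In the made-up column, the single large value pulls the mean well above the median, which is the same pattern Table 1 shows for population and total rooms.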

Figure 1 presents four histograms and two boxplots for the main numerical variables, namely the response variable (median housing value) and regressors that will be used in the model. The histograms show that the median housing value and the median income are closer to normally distributed than the median housing age and population. The boxplots, however, show that median income has more large outliers than median housing age, although its overall spread is smaller. Note that median income is recorded in units of ten thousand US dollars.

2.3. Testing the Important Predictors for Correlations

Scatter plots with best-fit lines of median house value against median housing age and against median income are drawn to investigate the correlations between them. Figure 2 shows that both correlations are positive, but median income is more strongly correlated with median housing value than median housing age is, because the points in the right-hand graph lie much closer to the best-fit line. Both predictors are therefore reasonable candidates for the first model.

Figure 2: Scatter Plots of the Median House Value with respect to Median Housing Age and Median Income Respectively
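
The visual comparison of the two scatter plots can be backed by a number. The sketch below computes the sample Pearson correlation coefficient in plain Python; the function name `pearson_r` and all data values are invented for illustration and are not taken from the dataset.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Invented illustrative values, not taken from the dataset
income = [1.5, 2.0, 3.1, 4.2, 5.0, 6.3]
value = [90000, 140000, 160000, 230000, 250000, 340000]
age = [10, 35, 20, 52, 15, 30]

r_income = pearson_r(income, value)
r_age = pearson_r(age, value)
print(f"income: {r_income:.3f}, age: {r_age:.3f}")
```

A coefficient closer to 1 corresponds to points lying closer to the best-fit line, which is the criterion applied to Figure 2.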

3. Model Development

3.1. Model Construction Process and Selection

Table 2: Final Results and Indicators of the Four Models

	Training Models			Testing Model
	Model 1	Model 2	Model 3	Model 4
Median Income	46955	45309	45471	43523
Median Housing Age	2380.84	2460.26	2685.38	1969.243
Population	-50.81	-53.77		-45.739
Households	218.83	190.80	46.40	172.787
Total Rooms	-6.977
Intercept	-61205	-57233	-67456	-49033
Adjusted R-squared	0.592	0.592	0.545	0.637
AIC	8051.4	8051.3	8089.3	3094.1
Mean VIF	6.718	4.16	1.079	4.52

Using the predictors introduced above, Model 1 is fitted in R by multiple linear regression of median housing price on median housing age, population, number of households, median income, and total number of rooms. The p-value for the total number of rooms is 0.182, which exceeds the 0.05 significance level, so the null hypothesis that its coefficient is zero cannot be rejected. This predictor is therefore removed, and the same response is regressed on the remaining four predictors to give Model 2. All four predictors in Model 2 have p-values below 0.05, so all of them are significant.
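
To make the fitting step concrete, here is a minimal least-squares sketch in plain Python. It is not the paper's R code, which is not shown: the helper names `solve` and `fit_ols` and the toy data are assumptions, and the normal equations are solved directly for brevity.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(xs, y):
    """Fit y on the columns of xs plus an intercept via the normal equations.

    Returns (coefficients with intercept first, RSS, adjusted R-squared).
    """
    rows = [[1.0] + list(r) for r in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    n, p = len(y), k - 1
    adj_r2 = 1 - (rss / (n - p - 1)) / (tss / (n - 1))
    return beta, rss, adj_r2


# Toy data generated exactly from y = 2 + 3*x1 - x2 (not the housing data)
xs = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 4)]
y = [2 + 3 * x1 - x2 for x1, x2 in xs]
beta, rss, adj_r2 = fit_ols(xs, y)
print([round(b, 6) for b in beta], round(adj_r2, 6))
```

Solving the normal equations directly can be numerically fragile when regressors have very different scales, as population and median income do here, so statistical software typically uses a QR decomposition instead.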

Next, the population predictor is removed from Model 2 to construct Model 3, because population may be collinear with the number of households, and a partial F-test is conducted to compare the two models. The p-value of the test is 3.413e-10, well below the 0.05 significance level, so dropping population significantly worsens the fit and Model 2 is the better model. Stepwise selection in R also shows that Model 2 has the smallest Akaike information criterion (AIC) of all the models; therefore, Model 2 is chosen as the final model. As shown in Table 2, the mean VIF of Model 2 is below 5, which suggests that multicollinearity is not a serious problem in Model 2.
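
The two comparison criteria used in this paragraph have the following standard forms. The notation is an assumption introduced here, not the paper's: $\mathrm{RSS}_F$ and $\mathrm{RSS}_R$ are the residual sums of squares of the full model (Model 2, with $p = 4$ predictors) and the reduced model (Model 3), $q$ is the number of predictors dropped (here $q = 1$), $k$ is the number of estimated parameters, and $\hat{L}$ is the maximized likelihood.

```latex
\begin{align*}
F &= \frac{(\mathrm{RSS}_R - \mathrm{RSS}_F)/q}{\mathrm{RSS}_F/(n - p - 1)}
   \;\sim\; F_{q,\; n - p - 1} \quad \text{under } H_0 \\
\mathrm{AIC} &= 2k - 2\ln\hat{L}
\end{align*}
```

A small p-value for $F$ means the dropped predictor explains a significant share of the remaining variation. AIC values are only comparable between models fitted to the same data, and different R functions may report values that differ by an additive constant.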

3.2. Apply Final Model on Test Dataset

With Model 2 selected as the final model, it is applied to the test dataset to check whether it generalizes. Model 4 is the model obtained by fitting the regression form of Model 2 to the test dataset. As shown in Table 2, Model 4 has the highest adjusted R-squared of the four models, and its mean VIF is also below 5, so multicollinearity is not a concern in Model 4 [5].

Table 2 also shows that the coefficients of the four predictors keep the same signs in Model 2 and Model 4 and are close in magnitude. Therefore, Model 2 generalizes to the test dataset, and Model 4 can be used to draw conclusions if all model assumptions are satisfied.
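
The comparison of signs and magnitudes can be checked directly from the coefficients in Table 2. The short Python sketch below does this; the dictionary keys and the helper name `compare` are illustrative choices, while the numbers are copied from the table.

```python
# Coefficients copied from Table 2 (Model 2: training fit, Model 4: test fit)
model2 = {"median_income": 45309, "housing_age": 2460.26,
          "population": -53.77, "households": 190.80}
model4 = {"median_income": 43523, "housing_age": 1969.243,
          "population": -45.739, "households": 172.787}


def compare(train_coefs, test_coefs):
    """For each predictor, report sign agreement and relative change."""
    report = {}
    for name, b_train in train_coefs.items():
        b_test = test_coefs[name]
        same_sign = (b_train > 0) == (b_test > 0)
        rel_change = (b_test - b_train) / abs(b_train)
        report[name] = (same_sign, rel_change)
    return report


report = compare(model2, model4)
for name, (same_sign, rel_change) in report.items():
    print(f"{name}: same sign={same_sign}, relative change={rel_change:+.1%}")
```

All four signs agree, and the relative differences range from about 4% for median income to about 20% for median housing age, so how close the coefficients are depends on the predictor.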

3.3. Model Diagnostics

This section checks whether Model 2 and Model 4 satisfy the assumptions of multiple linear regression, following the method in the book “A Modern Approach to Regression with R” [6]. The four diagnostic plots are very similar for the two models.

First, as shown in Figure 3 and Figure 4, the residuals of both models follow the reference line of the normal Q-Q plot closely, so the residuals of Model 2 and Model 4 are approximately normally distributed. In the residual plots, the residuals scatter around the horizontal line with no distinct pattern, clusters, or fanning. Therefore, linearity and independence are also satisfied. In the Scale-Location plots, the points are spread evenly around a roughly horizontal line, so the constant-variance assumption is also satisfied for both models.

In the residuals versus leverage plots, Model 2 has more influential points than Model 4, but there is no reason to remove them.
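
The quantities behind these diagnostic plots can be stated as follows. These are the usual textbook definitions rather than formulas quoted from the paper: $X$ is the $n \times (p+1)$ design matrix including the intercept column, $\hat{e}_i$ is the $i$-th residual, and $s^2$ is the residual variance estimate, all of which are notation assumed here.

```latex
\begin{align*}
h_{ii} &= \left[ X (X^{\top} X)^{-1} X^{\top} \right]_{ii}
  && \text{(leverage)} \\
r_i &= \frac{\hat{e}_i}{s \sqrt{1 - h_{ii}}},
  \qquad s^2 = \frac{\mathrm{RSS}}{n - p - 1}
  && \text{(standardized residual)} \\
D_i &= \frac{r_i^2}{p + 1} \cdot \frac{h_{ii}}{1 - h_{ii}}
  && \text{(Cook's distance)}
\end{align*}
```

A point is influential when it combines high leverage with a large standardized residual, which is what a large Cook's distance captures.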

Figure 3: Plots for Model 2 Diagnosis

Figure 4: Plots for Model 4 Diagnosis

4. Conclusion

This paper aims to identify the most influential factors for regional housing prices in California using MLR. The analysis above shows that the final model produces very similar results on the training dataset and the test dataset. It is therefore reasonable to accept the interpretation of Model 2 for the relationships between the influential factors and the housing price.

Based on the results in Table 2, holding all other factors constant, a one-unit increase in median income, median housing age, households, or population is associated with a change of 45309, 2460.26, 190.80, or -53.77 units, respectively, in the regional median housing price.
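
This interpretation can be illustrated with a small worked example in Python. The coefficients and intercept are those reported for Model 2 in Table 2; the function name `predict` is an assumption, and the "district" used as input is hypothetical, built from the training-set means in Table 1.

```python
# Model 2 intercept and coefficients as reported in Table 2
INTERCEPT = -57233
COEFS = {"median_income": 45309, "housing_age": 2460.26,
         "population": -53.77, "households": 190.80}


def predict(district):
    """Predicted median housing value for one district under Model 2."""
    return INTERCEPT + sum(COEFS[k] * district[k] for k in COEFS)


# Training-set means from Table 1, used here as a hypothetical district
base = {"median_income": 3.923, "housing_age": 28.12,
        "population": 1562.9, "households": 526.1}
# Same district with median income raised by one unit, all else unchanged
richer = dict(base, median_income=base["median_income"] + 1)

effect = predict(richer) - predict(base)
print(round(predict(base)), round(effect))
```

Because the model is linear, the change in the prediction equals the median income coefficient regardless of the values of the other predictors.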

Therefore, if the government wants to lower housing prices in a region, it can build more residential properties there, which decreases the median housing age. This policy can work because supply changes faster than demand in this case, which lowers regional housing prices. With all other factors unchanged, people deciding where to purchase a residential property can compare their own income with the regional median income to make a more rational decision.

However, there remain limitations to this MLR research. First, the total area of each district is not given in this dataset, so the population density is unknown, and population and the number of households are used instead. This may lead to multicollinearity when the model is applied to other datasets, and it leaves the relationship between regional housing prices and population density ambiguous. Moreover, the dataset only contains data from California, so the results may not generalize to other areas or countries. Nevertheless, the method of analysis is transferable to investigations in other fields, such as marketing and social services, as long as the assumptions of MLR are satisfied.

References

[1]. Pace, R. Kelley, and Ronald Barry. "Sparse spatial autoregressions." Statistics & Probability Letters 33.3 (1997): 291-297. https://www.kaggle.com/datasets/camnugent/california-housing-prices

[2]. S. Sisman, A.C. Aydinoglu. "A modeling approach with geographically weighted regression methods for determining geographic variation and influencing factors in housing price: A case in Istanbul." https://doi.org/10.1016/j.landusepol.2022.106183

[3]. Sean Holly, M. Hashem Pesaran, Takashi Yamagata. "A spatio-temporal model of house prices in the USA." https://doi.org/10.1016/j.jeconom.2010.03.040

[4]. Yanchao Feng. "Examining the determinants of housing prices and the influence of the spatial–temporal interaction effect: The case of China during 2003–2016." https://doi.org/10.1016/j.cjpre.2021.04.013

[5]. Noora Shrestha, Detecting Multicollinearity in Regression Analysis, American Journal of Applied Mathematics and Statistics. 2020, 8(2), 39-42. DOI: 10.12691/ajams-8-2-1

[6]. Simon J. Sheather. A Modern Approach to Regression with R. ISBN:978-0-387-09608-7. page 151.

Cite this article

Cao, Y. (2024). Influential Factors on California Regional Housing Price Analysed by Multiple Linear Regression. Advances in Economics, Management and Political Sciences, 77, 48-53.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 3rd International Conference on Business and Policy Studies

ISBN: 978-1-83558-377-7 (Print) / 978-1-83558-378-4 (Online)

Editor: Arman Eshraghi

Conference website: https://www.confbps.org/

Conference date: 27 February 2024

Series: Advances in Economics, Management and Political Sciences

Volume number: Vol.77

ISSN: 2754-1169 (Print) / 2754-1177 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
