Application of Statistical Models in Air Quality Monitoring

Ziyi Han

doi:10.54254/2753-8818/2025.19945

1. Introduction

The effective use of statistical models is critical for analysing atmospheric pollution and informing solutions in environmental science. With growing urbanisation and industrialisation, both the complexity of ecological systems and the severity of environmental pollution have increased sharply, raising the precision demanded of statistical models. Traditional models, such as multiple linear regression and time series models, have provided a foundation for analysing air pollution data and have been effective to some extent in examining pollutant trends and their impacts on human health[1-2]. The advent of big data has introduced both opportunities and difficulties to environmental science. Sources such as satellite imagery, meteorological stations, and IoT sensors now generate vast and increasingly diverse data. A more comprehensive and accurate analysis therefore requires improvements in computational capabilities and structural frameworks[3]. Improved computational capacity and algorithm optimisation enable researchers to integrate diverse datasets, which frees models from earlier data constraints and lets them play a more significant role[4]. This paper synthesises recent research on statistical models in environmental science, examining the integration of big data and its impact on traditional methods such as regression and time series analysis[5-6].

2. Air quality monitoring

With air quality issues becoming increasingly serious, especially in urban areas, it is vital to develop air quality monitoring methods and apply them in practice in order to track air quality, predict possible pollution episodes and prepare measures to improve air quality. In recent years, air pollution, especially haze, has posed a threat to residents’ physical health and mental wellness; severe pollution can even contribute to ischemic heart disease, stroke and other serious diseases[1-2]. Moreover, poor air quality could adversely impact sustained economic growth. It is therefore essential to extract accurate information on air quality and to monitor pollution conditions so that prevention and control measures can be identified. This will not only enable environmental agencies to act in time but also mitigate health risks[4].

3. Statistical models

3.1. Time series forecasting models

1) ARIMA (Auto-Regressive Integrated Moving Average)

The ARIMA model, introduced by Box and Jenkins, combines three components, Autoregressive (AR), Integrated (I) and Moving Average (MA), and is used to predict values of a response time series. The model is written ARIMA(p, d, q), where p, d and q denote the autoregressive order, the degree of differencing and the moving average order, respectively[7].

When ARIMA is used for prediction, the non-stationary series is first made stationary by differencing. A regression model is then fitted in which the current value depends on its own lagged values and on lagged values of the random error. Let \( γ \) denote the original series and \( Y \) the differenced series; the prediction for \( Y \) can be written as:

\( {\hat{Y}_{t}}=c+{φ_{1}}{Y_{t-1}}+…+{φ_{p}}{Y_{t-p}}-{θ_{1}}{e_{t-1}}-…-{θ_{q}}{e_{t-q}} \) (1)

where \( c \) is a constant, \( {φ_{1}}{Y_{t-1}}, …, {φ_{p}}{Y_{t-p}} \) are the AR terms and \( {θ_{1}}{e_{t-1}},…, {θ_{q}}{e_{t-q}} \) are the MA terms[4].
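
The recursion in equation (1) can be illustrated with a minimal pure-Python sketch. It assumes the coefficients are already known (estimating them is a separate step), uses a single round of differencing, and treats unavailable pre-sample values and errors as zero. The function names `difference`, `arma_one_step` and `arima_forecast` are hypothetical and are not taken from the cited works.

```python
def difference(series, d=1):
    """Apply d rounds of first differencing to a list of numbers."""
    out = list(series)
    for _ in range(d):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def arma_one_step(y, c, phi, theta):
    """One-step-ahead forecast of a stationary series y using equation (1).

    The errors e_t are rebuilt recursively from the same equation; lags that
    fall before the start of the sample are skipped, i.e. treated as zero
    (an assumption of this sketch).
    """
    p, q = len(phi), len(theta)
    errors = []
    for t in range(len(y) + 1):
        ar = sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * errors[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        forecast = c + ar - ma
        if t == len(y):
            return forecast          # forecast for the first unseen step
        errors.append(y[t] - forecast)


def arima_forecast(series, c, phi, theta):
    """ARIMA(p, 1, q): difference once, forecast, then undo the differencing."""
    y = difference(series, d=1)
    return series[-1] + arma_one_step(y, c, phi, theta)
```

For example, `arima_forecast([10, 12, 15, 19], 0.0, [1.0], [])` forecasts the next difference as the last observed difference and adds it back to the last level of the series.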

2) Prophet model

The Prophet model is a decomposable time series forecasting model fitted in a Bayesian framework. It tolerates missing data, outliers and irregular sampling, handles abrupt shifts in trend and can model multiple seasonal patterns simultaneously.

The core formula is: \( {y_{t}}={g_{t}}+{s_{t}}+{h_{t}}+{ε_{t}} \) (Additive Form) or \( {y_{t}}={g_{t}}∙{s_{t}}∙{h_{t}}+{ε_{t}} \) (Multiplicative Form).

where: \( {g_{t}} \) is the trend component, \( {s_{t}} \) is the seasonality component, \( {h_{t}} \) is the external variables component and \( {ε_{t}} \) is the error term[2].

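The two forms are closely related rather than identical: taking logarithms of the multiplicative structure \( {g_{t}}∙{s_{t}}∙{h_{t}} \) turns the product into a sum of log-components, so a multiplicative model can be handled as an additive one on the log scale. This provides flexibility to match the data’s behaviour, for example when seasonal swings grow with the overall pollution level.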
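
The link between the two forms can be checked numerically. The sketch below ignores the error term and uses made-up component values; `combine_additive` and `combine_multiplicative` are hypothetical helper names, not part of the Prophet library.

```python
import math


def combine_additive(g, s, h):
    """Additive form without the error term: y_t = g_t + s_t + h_t."""
    return [gt + st + ht for gt, st, ht in zip(g, s, h)]


def combine_multiplicative(g, s, h):
    """Multiplicative form without the error term: y_t = g_t * s_t * h_t."""
    return [gt * st * ht for gt, st, ht in zip(g, s, h)]


# Illustrative (made-up) components for three time steps.
trend = [40.0, 50.0, 60.0]
season = [1.2, 0.8, 1.2]
holiday = [1.0, 1.5, 1.0]

y_mult = combine_multiplicative(trend, season, holiday)

# On the log scale the same model is additive in the log-components.
y_log = combine_additive(
    [math.log(v) for v in trend],
    [math.log(v) for v in season],
    [math.log(v) for v in holiday],
)
```

Here `math.log` of each entry of `y_mult` agrees with the matching entry of `y_log` up to floating-point rounding, which is the sense in which the two forms can be converted into one another.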

3) LSTM (Long Short-Term Memory)

LSTM is a recurrent neural network (RNN) architecture that is widely used to predict trends in pollution levels, weather and similar quantities. In air quality forecasting, LSTM can capture temporal dependencies in pollutants such as PM2.5. Compared with a plain RNN, it uses three “gates” (forget, input and output) to control how information is kept, added and emitted. The core formulas are as follows:

Forget Gate:

\( {f_{t}}=σ({W_{f}}[{h_{t-1}}, {x_{t}}]+{b_{f}}) \) (2)

where \( {f_{t}} \) represents the forget gate activation vector, \( {W_{f}} \) represents the forget gate weight matrix, \( [{h_{t-1}}, {x_{t}}] \) represents the concatenation of the previous hidden state and the current input vector, \( {b_{f}} \) represents the forget gate bias vector, and \( σ \) is the sigmoid activation function, \( σ(z)=\frac{1}{1+{e^{-z}}} \).

Input Gate:

\( {i_{t}}= σ({W_{i}}[{h_{t-1}}, {x_{t}}]+{b_{i}}) \)

\( \widetilde{{C_{t}}}=tanh({W_{C}}[{h_{t-1}}, {x_{t}}]+{b_{C}}) \) (3)

where \( {i_{t}} \) represents the input gate activation vector, \( \widetilde{{C_{t}}} \) represents the candidate cell state values; \( {W_{i}} \) and \( {W_{C}} \) represent the weight matrices for input gate and cell state update, respectively; \( {b_{i}} \) and \( {b_{C}} \) are bias vectors.

Cell State Update:

\( {C_{t}}={f_{t}}⨀{C_{t-1}}+{i_{t}}⨀\widetilde{{C_{t}}} \) (4)

where \( ⨀ \) represents element-wise multiplication.

Output Gate:

\( {o_{t}}= σ({W_{o}}[{h_{t-1}}, {x_{t}}]+{b_{o}}) \)

\( {h_{t}}={o_{t}}⨀tanh⁡({C_{t}}) \)

\( σ(x)=\frac{1}{1+{e^{-x}}} \)

\( tanh{(x)}=\frac{{e^{x}}-{e^{-x}}}{{e^{x}}+{e^{-x}}} \) (5)

where \( {o_{t}} \) represents the output gate activation vector and \( {h_{t}} \) represents the current hidden state (also the output of the LSTM at time \( t \) )[4,8].
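
A single LSTM time step following equations (2) to (5) can be sketched in plain Python. This is a toy illustration rather than a trainable implementation: vectors are lists, weight matrices are lists of rows, and the `params` dictionary layout and the function names are assumptions made for this example.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """Compute (h_t, C_t) from the input x_t and the previous state."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], z, params["b_f"])]          # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], z, params["b_i"])]          # input gate
    c_tilde = [math.tanh(a) for a in affine(params["W_C"], z, params["b_C"])]  # candidate state
    o = [sigmoid(a) for a in affine(params["W_o"], z, params["b_o"])]          # output gate
    # Cell state update: element-wise forget plus element-wise input.
    c_t = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5 and the candidate state is zero, so the cell state is simply halved at each step; this gives a quick sanity check of the update rule.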

4) Same-period prediction models

This model is suited to short-term prediction: it regresses the current concentration on the values observed on the preceding days. The core formula is as follows:

\( {Y_{t}}={a_{0}}+{a_{1}}×{Y_{t-1}}+{a_{2}}×{Y_{t-2}}+{a_{3}}×{Y_{t-3}} \) (6)

where \( {a_{0}}, {a_{1}}, {a_{2}}, {a_{3}} \) represent regression coefficients and \( {Y_{t-n}} \) represents the pollutant concentration n days before day t (n = 1, 2, 3)[9].
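
Equation (6) can be evaluated directly once the coefficients are available. The sketch below shows how the lagged values are lined up and how a next-day value is computed; the coefficient values in the usage note are made up, and `lagged_rows` and `predict_next` are hypothetical helper names.

```python
def lagged_rows(series, n_lags=3):
    """Pair each value Y_t with its predecessors [Y_{t-1}, ..., Y_{t-n_lags}]."""
    rows = []
    for t in range(n_lags, len(series)):
        lags = [series[t - k] for k in range(1, n_lags + 1)]
        rows.append((lags, series[t]))
    return rows


def predict_next(series, coeffs):
    """Evaluate a_0 + a_1*Y_{t-1} + a_2*Y_{t-2} + ... for the day after the series ends."""
    a0, lag_coeffs = coeffs[0], coeffs[1:]
    recent = series[::-1][:len(lag_coeffs)]  # most recent observation first
    return a0 + sum(a * y for a, y in zip(lag_coeffs, recent))
```

For instance, `predict_next([50, 60, 70, 80], [5.0, 0.5, 0.25, 0.25])` weights the last three observations (80, 70, 60) by 0.5, 0.25 and 0.25 and adds the intercept 5.0. The pairs returned by `lagged_rows` are the rows a least-squares fit of the coefficients would use.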

3.2. Linear regression models

1) Linear regression

Linear regression is used to model the relationship between a response variable and one or more explanatory variables. For instance, it can be used to quantify the association between an air quality indicator (the dependent variable) and influencing factors such as temperature, wind speed, humidity, or emission levels (independent variables). The core formulas are as follows:

\( E[Y|{X_{1}}={x_{1}}, {X_{2}}={x_{2}}, …, {X_{K}}={x_{K}}]=ϕ({x_{1}}, {x_{2}}, …,{x_{K}}) \) (7)

where \( Y \) represents the response variable and \( {X_{1}}, {X_{2}},…, {X_{K}} \) represent the explanatory variables.

\( ϕ({x_{1}}, {x_{2}}, …,{x_{K}})={β_{0}}+{β_{1}}{x_{1}}+…+{β_{K}}{x_{K}} \) (8)

which is linear in parameters \( {β_{j}} \) [10].

2) Multiple linear regression (multivariate)

Multiple linear regression extends linear regression to multiple predictors. For n mutually independent observations \( ({x_{i}}, {y_{i}}) \), \( i=1,…,n \):

\( Y=Xβ+ε \)

\( {y_{i}}={β_{0}}+ {β_{1}}{x_{i1}}+ …+{β_{p-1}}{x_{i, p-1}}+{ε_{i}} \) (9)

The first line is the matrix form, in which the vector \( Y={({y_{1}},…,{y_{n}})^{T}} \) is the dependent variable; the design matrix \( X∈{R^{n×p}} \), written column-wise as \( X=(1, {x_{1}},…, {x_{p-1}}) \), holds the independent variables; \( β={({β_{0}}, {β_{1}}, …,{β_{p-1}})^{T}} \), where \( {β_{0}} \) represents the intercept and \( {β_{i}} \) is the coefficient of the \( i \)-th predictor variable; and the error terms are \( ε={({ε_{1}},{ε_{2}},…,{ε_{n}})^{T}} \) with each \( {ε_{i}}∼N(0, {σ^{2}}) \) [3].
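
One common way to estimate \( β \) in equation (9) is ordinary least squares, which solves the normal equations \( {X^{T}}Xβ={X^{T}}Y \). The source does not state an estimation method, so the following is only a sketch under that assumption, written in plain Python with hypothetical helper names `solve` and `ols`. It assumes every row of X starts with 1 for the intercept and that \( {X^{T}}X \) is invertible.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares coefficients from the normal equations (X^T X) beta = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    Xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    return solve(XtX, Xty)
```

As a check, data generated exactly from \( y=2+3{x_{1}}-{x_{2}} \) should return coefficients close to (2, 3, -1). For large or ill-conditioned problems a QR-based solver from a numerical library would be the safer choice.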

3) Generalized Linear Models (GLM)

Generalized linear models extend linear regression to responses with non-normal distributions (e.g., Poisson, Binomial) and are therefore useful when the dependent variable \( Y \) follows a discrete distribution. The core formulas are as follows:

Suppose that \( E(Y)=μ={(E({Y_{1}}), …, E({Y_{n}}))^{T}}={({μ_{1}}, …, {μ_{n}})^{T}} \). Then

\( g(μ)=Xβ \) (10)

where \( g \) is the link function, \( X=(1, {x_{1}}, …, {x_{p-1}}) \), \( {x_{i}} \) represents the \( i \)-th factor that influences \( Y \), \( β={({β_{0}}, {β_{1}}, …,{β_{p-1}})^{T}} \), \( {β_{0}} \) represents the intercept and \( {β_{i}} \) is the coefficient of the \( i \)-th predictor variable[3].
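
The role of the link function \( g \) in equation (10) can be made concrete with a small sketch. The three links below (identity, log, logit) are standard textbook choices and are not taken from the source; the `LINKS` table and `glm_mean` are hypothetical names for this example.

```python
import math

# Each entry maps a link name to (g, g_inverse).
LINKS = {
    "identity": (lambda mu: mu, lambda eta: eta),
    "log": (math.log, math.exp),
    "logit": (lambda mu: math.log(mu / (1.0 - mu)),
              lambda eta: 1.0 / (1.0 + math.exp(-eta))),
}


def glm_mean(x_row, beta, link="log"):
    """Return mu = g^{-1}(x^T beta) for one observation.

    x_row is expected to start with 1 so that beta[0] acts as the intercept.
    """
    eta = sum(b * x for b, x in zip(beta, x_row))  # linear predictor
    _, inverse = LINKS[link]
    return inverse(eta)
```

With the log link the predicted mean is always positive, which suits count data; with the logit link it stays between 0 and 1, which suits proportions.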

4) Poisson regression

The Poisson regression model can be written as

\( {Y_{i}}=E({Y_{i}})+{ε_{i}}, i=1,…,n \) (11)

Suppose that \( E(Y)=μ={(E({Y_{1}}), …, E({Y_{n}}))^{T}}={({μ_{1}}, …, {μ_{n}})^{T}} \), and that the mean response is related to the predictors through a function \( {μ_{i}}=μ({x_{i}}, β) \). Commonly used forms are:

\( (i) {μ_{i}}=μ({x_{i}}, β)=x_{i}^{T}β \)

\( (ii) {μ_{i}}=μ({x_{i}}, β)=exp(x_{i}^{T}β) \)

\( (iii) {μ_{i}}=μ({x_{i}}, β)=log_{e}(x_{i}^{T}β) \)

Taking form (ii), the model becomes

\( {Y_{i}}=exp(x_{i}^{T}β)+{ε_{i}} \) , \( i=1,…,n \)

\( E({Y_{i}})={μ_{i}}=exp(x_{i}^{T}β) \) (12)

in which \( {μ_{i}}≥0 \) [3].
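
For the exponential mean function \( {μ_{i}}=exp(x_{i}^{T}β) \), the coefficients are usually obtained by maximum likelihood. The sketch below uses Newton-Raphson (equivalently Fisher scoring for this model) with an intercept and a single predictor. It is an illustration under those assumptions, not the robust estimator of the cited study, and `fit_poisson` is a hypothetical name.

```python
import math


def fit_poisson(x, y, iterations=25):
    """Fit log(mu_i) = b0 + b1 * x_i by Newton-Raphson on the Poisson likelihood."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector X^T (y - mu).
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix X^T diag(mu) X, a symmetric 2x2 matrix.
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton step: beta += information^{-1} * score (2x2 inverse written out).
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (i00 * u1 - i01 * u0) / det
    return b0, b1
```

A simple check is to pass responses that equal \( exp(0.5+0.3x) \) exactly: the score is then zero at (0.5, 0.3), so the iteration should recover those values. With real count data the fitted means satisfy the property that their sum equals the sum of the observed counts.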

5) Robust regression with mean shift penalization

\( g(μ)=Xβ+γ \)

\( μ=E(Y)={({μ_{1}}, …, {μ_{n}})^{T}} \)

\( γ={({γ_{1}}, …, {γ_{n}})^{T}} \) (13)

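in which \( {γ_{i}} \) represents the mean shift parameter of the \( i \)-th observed value; a nonzero \( {γ_{i}} \) indicates that the observation deviates from the bulk of the data and is treated as an outlier[3].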
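
The idea of penalising the mean shift vector \( γ \) can be illustrated in the simplest possible setting. The sketch below is an assumption-laden toy: it uses an identity link, an intercept-only model and an L1 penalty with a hand-picked threshold `lam`, none of which are specified in the source, and it is not the robust Poisson procedure of the cited study. The names `soft_threshold` and `mean_shift_location` are hypothetical.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam (the update implied by an L1 penalty)."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


def mean_shift_location(y, lam, iterations=50):
    """Alternate between the location b0 and the mean shifts gamma_i.

    Minimises 0.5 * sum((y_i - b0 - gamma_i)^2) + lam * sum(|gamma_i|).
    """
    gamma = [0.0] * len(y)
    b0 = 0.0
    for _ in range(iterations):
        # Given gamma, the best b0 is the mean of the shift-corrected data.
        b0 = sum(yi - gi for yi, gi in zip(y, gamma)) / len(y)
        # Given b0, each gamma_i is the soft-thresholded residual.
        gamma = [soft_threshold(yi - b0, lam) for yi in y]
    return b0, gamma
```

For `y = [10, 11, 9, 10, 50]` and `lam = 5`, only the last observation receives a nonzero shift, and the location estimate settles near 11.25 instead of the plain mean of 18. The remaining gap to the inlier mean of 10 is the shrinkage bias of the L1 penalty.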

3.3. Machine learning models and Grey Relational Analysis (GRA)

There are many effective machine learning models that can be used in air quality investigation, such as SVR (Support Vector Regression), XGBoost (Extreme Gradient Boosting), ELM (Extreme Learning Machine), Cluster Regression Models and BP Neural Networks[2,4,6,9].

GRA measures the degree of association among data sequences and can be employed for feature selection or for assessing variable significance[1,6].
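
A grey relational grade can be computed with a few lines of code. The sketch below follows the widely used formulation with a distinguishing coefficient of 0.5, which is an assumption here since the source gives no formula. It also assumes the sequences have already been normalised to a comparable scale; `grey_relational_grades` is a hypothetical name.

```python
def grey_relational_grades(reference, candidates, rho=0.5):
    """Return one grey relational grade per candidate sequence.

    reference: the sequence to explain (e.g. an air quality indicator).
    candidates: sequences of possible influencing factors.
    rho: distinguishing coefficient, commonly set to 0.5.
    """
    deltas = [[abs(r - c) for r, c in zip(reference, seq)] for seq in candidates]
    d_min = min(min(row) for row in deltas)
    d_max = max(max(row) for row in deltas)
    if d_max == 0:
        return [1.0] * len(candidates)  # every candidate matches the reference exactly
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))  # grade = mean relational coefficient
    return grades
```

A grade close to 1 means the candidate tracks the reference closely, so ranking factors by grade gives a simple screening step before fitting a predictive model.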

4. Applications and advances in statistical models

4.1. Regression analysis in air quality research

Linear regression models are frequently applied to explore connections between environmental conditions and indicators of air quality, including PM2.5 levels and the AQI. For example, studies have applied regression methods to investigate how socio-economic and weather-related variables affect air quality across Chinese cities, illustrating the utility of classical regression analysis in detecting pollution patterns[2,11]. One of these studies found that including covariates (e.g., temperature or pollution shock variables) improved predictive accuracy when forecasting the AQI with models such as ARIMA[2]. More recently, Poisson regression models have been formulated to assess the impacts of pollutant concentrations and climatic variables on urban air quality, demonstrating the versatility of regression techniques in handling both skewed and discrete data[3].

4.2. Time series models for air quality forecasting

Time series analysis is crucial in forecasting air pollution patterns, particularly for short-term predictions. In recent years, China’s rapid development has led to significant environmental challenges, especially air pollution, which has profoundly affected residents' health and living standards. Against this background, Zhang et al. applied mathematical models and conventional time series methods, including ARIMA and seasonal decomposition, to air pollution data. Their findings indicated that the selected time series models were effective in predicting daily variations in the AQI, as demonstrated for Beijing's air quality[2]. Motivated by the rising severity of air pollution caused by rapid industrialisation in China, another study by Li et al. turned to newer mathematical models because existing predictive systems such as WRF-CMAQ had produced unsatisfactory results. With the growth of available data, advanced techniques such as Long Short-Term Memory (LSTM) neural networks have emerged, improving accuracy by identifying nonlinear patterns within time series data. Comparisons between ARIMA and LSTM models suggest that, although neural networks are more computationally intensive, they offer superior precision on complex datasets[12].
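
Comparisons of forecasting models such as ARIMA and LSTM are typically expressed through error metrics computed on held-out data. The sketch below shows two common choices, RMSE and MAE; the source does not say which metrics the cited studies used, so these are illustrative assumptions.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: penalises large misses more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error: average size of the miss in the original units."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

Given forecasts from two models for the same days, the model with the lower RMSE or MAE on those days is the more accurate one under that metric.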

4.3. Grey models and hybrid approaches

With rapid economic growth in China, air pollution has become a serious environmental issue that also affects public health. Traditional models, however, have failed to deal effectively with noise and complex patterns in AQI data, creating a need for alternative and hybrid models. Grey models have been applied to study pollution in cities where data availability is limited, utilising interval-based relationships to estimate pollutant levels[1]. Hybrid models that integrate grey correlation with neural networks or other machine learning techniques are becoming increasingly popular for their flexibility in managing both small and extensive datasets[4]. These hybrid methods combine the simplicity of traditional models with the adaptability of machine learning, resulting in more dependable predictive outcomes[5].

5. Challenges and improvements related to big data

Traditional models, such as regression analysis, were not originally designed for the extensive datasets produced by modern environmental sensors. As datasets grow, computational constraints become a barrier, especially for methods like Poisson regression, which may become computationally intensive with high-dimensional data. Machine learning methods optimised for big data, such as neural networks, present potential solutions but require considerable computational resources[12].

Extensive datasets often comprise heterogeneous information, integrating both structured and unstructured data from sources like satellites and sensors. Integrating these diverse data types into statistical models presents challenges, as each type demands specific preprocessing methods[9,13]. To address this, researchers are increasingly employing data fusion and transformation techniques to harmonise the data, a step essential for improving model accuracy.
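
A minimal sketch of such harmonisation is given below. It assumes two sources that report `(ISO timestamp, value)` pairs at different frequencies; the readings are aggregated to daily means, joined on shared dates, and rescaled to a common range. All function names and the data layout are hypothetical.

```python
from collections import defaultdict


def daily_means(readings):
    """Aggregate (iso_timestamp, value) pairs to a {date: mean value} mapping."""
    buckets = defaultdict(list)
    for timestamp, value in readings:
        buckets[timestamp[:10]].append(value)  # first 10 characters are YYYY-MM-DD
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}


def fuse(source_a, source_b):
    """Join two daily series on the dates they have in common."""
    shared = sorted(set(source_a) & set(source_b))
    return [(day, source_a[day], source_b[day]) for day in shared]


def min_max_scale(values):
    """Rescale values to [0, 1] so that different sources become comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In practice the aggregation window, the join rule and the scaling method all depend on the sensors involved, so each step here stands in for a choice that has to be validated against the data.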

A further challenge when utilising big data in statistical modeling is ensuring that models remain generalisable across different regions and time periods. Models specifically designed for certain locations, such as time series models for cities like Beijing or Chongqing, may be less effective elsewhere due to unique environmental factors and sources of pollution[5-6]. Therefore, big data analytics should integrate adaptive methods that allow these models to generalise effectively across diverse datasets.
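
One simple way to probe this kind of generalisation is to hold out all data from one region at a time and evaluate a model trained on the rest. The sketch below only builds the splits; it assumes records of the form `(region, features, target)`, and `leave_one_region_out` is a hypothetical name.

```python
def leave_one_region_out(records):
    """Yield (held_out_region, train, test) splits from (region, features, target) records."""
    regions = sorted({record[0] for record in records})
    for region in regions:
        train = [r for r in records if r[0] != region]
        test = [r for r in records if r[0] == region]
        yield region, train, test
```

A model whose error on the held-out region is much larger than its error on the training regions is likely relying on location-specific patterns.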

6. Conclusion

The utilization of big data has significantly enhanced the efficacy of statistical models in environmental science, particularly in forecasting air quality trends. Conventional methods, such as regression and time series analysis, have evolved through the integration of hybrid methodologies and machine learning, leading to enhanced accuracy and predictive efficacy. Nevertheless, the adoption of big data introduces numerous challenges, such as the complexity of diverse data types, substantial computational requirements, and difficulties in achieving model generalisation across various regions and environmental contexts. To propel statistical modeling forward in ecological science, it is essential to develop models that are not only computationally optimised but also adept at integrating multiple data types. By addressing these issues with innovative data processing and adaptable modeling techniques, researchers and policymakers can harness big data more effectively, facilitating the creation of targeted, data-driven strategies for environmental management across different regions. The essay briefly demonstrates the potential of statistical models in air quality monitoring but might fall short in addressing critical challenges. Thus, future research will focus on computational solutions for big data, explore robust data integration techniques, improve model generalization strategies, and address data quality and validation comprehensively.

References

[1]. Zhang, W., Yang, W.S., Bai, Q., et al. (2022). Study the Correlation Between Motor Vehicle Ownership and Urban Air Quality in Xi’an Based on Descriptive Statistics and the Grey Correlation Model. Transport Research, 8(3), 111-119.

[2]. Zhang, N. (2023). Analysis and Prediction of China’s Air Quality Index Based on Statistical Models. Shanghai University of Finance and Economics.

[3]. Lu, G. (2021). Study on Factors Affecting Air Quality in Chinese Cities Using Robust Poisson Regression. Environmental Studies Journal, 3, 56-67.

[4]. Wang, F. (2022). Air Quality Prediction Using ELM and Multi-objective Grey Wolf Optimization Algorithm. Environmental Forecast Journal, 12, 88-96.

[5]. Zou, J. (2022). Prediction of Chongqing Air Quality Index Using Combined Weight Models. Southwest University, Master’s Thesis.

[6]. Zhao, C. (2023). Application of Mathematical Modeling in Roadside Air Quality Analysis. Chinese Science and Technology Information, 9, 60-63.

[7]. Araghinejad, S. (2014). Time Series Modeling (pp. 85-137). Springer Netherlands.

[8]. Siami-Namini, S., Tavakoli, N. and Namin, A.S. (2018). A Comparison of ARIMA and LSTM in Forecasting Time Series. In 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA) (pp. 1394-1401). IEEE.

[9]. Liu, M. (2014). Establishment and Application of Winter Environmental Air Quality Prediction Model in Shenyang. Environmental Monitoring in China, 30(4), 10-16.

[10]. Seber, G.A. and Lee, A.J. (2012). Linear Regression Analysis. John Wiley & Sons.

[11]. Lin, B. (2010). Multiple Linear Regression Analysis and Its Applications. China Science and Technology Information, 9, 60-63.

[12]. Li, G., Qiu, Z., Miao, J., et al. (2023). LSTM-based Air Quality Prediction Model. Journal of Southwest Minzu University, 49(1), 67-75.

[13]. Ye, S.Q., Huang, S.Y., Chen, D.H., et al. (2017). Application of Statistical Models in Urban Air Quality Forecasting. Guangdong Environmental Monitoring Center, 510308.

Cite this article

Han, Z. (2025). Application of Statistical Models in Air Quality Monitoring. Theoretical and Natural Science, 79, 67-73.

Data availability

The datasets used and/or analyzed during the current study will be available from the authors upon reasonable request.

About volume

Volume title: Proceedings of the 4th International Conference on Computing Innovation and Applied Physics

ISBN: 978-1-83558-897-0 (Print) / 978-1-83558-898-7 (Online)

Editor: Ömer Burak İSTANBULLU, Marwan Omar, Anil Fernando

Conference website: https://2025.confciap.org/

Conference date: 17 January 2025

Series: Theoretical and Natural Science

Volume number: Vol.79

ISSN: 2753-8818 (Print) / 2753-8826 (Online)

© 2024 by the author(s). Licensee EWA Publishing, Oxford, UK. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.