Establishment and Evaluation of Machine Learning Models Based on Causal Analysis in Air Quality Forecasting: A Case Study of Guangzhou

Yun Xiang; Zhang Luyao; Liu Zijing; Zhai Zhihong; Cai Shunming; Zhu Liyuan; He Yaobin

doi:10.16032/j.issn.1004-4965.2026.011

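1. Guangdong Meteorological Data Centre, Guangzhou 510640, China
2. Guangzhou Institute of Tropical and Marine Meteorology, China Meteorological Administration, Guangzhou 510640, China
3. Guangzhou Meteorological Satellite Ground Station, Guangzhou 510630, China
4. Liwan Meteorological Bureau, Guangzhou 510150, China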

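Funding:
- Guangdong Provincial Meteorological Bureau Science and Technology Project, GRMC2023M10
- Guangdong Provincial Meteorological Bureau Science and Technology Project, GRMC2025Q42
- China Meteorological Administration Training Centre Research Project, 2025CMATCQN16
- Guangzhou Meteorological Society Science and Technology Research Project, Z202326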

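Corresponding author: Liu Zijing, engineer, works mainly on applying machine learning algorithms to satellite remote sensing data. E-mail: lzj18811152508@163.com

Chinese Library Classification (CLC) number: P437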
Publication history
- Received: 2024-07-28
- Revised: 2025-12-25
- Published online: 2026-03-14
- Issue date: 2026-02-20

Abstract: In recent years, the main air pollution problems in Guangzhou have been a gradual increase in persistent pollution episodes and a marked rise in the share of O₃ pollution. To improve air quality forecasting capability, this study used environmental monitoring and meteorological observation data and applied the Liang-Kleeman information flow to perform causal analysis on the data affecting the concentrations of CO, NO₂, O₃, PM₂.₅, PM₁₀ and SO₂, thereby screening the influencing factors. The Random Forest (RF), Extreme Gradient Boosting (XGBoost) and Long Short-Term Memory (LSTM) algorithms were then used, individually and in fusion, to build five pollutant concentration forecasting models: RF, XG, LSTM and the fusion models MIX1 and MIX2. The forecast concentrations were used to calculate the Air Quality Index (AQI) and the primary pollutant, giving five air quality forecasting models. The results show that the fusion models MIX1 and MIX2 overall outperform the single RF, XGBoost and LSTM models. For pollutant concentration forecasting, MIX1 was best for CO, NO₂ and O₃, and MIX2 was best for PM₁₀, PM₂.₅ and SO₂. For air quality forecasting, MIX2 was best at lead times of 1 to 2 days and MIX1 at lead times of 3 to 7 days. Primary pollutant forecast accuracy at lead times of 1 to 7 days was 71.26% to 83.33% for MIX1 and 73.71% to 81.11% for MIX2. The MIX1 and MIX2 forecasts of pollutant concentration and air quality at lead times of 1 to 3 days, and of the primary pollutant at lead times of 1 to 7 days, are highly credible and can serve as a reference for environmental management authorities in adopting appropriate control measures.
Key words: machine learning; causal analysis; air quality forecasting
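
The factor screening step in the abstract relies on the Liang-Kleeman information flow. The following is a minimal sketch of the widely used bivariate maximum-likelihood estimator, not the authors' implementation. The function names, the toy data and the one-step forward difference are assumptions made for illustration, and this page does not state which variant (bivariate or normalized multivariate) or which screening threshold the study used.

```python
import numpy as np


def liang_information_flow(source, target, dt=1.0):
    """Estimate the bivariate Liang information flow from `source` to `target`.

    Returns the rate in nats per unit time; a value near zero means no
    detectable causal influence under the linear-model assumption.
    """
    x1 = np.asarray(target, dtype=float)
    x2 = np.asarray(source, dtype=float)
    # Euler forward difference approximating d(target)/dt.
    dx1 = (x1[1:] - x1[:-1]) / dt
    # Sample covariances among target, source and the target tendency.
    c = np.cov(np.vstack([x1[:-1], x2[:-1], dx1]))
    c11, c12, c22 = c[0, 0], c[0, 1], c[1, 1]
    c1d1, c2d1 = c[0, 2], c[1, 2]
    return (c11 * c12 * c2d1 - c12**2 * c1d1) / (c11**2 * c22 - c11 * c12**2)


def simulate_one_way_coupling(n=20000, seed=0):
    """Toy series (hypothetical) in which x2 drives x1 with no feedback."""
    rng = np.random.default_rng(seed)
    x1 = np.zeros(n)
    x2 = np.zeros(n)
    for t in range(n - 1):
        x2[t + 1] = 0.6 * x2[t] + rng.standard_normal()
        x1[t + 1] = 0.5 * x1[t] + 0.5 * x2[t] + rng.standard_normal()
    return x1, x2


x1, x2 = simulate_one_way_coupling()
flow_2_to_1 = liang_information_flow(source=x2, target=x1)
flow_1_to_2 = liang_information_flow(source=x1, target=x2)
print(f"T(x2 -> x1) = {flow_2_to_1:.3f}, T(x1 -> x2) = {flow_1_to_2:.3f}")
```

On the toy data the flow from x2 to x1 is clearly non-zero while the reverse flow is close to zero, which is the asymmetry a screening step could exploit when ranking candidate predictors for each pollutant.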

Figure 1. Frequency of each AQI level in Guangzhou during 2014 to 2022, by year (a) and by month (b)

Figure 2. Pollution duration in Guangzhou during 2014 to 2022, by year (a) and by month (b)

Figure 3. Primary pollutants in Guangzhou during 2014 to 2022, by year (a to i) and by month (j)

Figure 4. Top 60 predictors for each air pollutant

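Figure 5. Evaluation of the models' pollutant concentration forecasts. From left to right: mean absolute error (MAE), root mean square error (RMSE) and correlation coefficient (R)

From top to bottom: CO, NO₂, O₃, PM₁₀, PM₂.₅ and SO₂.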

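Figure 6. Evaluation of the models' air quality forecasts

From top to bottom: mean absolute error (MAE), root mean square error (RMSE) and correlation coefficient (R).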
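
The three scores named in the captions of Figures 5 and 6 can be computed as in the sketch below. It assumes the standard definitions of MAE, RMSE and the Pearson correlation coefficient; the function name and the sample numbers are hypothetical and are not taken from the study.

```python
import numpy as np


def forecast_scores(observed, predicted):
    """Return MAE, RMSE and Pearson R for paired observations and forecasts."""
    obs = np.asarray(observed, dtype=float)
    pred = np.asarray(predicted, dtype=float)
    err = pred - obs
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err**2))),
        "R": float(np.corrcoef(obs, pred)[0, 1]),
    }


# Hypothetical values, purely to show the call.
scores = forecast_scores([1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 2.0, 6.0])
print(scores)
```

MAE and RMSE carry the units of the forecast quantity, and RMSE weights large misses more heavily. R is dimensionless and only measures how well the forecast tracks the variation of the observations, so a biased forecast can still score a high R.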

Figure 7. Evaluation of the models' primary pollutant forecasts
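
The abstract derives the AQI and the primary pollutant from forecast concentrations. A minimal sketch of that step, under stated assumptions, follows. The breakpoints are transcribed from memory of the HJ 633-2012 technical regulation for daily values and should be checked against the standard. The function names and sample concentrations are hypothetical. The sketch omits integer rounding, the separate 1 h ozone table used above 800 µg/m³, and any handling of concentrations beyond the top breakpoint, which `numpy.interp` simply clamps.

```python
import numpy as np

IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]

# Assumed breakpoints (HJ 633-2012, daily values); verify before real use.
# CO is in mg/m3, all others in ug/m3; O3 is the daily maximum 8 h average.
BREAKPOINTS = {
    "SO2": [0, 50, 150, 475, 800, 1600, 2100, 2620],
    "NO2": [0, 40, 80, 180, 280, 565, 750, 940],
    "PM10": [0, 50, 150, 250, 350, 420, 500, 600],
    "CO": [0, 2, 4, 14, 24, 36, 48, 60],
    "O3": [0, 100, 160, 215, 265, 800],
    "PM2.5": [0, 35, 75, 115, 150, 250, 350, 500],
}


def iaqi(pollutant, concentration):
    """Individual AQI by linear interpolation between adjacent breakpoints."""
    bp = BREAKPOINTS[pollutant]
    return float(np.interp(concentration, bp, IAQI_LEVELS[: len(bp)]))


def aqi_and_primary(concentrations):
    """Return (AQI, primary pollutants); no primary pollutant if AQI <= 50."""
    scores = {p: iaqi(p, c) for p, c in concentrations.items()}
    aqi = max(scores.values())
    primary = [p for p, s in scores.items() if s == aqi] if aqi > 50 else []
    return aqi, primary


# Hypothetical daily concentrations, purely to show the call.
sample = {"PM2.5": 55, "O3": 180, "NO2": 40, "PM10": 60, "SO2": 10, "CO": 0.8}
aqi, primary = aqi_and_primary(sample)
print(round(aqi, 1), primary)
```

Because the AQI is the maximum of the individual indices, an error in a single pollutant's concentration forecast can change both the AQI and the primary pollutant, which is one reason the two are evaluated separately.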

Supplementary Figure 1. Heat map of total information flow for the 579 predictors

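Supplementary Figure 2. Evaluation of the models' pollutant concentrations on the training set. From left to right: CO, NO₂, O₃, PM₁₀, PM₂.₅ and SO₂. From top to bottom: lead times of 1 to 7 days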

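Supplementary Figure 3. Evaluation of the models' pollutant concentrations on the test set. From left to right: CO, NO₂, O₃, PM₁₀, PM₂.₅ and SO₂. From top to bottom: lead times of 1 to 7 days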

Table 1. Better-performing model for each pollutant at lead times of 1 to 7 days

Pollutant	Day1	Day2	Day3	Day4	Day5	Day6	Day7
CO	RF	RF	RF	RF	RF	RF	RF
NO₂	XG	RF	RF	RF	RF	RF	RF
O₃	XG	RF	RF	RF	RF	RF	RF
PM₁₀	RF	RF	RF	RF	RF	RF	RF
PM₂.₅	RF	RF	RF	RF	RF	LSTM	RF
SO₂	RF	LSTM	LSTM	LSTM	LSTM	LSTM	LSTM
Note: RF denotes the random forest model, XG the XGBoost model and LSTM the LSTM model.
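
One way to use Table 1 is as a lookup that picks, for each pollutant and lead day, the forecast of the model listed there. The sketch below only illustrates that lookup. This page does not say how MIX1 and MIX2 are built or whether Table 1 feeds into them, so the function name and the sample forecasts are hypothetical.

```python
# Table 1 transcribed as lists indexed by lead day (Day1 .. Day7).
BEST_MODEL = {
    "CO": ["RF"] * 7,
    "NO2": ["XG"] + ["RF"] * 6,
    "O3": ["XG"] + ["RF"] * 6,
    "PM10": ["RF"] * 7,
    "PM2.5": ["RF"] * 5 + ["LSTM", "RF"],
    "SO2": ["RF"] + ["LSTM"] * 6,
}


def pick_forecast(pollutant, lead_day, forecasts):
    """Select the forecast of the model that Table 1 lists for this case.

    `forecasts` maps a model name ("RF", "XG", "LSTM") to its forecast value
    for the given pollutant and lead day (1 to 7).
    """
    if not 1 <= lead_day <= 7:
        raise ValueError("lead_day must be between 1 and 7")
    model = BEST_MODEL[pollutant][lead_day - 1]
    return model, forecasts[model]


# Hypothetical SO2 forecasts for lead day 3, purely to show the call.
print(pick_forecast("SO2", 3, {"RF": 8.0, "XG": 7.5, "LSTM": 9.1}))
```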
