Classification Forecast Method of Coastal Low Visibility Weather Based on Ensemble Learning
-
Abstract: A classification forecast model for low visibility weather was built with the Light Gradient Boosting Machine (LightGBM) algorithm, an ensemble learning method, using fused observations and fine-grid model forecast products from the European Centre for Medium-Range Weather Forecasts (ECMWF) for the coastal area of Zhangzhou, Fujian, from March 2020 to July 2021. To cope with the extreme class imbalance of the samples, Bootstrap Aggregating (Bagging) was applied during model building and the Area Under Curve (AUC) score was used for verification. Four experimental schemes were defined according to whether new feature construction and model fusion were included, and a Logistic Regression (LR) scheme served as the benchmark. The results show that: (1) Among all features, the 2 m dew point is the most important for judging the occurrence and development of low visibility weather, followed by the temperature difference between 2 m and 1 000 hPa. (2) All modelling schemes improve on the raw forecast of the numerical model. The LightGBM model performs better overall than the LR model: the two have similar hit rates, but the LightGBM model has a markedly lower false alarm rate. (3) New feature construction and model fusion further improve prediction performance, and the scheme that includes both performs best on the test set. Of the two, new feature construction yields the larger gain.
-
Key words:
- low visibility
- classification forecast
- LightGBM
- LoRa
- AUC
-
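The abstract names two remedies for class imbalance: Bagging during model building and AUC for verification. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. It assumes that each bag keeps every positive sample and draws an equal number of negatives with replacement (the abstract does not say how the bags are built), and that the bag models are combined by averaging their scores. The names `balanced_bags`, `average_scores` and `auc` are hypothetical, and the data are a toy example.

```python
import random


def balanced_bags(labels, n_bags, seed=0):
    """Build index lists for bagging on imbalanced data.

    Assumption: each bag holds all positives plus an equal number of
    negatives drawn with replacement (a bootstrap sample of negatives).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return [pos + [rng.choice(neg) for _ in pos] for _ in range(n_bags)]


def average_scores(score_lists):
    """Aggregate the bag models by averaging their per-sample scores."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. The pairwise loop is quadratic, so this is
    only suitable for small illustrative data sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 2 positives among 200 samples (1 % positive class).
labels = [1] * 2 + [0] * 198

# A forecast that never predicts low visibility is right 99 % of the time,
# yet its AUC shows that it cannot rank positives above negatives.
always_negative = [0.0] * len(labels)
accuracy = sum(y == 0 for y in labels) / len(labels)
print("accuracy:", accuracy, "AUC:", auc(labels, always_negative))

bags = balanced_bags(labels, n_bags=3)
print("bag sizes:", [len(bag) for bag in bags])
```

Each bag would be used to train one classifier, and `average_scores` would combine their predicted probabilities. The toy forecast illustrates why accuracy alone is a poor score when positives are as rare as in Table 1.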
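Table 1 Distribution of positive and negative samples in 2020 and 2021

| Year | Positive samples | Negative samples | Positive-to-negative ratio |
|---|---|---|---|
| 2020 | 846 | 174 750 | 0.48% |
| 2021 | 146 | 6 582 | 2.22% |

Table 2 Verification metrics of the LoRa detection data (visibility classification threshold 1 000 m, neighbourhood radius 2 000 m)

| TS score | Miss rate | False alarm rate | Bias | Accuracy |
|---|---|---|---|---|
| 0.5949 | 0.3045 | 0.2514 | 1.1370 | 0.9396 |

Table 3 Confusion matrix of the ALL-MIX scheme on the test set

| Forecast | Observed V ≥ 500 m | Observed V < 500 m |
|---|---|---|
| Forecast V ≥ 500 m | 6 356 | 2 |
| Forecast V < 500 m | 226 | 144 |

Table 4 Confusion matrix of the ALL-LR scheme on the test set

| Forecast | Observed V ≥ 500 m | Observed V < 500 m |
|---|---|---|
| Forecast V ≥ 500 m | 4 671 | 4 |
| Forecast V < 500 m | 1 911 | 142 |
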
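
The counts in Tables 3 and 4 can be turned into categorical verification scores. The sketch below uses the common contingency-table definitions (hit rate = hits / (hits + misses), false alarm ratio = false alarms / (hits + false alarms), threat score = hits / (hits + false alarms + misses)). These definitions are an assumption, since this section does not state the formulas used in the paper, and the class name `Contingency` is hypothetical. Low visibility (V < 500 m) is treated as the positive event.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contingency:
    """2x2 contingency table for a yes/no forecast of V < 500 m."""
    hits: int               # forecast V < 500 m, observed V < 500 m
    false_alarms: int       # forecast V < 500 m, observed V >= 500 m
    misses: int             # forecast V >= 500 m, observed V < 500 m
    correct_negatives: int  # forecast V >= 500 m, observed V >= 500 m

    @property
    def hit_rate(self) -> float:
        return self.hits / (self.hits + self.misses)

    @property
    def false_alarm_ratio(self) -> float:
        return self.false_alarms / (self.hits + self.false_alarms)

    @property
    def threat_score(self) -> float:
        return self.hits / (self.hits + self.false_alarms + self.misses)

    @property
    def accuracy(self) -> float:
        total = (self.hits + self.false_alarms
                 + self.misses + self.correct_negatives)
        return (self.hits + self.correct_negatives) / total


# Counts copied from Table 3 (ALL-MIX) and Table 4 (ALL-LR).
ALL_MIX = Contingency(hits=144, false_alarms=226,
                      misses=2, correct_negatives=6356)
ALL_LR = Contingency(hits=142, false_alarms=1911,
                     misses=4, correct_negatives=4671)

for name, table in (("ALL-MIX", ALL_MIX), ("ALL-LR", ALL_LR)):
    print(f"{name}: hit rate={table.hit_rate:.3f}, "
          f"false alarm ratio={table.false_alarm_ratio:.3f}, "
          f"TS={table.threat_score:.3f}, accuracy={table.accuracy:.3f}")
```

Under these definitions the two schemes have nearly the same hit rate (about 0.99 for ALL-MIX and 0.97 for ALL-LR), while the false alarm ratio drops from about 0.93 for ALL-LR to about 0.61 for ALL-MIX, which matches the second conclusion of the abstract.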