Comparative analysis of surface water quality prediction performance and identification of key water parameters using different machine learning models based on big data

Kangyang Chen; Hexia Chen; Chuanlong Zhou; Yichao Huang; Xiangyang Qi; Ruqin Shen; Fengrui Liu; Min Zuo; Xinyi Zou; Jinfeng Wang; Yan Zhang; Da Chen; Xingguo Chen; Yongfeng Deng; Hongqiang Ren

doi:10.1016/j.watres.2019.115454

Water Res. 2020 Mar 15:171:115454. doi: 10.1016/j.watres.2019.115454. Epub 2019 Dec 31.

Authors

Kangyang Chen¹, Hexia Chen², Chuanlong Zhou², Yichao Huang², Xiangyang Qi¹, Ruqin Shen³, Fengrui Liu⁴, Min Zuo⁵, Xinyi Zou¹, Jinfeng Wang⁶, Yan Zhang⁶, Da Chen², Xingguo Chen⁷, Yongfeng Deng⁸, Hongqiang Ren⁶

Affiliations

¹ Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, China.
² School of Environment, Guangzhou Key Laboratory of Environmental Exposure and Health, Guangdong Key Laboratory of Environmental Pollution and Health, Jinan University, Guangzhou, Guangdong, 510632, China.
³ School of Environment, Guangzhou Key Laboratory of Environmental Exposure and Health, Guangdong Key Laboratory of Environmental Pollution and Health, Jinan University, Guangzhou, Guangdong, 510632, China; State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China.
⁴ College of Literature, Science, and the Arts, University of Michigan, Ann Arbor, MI, 48109, USA.
⁵ National Engineering Laboratory for Agri-product Quality Traceability, Beijing Technology and Business University, Beijing, Beijing, 100048, China.
⁶ State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China.
⁷ Jiangsu Key Laboratory of Big Data Security & Intelligent Processing, Nanjing University of Posts and Telecommunications, Nanjing, China; State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, Jiangsu, 210023, China. Electronic address: chenxg@njupt.edu.cn.
⁸ School of Environment, Guangzhou Key Laboratory of Environmental Exposure and Health, Guangdong Key Laboratory of Environmental Pollution and Health, Jinan University, Guangzhou, Guangdong, 510632, China; State Key Laboratory of Pollution Control and Resource Reuse, School of the Environment, Nanjing University, Nanjing, Jiangsu, 210023, China. Electronic address: yongfengdeng@jnu.edu.cn.

PMID: 31918388
DOI: 10.1016/j.watres.2019.115454

Abstract

The water quality prediction performance of machine learning models may depend not only on the models themselves but also on the parameters in the data set chosen for training. Moreover, the key water parameters should be identified by the learning models in order to further reduce prediction costs and improve prediction efficiency. Here we endeavored, for the first time, to compare the water quality prediction performance of 10 learning models (7 traditional and 3 ensemble models) using big data (33,612 observations) from the major rivers and lakes in China from 2012 to 2018, based on precision, recall, F1-score, and weighted F1-score, and to explore the potential key water parameters for future model prediction. Our results showed that larger data sets could improve the performance of learning models in predicting water quality. Compared with the other 7 models, decision tree (DT), random forest (RF) and deep cascade forest (DCF) trained on data sets of pH, DO, CODMn, and NH₃-N performed significantly better in predicting all 6 levels of water quality recommended by the Chinese government. Moreover, two key water parameter sets (DO, CODMn, and NH₃-N; CODMn and NH₃-N) were identified and validated by DT, RF and DCF as having high specificity for predicting water quality. Therefore, DT, RF and DCF with the selected key water parameters could be prioritized for future water quality monitoring and for providing timely water quality warnings.
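The evaluation metrics named in the abstract (precision, recall, F1-score, and weighted F1-score) can be illustrated with a minimal sketch. It assumes that "weighted F1-score" means the per-class F1 averaged with weights proportional to the number of true observations in each class, which is the common convention; the abstract does not state the exact definition. The function names and the six toy labels are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1          # correct prediction for this class
        else:
            fp[pred] += 1           # predicted class received a false positive
            fn[truth] += 1          # true class received a false negative
    support = Counter(y_true)       # number of true observations per class
    scores = {}
    for c in labels:
        predicted = tp[c] + fp[c]
        actual = tp[c] + fn[c]
        precision = tp[c] / predicted if predicted else 0.0
        recall = tp[c] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[c] = (precision, recall, f1, support[c])
    return scores


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 (assumed definition)."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f1 * n for (_, _, f1, n) in scores.values()) / len(y_true)


# Hypothetical toy labels standing in for water quality levels (not study data).
y_true = [1, 1, 2, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3, 3]

scores = per_class_scores(y_true, y_pred)
overall = weighted_f1(y_true, y_pred)
for level, (p, r, f1, n) in scores.items():
    print(f"level {level}: precision={p:.3f} recall={r:.3f} f1={f1:.3f} support={n}")
print(f"weighted F1 = {overall:.3f}")
```

Weighting by support matters when the classes are imbalanced, because it keeps a rarely observed water quality level from dominating the summary score.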

Keywords: Deep cascade forest; Ensemble methods; Machine learning models; The key water parameters; Water quality prediction.

MeSH terms

Big Data
China
Machine Learning
Water Quality*
Water*

Substances

Water