Performance Analysis of Decision Tree Ensemble Models and Feature Importance Analysis in Prediction of Particulate Matter PM10
Abstract
Particulate matter-induced air pollution is known to have significant negative impacts on both the environment and human health. This research evaluates the effectiveness of various decision tree ensemble models in predicting daily PM10 concentrations in Thiruvananthapuram, Kerala, from July 2017 to December 2019. Seven decision tree ensemble models are employed: Random Forest, Extra Trees, Gradient Boosting, AdaBoost, LightGBM, XGBoost, and Histogram-Based Gradient Boosting. To address missing data, kNN imputation is used to produce a cohesive dataset suitable for model training. The models use both meteorological and air pollutant variables, and performance is assessed with the coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). The findings indicate that the Extra Trees regression model provides the best prediction performance (R² = 0.9397, RMSE = 6.664 μg/m³, MAE = 4.950 μg/m³). Histogram-Based Gradient Boosting and Random Forest also demonstrate strong predictive capability. The explainability of the best-performing models is examined through feature importance analysis, which highlights sulfur dioxide (SO2) as the most influential pollutant for PM10 levels, alongside meteorological factors such as wind speed and rainfall, enhancing both prediction accuracy and interpretability of results. This research represents the first comprehensive effort to predict PM10 levels in Thiruvananthapuram using machine learning techniques, addressing a gap in regional air quality studies.
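The workflow described in the abstract (kNN imputation, an Extra Trees regressor, evaluation with R²/RMSE/MAE, and feature importance analysis) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the study's implementation: the feature names, dataset shape, and hyperparameters below are placeholder assumptions.

```python
# Sketch of the abstract's pipeline on synthetic data (assumed, not the study's dataset):
# kNN imputation -> Extra Trees regression -> R2/RMSE/MAE -> feature importances.
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
features = ["SO2", "NO2", "wind_speed", "rainfall", "temperature"]  # placeholder names
X = rng.normal(size=(500, len(features)))
# Synthetic PM10 proxy, driven mostly by the first and third columns.
y = 3.0 * X[:, 0] + 1.5 * X[:, 2] + rng.normal(scale=0.5, size=500)

# Introduce ~5% missing values, then fill them with kNN imputation.
mask = rng.random(X.shape) < 0.05
X[mask] = np.nan
X = KNNImputer(n_neighbors=5).fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

pred = model.predict(X_test)
print("R2  :", r2_score(y_test, pred))
print("RMSE:", mean_squared_error(y_test, pred) ** 0.5)
print("MAE :", mean_absolute_error(y_test, pred))

# Impurity-based feature importances, sorted descending, for interpretability.
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

In this sketch the synthetic target depends most strongly on the first feature, so its importance score dominates, mirroring how the study uses importance scores to single out SO2 as the most influential predictor.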
Copyright (c) 2025 EMITTER International Journal of Engineering Technology

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
