Use of Artificial Intelligence for the Analysis and Prediction of Water Quality in Water Bodies
Abstract
This study investigates the application of machine learning techniques to water quality analysis and prediction in areas affected by the Fundão Dam rupture in the Gualaxo do Norte River Basin (Mariana, Minas Gerais, Brazil). The research was structured in two complementary phases: unsupervised exploratory analysis through dimensionality reduction (PCA, UMAP, t-SNE) and clustering (K-Means, HDBSCAN), followed by supervised predictive modeling with three algorithms (Random Forest, Gradient Boosting Machines, and Multi-Layer Perceptron). The Santos (2018) database was used, containing 324 water samples collected from 27 points across 12 field campaigns. The unsupervised analysis identified an optimal configuration (Block C, 8 variables) with a silhouette coefficient of 0.834 and temporal stability of 81.2%, revealing natural clusters that are robust to seasonal variability. In supervised modeling, Random Forest performed best, achieving 96.77% accuracy in the high-purity label scenario (151 samples) and surpassing the Santos (2018) baseline by 2.56 percentage points. Explainable artificial intelligence techniques (SHAP and LIME) identified Total Phosphorus and Escherichia coli as the most discriminative variables, with a combined contribution exceeding 50% of decisions. A critical analysis showed that label quality matters more than data quantity: a scenario with 151 samples and purity ≥70% outperformed a configuration with 303 moderate-purity samples by 11.52 percentage points. Data leakage was demonstrated empirically when dimensional projection coordinates were used as predictive variables, which justifies their deliberate exclusion to ensure genuine generalization. The results support the applicability of interpretable machine learning models as tools for post-disaster environmental monitoring, with practical implications for water resource management, and motivate a proposed simplified protocol focused on key variables.
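For readers unfamiliar with the silhouette coefficient reported in the abstract, the following is a minimal sketch of how that metric is usually computed. It is not the study's code: the function name `silhouette_score`, the use of Euclidean distance, and the four toy points are assumptions made purely for illustration. For each sample, the score compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b).

```python
from math import dist
from statistics import mean


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance assumed)."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's distance to itself is 0, so dividing by len - 1 is exact).
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            mean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        largest = max(a, b)
        scores.append((b - a) / largest if largest > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two well-separated groups in a two-variable space.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_score(points, [0, 0, 1, 1])  # labels follow the groups
bad = silhouette_score(points, [0, 1, 0, 1])   # labels cut across the groups
print(f"good={good:.3f} bad={bad:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, while values near or below 0 indicate overlapping or misassigned samples, which is the scale on which the reported 0.834 should be read.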
