Comparative analysis of textual product matching techniques: data collection with web scraping and analysis based on NLP and machine learning
Abstract
The rapid growth of Brazilian e-commerce, particularly in the supermarket sector, has intensified demand for automated tools that compare products across multiple platforms. In this context, textual product matching is a central challenge because products lack unique identifiers and their descriptions are highly heterogeneous: frequently noisy, semi-structured, and inconsistent. This motivates the investigation of Natural Language Processing (NLP) and machine learning techniques that can handle real-world data, with the aim of improving Entity Resolution in the grocery retail domain. This work presents a comparative analysis of six textual matching techniques: three classical (Levenshtein, Jaccard, and Jaro-Winkler), two vector-based (Bag-of-Words and TF-IDF with Cosine Similarity), and one semantic (SBERT). All six were applied to a real dataset collected via web scraping from four distinct e-commerce sources. The methodology encompassed data collection, preprocessing, implementation of the techniques, and quantitative evaluation based on Precision, Recall, and F1-Score, validated against a manually generated ground truth. The results showed that classical and vector-based techniques performed better on exact matching tasks, while the semantic SBERT model showed limitations, mainly due to the absence of fine-tuning and the occurrence of “semantic hallucinations.” Consequently, the initial hypothesis that the semantic technique would be the most effective was partially refuted, although the approach remains promising for future adaptations. All objectives were achieved, and lexical and vector-based methods remain more suitable for precise matching problems in supermarket datasets.
The main contributions are a real dataset, a replicable practical pipeline, and a critical evaluation of the methods, which offer insights for hybrid solutions and future research involving domain adaptation.
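To make the evaluated quantities concrete, the following is a minimal Python sketch of two of the classical measures named above (Jaccard over word tokens and a length-normalized Levenshtein similarity), together with Precision, Recall, and F1 computed against a set of ground-truth pairs. It is an illustration under stated assumptions, not the study's implementation: the function names, the regex tokenization, the normalization by the longer string, and the example product descriptions are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase a description and return its set of word tokens (assumed tokenization)."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty descriptions are treated as identical
    return len(ta & tb) / len(ta | tb)


def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    """Map edit distance to [0, 1] by dividing by the longer length (an assumed normalization)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def precision_recall_f1(predicted, truth):
    """Score a set of predicted matching pairs against a set of ground-truth pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical product descriptions from two stores.
score = jaccard("Arroz Branco 5kg", "arroz branco tipo 1 5kg")
```

In a pipeline of this kind, a pair of descriptions would typically be declared a match when its similarity exceeds a chosen threshold, and the resulting set of predicted pairs would be passed to `precision_recall_f1` along with the manually labeled pairs. The threshold is a tuning choice that this sketch leaves open.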
