Developing an automated framework for eco-label information categorization using web crawling and Natural Language Processing techniques

IF 7.5 | CAS Zone 1, Computer Science | JCR Q1, Computer Science, Artificial Intelligence

Expert Systems with Applications | Pub Date: 2025-04-24 | DOI: 10.1016/j.eswa.2025.127688

Ho Anh Thu Nguyen, Duy Hoang Pham, Byeol Kim, Yonghan Ahn, Nahyun Kwon

Volume 282, Article 127688. Source: https://www.sciencedirect.com/science/article/pii/S0957417425013107

Citations: 0

Abstract

Eco-labels are extensively employed to assess the environmental performance of building materials. However, their management is often fragmented across disparate online databases with inconsistent data structures, presenting significant challenges for efficient information acquisition and management. This study explores the application of web crawling techniques, Natural Language Processing (NLP), and machine learning (ML) models to collect and categorize eco-label information, with the objective of advancing the automation of information management processes. The results demonstrate that the categorization models exhibit high performance, achieving F1-scores exceeding 0.95 on the test set and at least 0.76 when validating datasets incorporating temporally updated information. However, the limited availability of data for certain eco-labels, such as Forest Stewardship Council certification and Green Screen, substantially degrades model performance with updated data. Notably, traditional ML models leveraging manual feature engineering outperform deep learning models with automatic feature extraction when applied to web-crawled data. Furthermore, the TF-IDF feature extraction technique surpasses other n-gram-based approaches, with model performance declining as n-gram length increases. This study establishes a systematic framework that informs the selection of reliable data sources, feature engineering strategies, and ML algorithms for integrating web crawling, thereby enhancing the automation of eco-label information management.
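The abstract names TF-IDF features over n-grams and F1-score evaluation but gives no implementation details. The following is a minimal sketch of that general technique, not the authors' pipeline: the toy documents, the labels, the helper names (`ngrams`, `fit_idf`, `tfidf`, `predict`, `macro_f1`) and the 1-nearest-neighbour classifier are all assumptions chosen for illustration, and the paper's actual models and data are not reproduced here.

```python
import math
from collections import Counter

def ngrams(text, n_max=1):
    """Lowercased word n-grams of every length from 1 to n_max."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_idf(docs, n_max=1):
    """Smoothed inverse document frequency for each n-gram seen in docs."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n_max)))
    n_docs = len(docs)
    return {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(doc, idf, n_max=1):
    """L2-normalised sparse TF-IDF vector; n-grams unseen in training are dropped."""
    tf = Counter(t for t in ngrams(doc, n_max) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def dot(a, b):
    """Dot product of sparse vectors (equals cosine similarity when both are unit length)."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def predict(doc, train_vecs, train_labels, idf, n_max=1):
    """1-nearest-neighbour by cosine similarity (a stand-in classifier, assumed here)."""
    q = tfidf(doc, idf, n_max)
    sims = [dot(q, v) for v in train_vecs]
    return train_labels[sims.index(max(sims))]

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)

# Invented toy corpus standing in for crawled eco-label text (hypothetical data).
train_docs = [
    "fsc certified timber from responsibly managed forests",
    "forest stewardship council chain of custody wood panel",
    "low voc emission paint for indoor air quality",
    "interior coating tested for voc emission limits",
]
train_labels = ["forest", "forest", "emission", "emission"]
test_docs = ["certified wood flooring from managed forests", "paint with low emission"]
test_labels = ["forest", "emission"]

N_MAX = 1  # raise to 2 or 3 to add bigrams/trigrams to the feature set
idf = fit_idf(train_docs, N_MAX)
train_vecs = [tfidf(d, idf, N_MAX) for d in train_docs]
preds = [predict(d, train_vecs, train_labels, idf, N_MAX) for d in test_docs]
print(preds, macro_f1(test_labels, preds))
```

Changing `N_MAX` is the knob that corresponds to the n-gram length discussed in the abstract; on a corpus this small it says nothing about the performance trend the authors report.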

Source journal

Expert Systems with Applications (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80

Self-citation rate: 10.60%

Articles published: 2045

Review time: 8.7 months

Journal description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.