A data-centric investigation on the challenges of machine learning methods for bridging life cycle inventory data gaps

IF 5.4 | Zone 3: Environmental Science and Ecology | JCR Q2: ENGINEERING, ENVIRONMENTAL

Journal of Industrial Ecology | Pub Date: 2025-04-21 | DOI: 10.1111/jiec.70022

Bu Zhao, Jitong Jiang, Ming Xu, Qingshi Tu

Journal of Industrial Ecology, volume 29 (3), pages 955-966. Full text: https://onlinelibrary.wiley.com/doi/10.1111/jiec.70022

Citations: 0

Abstract


Life cycle assessment (LCA) is a systematic approach to quantifying the environmental impacts of a product system across its entire life cycle. Although LCA is widely used to assess mature technologies, the inventory data gap has been a fundamental challenge that limits its application to emerging processes. Machine learning (ML) methods are among the possible solutions that can mitigate these data gaps in an automated and scalable way. Nonetheless, the performance of existing ML methods is unstable, which limits the trustworthiness and generalizability of the models. In this study, we conducted a data-centric investigation to delineate the causes of the unstable performance, using a similarity-based ML framework built on the Ecoinvent 3.1 unit process (UPR) database. We found that the pattern of imbalance in the data used for method development, manifested by substantial differences in (1) flow and process availability and (2) the order of magnitude of their values, is a major cause of the unstable performance. We also identified causes that stem from the ML method development workflow, particularly the steps of data preprocessing and ML model training (e.g., randomness in train–test data splits). In addition, we tested the proposed ML method on the U.S. Life Cycle Inventory Database, where we observed that the generalizability of the method was highly influenced by the size of the database to which it is applied. To address these issues, we propose that further research should focus on reducing the barriers to database integration so that both the size and the balance of the data for ML method development can be improved.
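
The abstract does not describe the framework's internals, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a process-by-flow matrix, log-scales values that span many orders of magnitude, estimates hidden flows from the most similar training processes, and repeats the evaluation over several random train-test splits to show how the error can vary from split to split. The function names (`impute_by_similarity`, `split_sensitivity`), the choice of cosine similarity, the value of `k`, and the synthetic data are all hypothetical.

```python
import numpy as np


def impute_by_similarity(train, target, known_mask, k=3):
    """Estimate the hidden flows of `target` from its k most similar training rows.

    train: (n_processes, n_flows) array of log-scaled flow values.
    target: (n_flows,) array; only entries where known_mask is True are used.
    """
    # Cosine similarity computed on the known flows only.
    a = train[:, known_mask]
    b = target[known_mask]
    sims = a @ b / (np.linalg.norm(a, axis=1) * np.linalg.norm(b) + 1e-12)
    nearest = np.argsort(sims)[-k:]
    # Similarity-weighted average of the neighbours' values for the hidden flows.
    w = np.clip(sims[nearest], 0.0, None) + 1e-12
    return (w[:, None] * train[nearest][:, ~known_mask]).sum(axis=0) / w.sum()


def split_sensitivity(data, n_splits=20, test_frac=0.2, hide_frac=0.3, seed=0):
    """Return the mean absolute error obtained for each random train-test split."""
    rng = np.random.default_rng(seed)
    n, m = data.shape
    n_test = max(1, int(round(test_frac * n)))
    errors = []
    for _ in range(n_splits):
        order = rng.permutation(n)
        test_idx, train_idx = order[:n_test], order[n_test:]
        split_err = []
        for i in test_idx:
            # Randomly hide a share of the flows of each test process.
            known = rng.random(m) >= hide_frac
            if known.all() or not known.any():
                continue
            est = impute_by_similarity(data[train_idx], data[i], known)
            split_err.append(np.abs(est - data[i][~known]).mean())
        if split_err:
            errors.append(np.mean(split_err))
    return np.array(errors)


# Synthetic stand-in for an inventory matrix (NOT Ecoinvent or USLCI data):
# 60 processes, 12 flows, values spanning nine orders of magnitude.
rng = np.random.default_rng(42)
raw = 10.0 ** rng.uniform(-6, 3, size=(60, 12))
log_data = np.log10(raw)

errors = split_sensitivity(log_data)
print(f"MAE (log10 units) across splits: min={errors.min():.2f}, max={errors.max():.2f}")
```

The spread between the minimum and maximum error across splits gives a rough indication of how much a single reported score depends on the particular split that was drawn.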


Source journal

Journal of Industrial Ecology (Environmental Science: Environmental Sciences)

CiteScore

11.60

Self-citation rate

8.50%

Articles published

117

Review time

12-24 weeks

About the journal: The Journal of Industrial Ecology addresses a series of related topics: material and energy flows studies ("industrial metabolism"); technological change; dematerialization and decarbonization; life cycle planning, design and assessment; design for the environment; extended producer responsibility ("product stewardship"); eco-industrial parks ("industrial symbiosis"); product-oriented environmental policy; and eco-efficiency. The Journal of Industrial Ecology is open to and encourages submissions that are interdisciplinary in approach. In addition to more formal academic papers, the journal seeks to provide a forum for continuing exchange of information and opinions through contributions from scholars, environmental managers, policymakers, advocates and others involved in environmental science, management and policy.