A Large Language Model-based Framework to Retrieve Life Cycle Inventory and Environmental Impact Data from Scientific Literature

IF 11.3; Zone 1 (Environmental Science and Ecology); Q1 (ENGINEERING, ENVIRONMENTAL)

Environmental Science & Technology Pub Date: 2025-10-16 DOI: 10.1021/acs.est.5c05955

Avan Kumar, Farshid Nazemi, Hariprasad Kodamana, Manojkumar Ramteke, Bhavik R. Bakshi


Citations: 0

Abstract

Life cycle assessment (LCA) quantifies environmental impacts from raw material extraction to end-of-life (EoL) treatment, yet its accuracy depends on reliable life cycle inventory (LCI) data. Obtaining such data is time-consuming and requires either an extensive literature review or access to databases that are often paywalled, which hinders transparent research. This study introduces a systematic framework that leverages a retrained large language model (LLM) to assist LCA practitioners in retrieving LCI data and information about the associated environmental impacts. The framework follows a three-stage process: (i) a fine-tuned classification model identifies relevant documents, (ii) the LLaMA-2-7B model is pretrained on the selected texts to inject domain knowledge into the model, and (iii) a fine-tuned Q&A model extracts LCI and environmental impact data from the scientific literature. The resulting LLM is termed “Sustain-LLaMA”. We implement this framework in two cases: methanol production and plastic packaging EoL treatment. After retraining, the classification models achieve high accuracies on unseen data (0.850 for methanol, 0.952 for plastic packaging), indicating that they effectively distinguish relevant studies. The Q&A models with Retrieval-Augmented Generation (RAG) yield F1 scores of 0.823 for methanol and 0.855 for plastic studies. The Q&A models’ performance is validated against LLaMA-2-7B without retraining, ChatGPT-4o, and the USLCI database, demonstrating comparable or superior accuracy and efficiency. This framework enhances scalability and precision by automating LCI data retrieval, offering a promising tool for guiding the chemical and plastic industries toward sustainability.
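
As a rough illustration of the document filtering, retrieval, and scoring steps named in the abstract, the following Python sketch uses simple stand-ins. It is not the authors' implementation: the keyword filter replaces the fine-tuned classifier of stage (i), the bag-of-words cosine ranking is an assumed placeholder for the RAG retriever, stage (ii) (pretraining LLaMA-2-7B) is not shown, and token-overlap F1 is only one common way to score extracted answers, which the abstract does not specify. All function names, keywords, and passages are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into words and decimal numbers."""
    return re.findall(r"[a-z0-9]+(?:\.[0-9]+)?", text.lower())


def is_relevant(text: str, keywords: set[str], min_hits: int = 2) -> bool:
    """Stage (i) stand-in: keyword filter instead of a fine-tuned classifier."""
    return len(set(tokenize(text)) & keywords) >= min_hits


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Assumed retriever: rank passages by similarity to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the retrieved context and the question into one prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred, ref = tokenize(prediction), tokenize(reference)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical placeholder passages, not taken from the paper.
passages = [
    "Methanol synthesis from natural gas requires electricity and steam as inputs.",
    "Plastic packaging waste can be landfilled, incinerated, or mechanically recycled.",
    "The weather was sunny during the conference.",
]
keywords = {"methanol", "plastic", "electricity", "packaging", "recycled", "inputs"}

relevant = [p for p in passages if is_relevant(p, keywords)]
question = "What inputs does methanol synthesis require?"
prompt = build_prompt(question, retrieve(question, relevant, k=1))
print(prompt)
```

In a setup like the one the abstract describes, the assembled prompt would be passed to the fine-tuned Q&A model, and its answers would be compared against annotated references with a metric such as `token_f1`.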




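Source journal

Environmental Science & Technology (Environmental Sciences; Engineering, Environmental)

CiteScore

17.50

Self-citation rate

9.60%

Articles published

12359

Review time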

2.8 months

About the journal: Environmental Science & Technology (ES&T) is an academic and technical journal co-sponsored by the Hubei Provincial Environmental Protection Bureau and the Hubei Provincial Academy of Environmental Sciences. It is listed as a Chinese core journal and as a source journal for Chinese scientific papers, the Chinese Science Citation Database, and the Chinese Academic Journal Comprehensive Evaluation Database. The journal focuses on the academic field of environmental protection and publishes articles on environmental protection and related technical advances.