Local large language model-assisted literature mining for on-surface reactions
Juan Xiang, Yizhang Li, Xinyi Zhang, Yu He, Qiang Sun
Materials Genome Engineering Advances, vol. 3, no. 1 (published 2025-03-12)
DOI: 10.1002/mgea.88
https://onlinelibrary.wiley.com/doi/10.1002/mgea.88
Citations: 0
Abstract
Large language models (LLMs) excel at extracting information from the literature. However, deploying LLMs requires substantial computational resources, and security concerns with online LLMs hinder their wider application. Herein, we introduce a method for extracting scientific data from unstructured text using a local LLM, and demonstrate it on the scientific literature on on-surface reactions. By combining prompt engineering with multi-step text preprocessing, we show that the local LLM can effectively extract scientific information, achieving a recall of 91% and a precision of 70%. Moreover, despite a large difference in model parameter count, the performance of the local LLM is comparable to that of GPT-3.5 Turbo (81% recall, 84% precision) and GPT-4o (85% recall, 87% precision). The simplicity, versatility, reduced computational requirements, and enhanced privacy of the local LLM make it highly promising for data mining, with the potential to accelerate the application and development of LLMs across various fields.
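As a minimal sketch of how extraction quality figures like those above (e.g., 91% recall, 70% precision) are typically computed, the snippet below scores a set of LLM-extracted records against a hand-curated ground truth. The record fields and example values (molecule, substrate, reaction type) are illustrative assumptions, not taken from the paper.

```python
def extraction_metrics(extracted, ground_truth):
    """Precision and recall over sets of extracted records."""
    extracted, ground_truth = set(extracted), set(ground_truth)
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Illustrative records as (molecule, substrate, reaction) tuples.
truth = {("DBBA", "Au(111)", "Ullmann coupling"),
         ("TPA", "Cu(111)", "dehydrogenation")}
found = {("DBBA", "Au(111)", "Ullmann coupling"),   # correct extraction
         ("TPA", "Ag(111)", "dehydrogenation")}     # wrong substrate

p, r = extraction_metrics(found, truth)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.50, recall=0.50
```

Note that a record counts as correct only if every field matches, so a single wrong field (here the substrate) costs both precision and recall.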