Harnessing large language models for data-scarce learning of polymer properties
Ning Liu, Siavash Jafarzadeh, Brian Y. Lattimer, Shuna Ni, Jim Lua, Yue Yu
Nature Computational Science 5(3), 245–254 (published 10 February 2025). DOI: 10.1038/s43588-025-00768-y. https://www.nature.com/articles/s43588-025-00768-y
Large language models (LLMs) bear promise as a fast and accurate material modeling paradigm for evaluation, analysis and design. Their vast number of trainable parameters necessitates a wealth of data to achieve accuracy and mitigate overfitting. However, experimental measurements are often limited and costly to obtain in sufficient quantities for fine-tuning. To this end, here we present a physics-based training pipeline that tackles the pathology of data scarcity. The core enabler is a physics-based modeling framework that generates a multitude of synthetic data to align the LLM to a physically consistent initial state before fine-tuning. Our framework features a two-phase training strategy: utilizing the large-in-amount but less accurate synthetic data for supervised pretraining, and fine-tuning the phase-1 model with limited experimental data. We empirically demonstrate that supervised pretraining is vital to obtaining accurate fine-tuned LLMs, via the lens of learning polymer flammability metrics where cone calorimeter data are sparse.

A physics-based training pipeline is developed to help tackle the challenges of data scarcity. The framework aligns large language models to a physically consistent initial state that is fine-tuned for learning polymer properties.
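To make the two-phase strategy concrete, the following is a minimal sketch, not the authors' code: phase 1 performs supervised pretraining on abundant physics-generated synthetic data, and phase 2 fine-tunes the phase-1 weights on scarce experimental measurements. The regression head, feature dimension, tensors, and hyperparameters are illustrative assumptions; in the paper the pretrained model is a large language model and the targets are polymer flammability metrics.

```python
# Sketch of a two-phase training pipeline: pretrain on synthetic data, then fine-tune
# on scarce experimental data. All shapes, data, and hyperparameters are placeholders.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def train(model, loader, epochs, lr):
    """Standard supervised regression loop shared by both phases."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model


# Stand-in for an LLM with a scalar property-prediction head.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))

# Phase 1: supervised pretraining on large, less accurate physics-based synthetic data,
# aligning the model to a physically consistent initial state.
x_syn, y_syn = torch.randn(10_000, 64), torch.randn(10_000, 1)
syn_loader = DataLoader(TensorDataset(x_syn, y_syn), batch_size=256, shuffle=True)
model = train(model, syn_loader, epochs=5, lr=1e-3)

# Phase 2: fine-tune the phase-1 model on limited experimental measurements,
# typically with a smaller learning rate so the physics-consistent prior is preserved.
x_exp, y_exp = torch.randn(100, 64), torch.randn(100, 1)
exp_loader = DataLoader(TensorDataset(x_exp, y_exp), batch_size=16, shuffle=True)
model = train(model, exp_loader, epochs=20, lr=1e-4)
```

The split learning rates and epoch counts above are assumptions chosen to reflect the usual pretrain-then-fine-tune pattern; the paper's reported setup should be consulted for the actual configuration.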