Regression with Large Language Models for Materials and Molecular Property Prediction
Ryan Jacobs, Maciej P. Polak, Lane E. Schultz, Hamed Mahdavi, Vasant Honavar, Dane Morgan
arXiv - PHYS - Materials Science | Published 2024-09-09 | https://doi.org/arxiv-2409.06080
Abstract
We demonstrate the ability of large language models (LLMs) to perform materials and molecular property regression tasks, a significant deviation from the conventional LLM use case. We benchmark the Large Language Model Meta AI (LLaMA) 3 on several molecular properties in the QM9 dataset and on 24 materials properties. Only composition-based input strings are used as the model input, and we fine-tune on only the generative loss. We broadly find that LLaMA 3, when fine-tuned using the SMILES representation of molecules, provides useful regression results that can rival standard materials property prediction models such as random forests or fully connected neural networks on the QM9 dataset. Not surprisingly, LLaMA 3 errors are 5-10x higher than those of state-of-the-art models that were trained using far more granular representations of molecules (e.g., atom types and their coordinates) for the same task. Interestingly, LLaMA 3 provides improved predictions compared to GPT-3.5 and GPT-4o. This work highlights the versatility of LLMs, suggesting that LLM-like generative models can potentially transcend their traditional applications to tackle complex physical phenomena, thus paving the way for future research and applications in chemistry, materials science, and other scientific domains.
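
The abstract does not include code, but the setup it describes, a SMILES string in, a property value out, trained with only the standard generative (next-token) loss, can be illustrated with a minimal sketch using Hugging Face transformers. The checkpoint name, prompt template, target property (HOMO-LUMO gap), number formatting, and hyperparameters below are illustrative assumptions, not the authors' settings.

```python
# Minimal sketch (not the authors' released code): fine-tune a causal LLM to
# regress a molecular property from its SMILES string using only the
# generative (next-token) loss, then parse the generated text back to a float.
import re
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL = "meta-llama/Meta-Llama-3-8B"  # assumed checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token

class Smiles2PropDataset(Dataset):
    """Serializes (SMILES, value) pairs as plain-text prompt/answer strings."""
    def __init__(self, pairs, max_len=64):
        # Hypothetical prompt template; the paper's exact format is not given.
        self.examples = [
            f"SMILES: {smi}\nHOMO-LUMO gap (eV): {val:.4f}{tokenizer.eos_token}"
            for smi, val in pairs
        ]
        self.max_len = max_len

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        enc = tokenizer(self.examples[i], truncation=True,
                        max_length=self.max_len, padding="max_length",
                        return_tensors="pt")
        ids = enc["input_ids"].squeeze(0)
        # Generative loss only: labels are the input ids themselves,
        # with pad positions masked out via -100.
        labels = ids.clone()
        labels[enc["attention_mask"].squeeze(0) == 0] = -100
        return {"input_ids": ids,
                "attention_mask": enc["attention_mask"].squeeze(0),
                "labels": labels}

train_pairs = [("C", 10.95), ("CO", 9.32)]  # stand-in for real QM9 records
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16)  # assumes a bf16-capable GPU
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm-regression", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-5),
    train_dataset=Smiles2PropDataset(train_pairs),
)
trainer.train()

def predict(smiles: str) -> float:
    """Generate the completion and parse the first number as the prediction."""
    prompt = f"SMILES: {smiles}\nHOMO-LUMO gap (eV):"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=12, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    text = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    match = re.search(r"-?\d+\.?\d*", text)
    return float(match.group()) if match else float("nan")
```

Note the design implied by the abstract: the numeric target is serialized as fixed-precision decimal text and learned with an ordinary language-modeling objective, so no regression head or MSE loss is involved; the prediction is simply whatever number the model generates, which is why string-only inputs trail coordinate-based state-of-the-art models by the reported 5-10x in error.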