Inconsistency of LLMs in molecular representations

Impact factor: 6.2 · Q1 (Chemistry, Multidisciplinary)
Bing Yan, Angelica Chen and Kyunghyun Cho
{"title":"Inconsistency of LLMs in molecular representations","authors":"Bing Yan, Angelica Chen and Kyunghyun Cho","doi":"10.1039/D5DD00176E","DOIUrl":null,"url":null,"abstract":"<p >Large language models (LLM) have demonstrated remarkable capabilities in chemistry, yet their ability to capture intrinsic chemistry remains uncertain. Within any familiar, chemically equivalent representation family, rigorous chemical reasoning should be representation-invariant, yielding consistent predictions across these representations. Here, we introduce the first systematic benchmark to evaluate the consistency of LLMs across key chemistry tasks. We curated the benchmark using paired representations of SMILES strings and IUPAC names. We find that the state-of-the-art general LLMs exhibit strikingly low consistency rates (≤1%). Even after finetuning on our dataset, the models still generate inconsistent predictions. To address this, we incorporate a sequence-level symmetric Kullback–Leibler (KL) divergence loss as a consistency regularizer. While this intervention improves surface-level consistency, it fails to enhance accuracy, suggesting that consistency and accuracy are orthogonal properties. These findings indicate that both consistency and accuracy must be considered to properly assess LLMs' capabilities in scientific reasoning.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 2876-2892"},"PeriodicalIF":6.2000,"publicationDate":"2025-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00176e?page=search","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital discovery","FirstCategoryId":"1085","ListUrlMain":"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00176e","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in chemistry, yet their ability to capture intrinsic chemistry remains uncertain. Within any familiar, chemically equivalent representation family, rigorous chemical reasoning should be representation-invariant, yielding consistent predictions across these representations. Here, we introduce the first systematic benchmark to evaluate the consistency of LLMs across key chemistry tasks. We curated the benchmark using paired representations of SMILES strings and IUPAC names. We find that state-of-the-art general-purpose LLMs exhibit strikingly low consistency rates (≤1%). Even after fine-tuning on our dataset, the models still generate inconsistent predictions. To address this, we incorporate a sequence-level symmetric Kullback–Leibler (KL) divergence loss as a consistency regularizer. While this intervention improves surface-level consistency, it fails to enhance accuracy, suggesting that consistency and accuracy are orthogonal properties. These findings indicate that both consistency and accuracy must be considered to properly assess LLMs' capabilities in scientific reasoning.
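The consistency requirement is easy to picture: a model asked for, say, the boiling point of the SMILES string CCO and of its IUPAC name, ethanol, should give the same answer, because both strings denote the same molecule. The symmetric KL regularizer penalizes disagreement between the model's output distributions under the two prompts. Below is a minimal PyTorch sketch of such a loss; the function name, the masking scheme, and the per-token averaging are illustrative assumptions, not the authors' exact formulation.

```python
# A minimal sketch (not the paper's released code) of a sequence-level
# symmetric KL consistency regularizer between the model's predictions
# for two chemically equivalent prompts.
import torch
import torch.nn.functional as F

def symmetric_kl_consistency(logits_smiles: torch.Tensor,
                             logits_iupac: torch.Tensor,
                             target_mask: torch.Tensor) -> torch.Tensor:
    """Symmetric KL between per-token output distributions.

    logits_smiles, logits_iupac: (batch, seq_len, vocab) logits for the
        same target tokens, conditioned on the SMILES or IUPAC form of
        the prompt.
    target_mask: (batch, seq_len), 1.0 for real target tokens, 0.0 for padding.
    """
    log_p = F.log_softmax(logits_smiles, dim=-1)
    log_q = F.log_softmax(logits_iupac, dim=-1)
    # KL(p||q) and KL(q||p), summed over the vocabulary at each position.
    kl_pq = (log_p.exp() * (log_p - log_q)).sum(dim=-1)
    kl_qp = (log_q.exp() * (log_q - log_p)).sum(dim=-1)
    sym_kl = 0.5 * (kl_pq + kl_qp)
    # Reduce over non-padding positions to a single scalar loss term.
    return (sym_kl * target_mask).sum() / target_mask.sum()
```

During fine-tuning, a term like this would typically be added to the ordinary cross-entropy objective with a weighting coefficient, e.g. loss = ce_loss + lam * symmetric_kl_consistency(...). Consistent with the abstract's finding, driving this term down aligns the two predictions with each other without, by itself, making either prediction more accurate.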


Source journal metrics: CiteScore 2.80 · self-citation rate 0.00%