Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Jiayun Pang, Ivan Vulić
Faraday Discussions · published 2024-08-19 · DOI: 10.1039/d4fd00104d
IF 3.3 · JCR Q2 (Chemistry, Physical) · CAS Tier 3 (Chemistry) · Citations: 0

Abstract

Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining with tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, although pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation for fine-tuning on reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful but is not essential for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient strategy, greedy decoding, is very competitive, and only marginal gains are achieved with more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
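
As a concrete illustration of the workflow the abstract describes, the sketch below shows how byte-level (ByT5) and subword (FlanT5) tokenisation differ on a SMILES string, and how greedy decoding compares with beam search at inference. This is a minimal sketch using the publicly available Hugging Face checkpoints google/flan-t5-base and google/byt5-small, not the authors' fine-tuned models or code; the reaction SMILES is an illustrative placeholder, and a real setup would first fine-tune the model on reactant-to-product pairs.

```python
# Minimal sketch, not the authors' code: byte-level vs subword tokenisation of
# SMILES, and greedy vs beam-search decoding, using Hugging Face transformers.
# The checkpoints are the public pretrained ones; in the paper the models are
# first fine-tuned on reactant -> product SMILES pairs before inference.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative reactant SMILES (acetyl chloride + ethanol), not from the paper's dataset.
reactants = "CC(=O)Cl.OCC"

# FlanT5 uses a SentencePiece subword vocabulary; ByT5 operates on raw UTF-8 bytes.
flan_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print("FlanT5 subword pieces:", flan_tok.tokenize(reactants))
# One id per UTF-8 byte (offset by 3 special-token ids), plus end-of-sequence.
print("ByT5 byte ids:", byt5_tok(reactants).input_ids)

# Load one model and tokenise the reactants for inference.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
inputs = flan_tok(reactants, return_tensors="pt")

# Greedy decoding: take the single most probable token at every step.
greedy_ids = model.generate(**inputs, max_new_tokens=128)

# Beam search returning several candidates, e.g. for Top-5 accuracy evaluation.
beam_ids = model.generate(
    **inputs, max_new_tokens=128, num_beams=5, num_return_sequences=5
)

print("Greedy prediction:", flan_tok.batch_decode(greedy_ids, skip_special_tokens=True))
print("Top-5 beam candidates:", flan_tok.batch_decode(beam_ids, skip_special_tokens=True))
```

Under these assumptions, the byte-level ByT5 tokeniser sidesteps the question of a SMILES-specific vocabulary entirely, whereas vocabulary trimming would apply to FlanT5's subword vocabulary; the beam-search call is one way to obtain multiple candidates for Top-5 style evaluation.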
Source journal: Faraday Discussions (Chemistry, Physical Chemistry)
Self-citation rate: 0.00% · Annual articles: 259
Journal scope: discussion summaries and research papers from discussion meetings that focus on rapidly developing areas of physical chemistry and its interfaces.