Specialising and Analysing Instruction-Tuned and Byte-Level Language Models for Organic Reaction Prediction

Jiayun Pang, Ivan Vulić
Faraday Discussions · published 2024-08-19 · DOI: 10.1039/d4fd00104d
IF 3.3 · JCR Q2 (Chemistry, Physical) · CAS Tier 3 (Chemistry) · Citations: 0

Abstract

Transformer-based encoder-decoder models have demonstrated impressive results in chemical reaction prediction tasks. However, these models typically rely on pretraining with tens of millions of unlabelled molecules, which can be time-consuming and GPU-intensive. One of the central questions we aim to answer in this work is: Can FlanT5 and ByT5, encoder-decoder models pretrained solely on language data, be effectively specialised for organic reaction prediction through task-specific fine-tuning? We conduct a systematic empirical study of several key aspects of the process, including tokenisation, the impact of (SMILES-oriented) pretraining, fine-tuning sample efficiency, and decoding algorithms at inference. Our key findings indicate that, although pretrained only on language tasks, FlanT5 and ByT5 provide a solid foundation for fine-tuning on reaction prediction, and thus become 'chemistry domain compatible' in the process. This suggests that GPU-intensive and expensive pretraining on a large dataset of unlabelled molecules may be useful but is not essential for leveraging the power of language models for chemistry. All our models achieve comparable Top-1 and Top-5 accuracy, although some variation across models does exist. Notably, tokenisation and vocabulary trimming slightly affect final performance but can speed up training and inference; the most efficient strategy, greedy decoding, is very competitive, and only marginal gains are achieved with more sophisticated decoding algorithms. In summary, we evaluate FlanT5 and ByT5 across several dimensions and benchmark their impact on organic reaction prediction, which may guide more effective use of these state-of-the-art language models for chemistry-related tasks in the future.
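
As a concrete illustration of the workflow the abstract describes, the sketch below shows how byte-level (ByT5) and subword (FlanT5) tokenisation differ on a SMILES string, and how greedy decoding compares with beam search at inference. This is a minimal sketch using the publicly available Hugging Face checkpoints google/flan-t5-base and google/byt5-small, not the authors' fine-tuned models or code; the reaction SMILES is an illustrative placeholder, and a real setup would first fine-tune the model on reactant-to-product pairs.

```python
# Minimal sketch, not the authors' code: byte-level vs subword tokenisation of
# SMILES, and greedy vs beam-search decoding, using Hugging Face transformers.
# The checkpoints are the public pretrained ones; in the paper the models are
# first fine-tuned on reactant -> product SMILES pairs before inference.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative reactant SMILES (acetyl chloride + ethanol), not from the paper's dataset.
reactants = "CC(=O)Cl.OCC"

# FlanT5 uses a SentencePiece subword vocabulary; ByT5 operates on raw UTF-8 bytes.
flan_tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
byt5_tok = AutoTokenizer.from_pretrained("google/byt5-small")
print("FlanT5 subword pieces:", flan_tok.tokenize(reactants))
# One id per UTF-8 byte (offset by 3 special-token ids), plus end-of-sequence.
print("ByT5 byte ids:", byt5_tok(reactants).input_ids)

# Load one model and tokenise the reactants for inference.
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
inputs = flan_tok(reactants, return_tensors="pt")

# Greedy decoding: take the single most probable token at every step.
greedy_ids = model.generate(**inputs, max_new_tokens=128)

# Beam search returning several candidates, e.g. for Top-5 accuracy evaluation.
beam_ids = model.generate(
    **inputs, max_new_tokens=128, num_beams=5, num_return_sequences=5
)

print("Greedy prediction:", flan_tok.batch_decode(greedy_ids, skip_special_tokens=True))
print("Top-5 beam candidates:", flan_tok.batch_decode(beam_ids, skip_special_tokens=True))
```

Under these assumptions, the byte-level ByT5 tokeniser sidesteps the question of a SMILES-specific vocabulary entirely, whereas vocabulary trimming would apply to FlanT5's subword vocabulary; the beam-search call is one way to obtain multiple candidates for Top-5 style evaluation.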
Source journal: Faraday Discussions (Chemistry, Physical Chemistry)
Self-citation rate: 0.00% · Annual articles: 259
Journal scope: discussion summaries and research papers from discussion meetings that focus on rapidly developing areas of physical chemistry and its interfaces.