Data-Efficient Molecular Generation with Hierarchical Textual Inversion

Seojin Kim, Jaehyun Nam, Sihyun Yu, Younghoon Shin, Jinwoo Shin
Source: arXiv - QuanBio - Molecular Networks
Published: 2024-05-05
DOI: arxiv-2405.02845 (https://doi.org/arxiv-2405.02845)
Citations: 0

Abstract

Data-Efficient Molecular Generation with Hierarchical Textual Inversion
Developing an effective molecular generation framework even with a limited number of molecules is often important for practical deployment, e.g., in drug discovery, since acquiring task-related molecular data requires expensive and time-consuming experiments. To tackle this issue, we introduce Hierarchical textual Inversion for Molecular generation (HI-Mol), a novel data-efficient molecular generation method. HI-Mol is inspired by the importance of hierarchical information, e.g., both coarse- and fine-grained features, in understanding the molecule distribution. We propose to use multi-level embeddings to reflect such hierarchical features, building on the recent textual inversion technique from the visual domain, which achieves data-efficient image generation. Compared to the conventional textual inversion method in the image domain, which uses a single-level token embedding, our multi-level token embeddings allow the model to effectively learn the underlying low-shot molecule distribution. We then generate molecules by interpolating the multi-level token embeddings. Extensive experiments demonstrate the superiority of HI-Mol with notable data efficiency. For instance, on QM9, HI-Mol outperforms the prior state-of-the-art method with 50x less training data. We also show the effectiveness of molecules generated by HI-Mol in low-shot molecular property prediction.
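The abstract does not specify how the multi-level token embeddings are interpolated, so the following is only a minimal sketch under stated assumptions: each molecule is represented by a stack of per-level embeddings of shape (num_levels, dim), and interpolation is a plain convex combination applied level by level. The function name `interpolate_levels` and the weight `lam` are hypothetical and are not taken from the paper.

```python
import numpy as np


def interpolate_levels(emb_a, emb_b, lam):
    """Linearly interpolate two stacks of multi-level token embeddings.

    emb_a, emb_b: arrays of shape (num_levels, dim), one row per level
                  (e.g. coarse to fine). This layout is an assumption.
    lam:          a scalar, or one weight per level, each in [0, 1].
                  lam = 1 returns emb_a, lam = 0 returns emb_b.
    """
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    if emb_a.ndim != 2 or emb_a.shape != emb_b.shape:
        raise ValueError("both inputs must share one (num_levels, dim) shape")

    # Expand a scalar weight to one weight per level.
    lam = np.broadcast_to(np.asarray(lam, dtype=float), emb_a.shape[:1])
    if np.any(lam < 0.0) or np.any(lam > 1.0):
        raise ValueError("interpolation weights must lie in [0, 1]")

    # Convex combination, applied independently at every level.
    return lam[:, None] * emb_a + (1.0 - lam[:, None]) * emb_b


# Hypothetical usage: mix two molecules' embeddings with a random weight.
rng = np.random.default_rng(0)
mol_a = rng.normal(size=(3, 8))   # 3 levels, 8-dimensional tokens (made up)
mol_b = rng.normal(size=(3, 8))
mixed = interpolate_levels(mol_a, mol_b, rng.uniform())
```

In a full pipeline the mixed embeddings would presumably be fed to the generative model as conditioning tokens; that step depends on details the abstract does not give and is left out here.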