M³-20M: A large-scale multi-modal molecule dataset for AI-driven drug design and discovery.

IF 0.7 | Zone 4, Biology | JCR Q4, Mathematical & Computational Biology

Journal of Bioinformatics and Computational Biology | Pub Date: 2025-04-01 | Epub Date: 2025-06-04 | DOI: 10.1142/S0219720025500064

Siyuan Guo, Lexuan Wang, Chang Jin, Jinxian Wang, Han Peng, Huayang Shi, Wengen Li, Jihong Guan, Shuigeng Zhou

Volume 23, Issue 2, Article 2550006. Publication type: Journal Article.

Citations: 0

Abstract

This paper introduces M³-20M, a large-scale Multi-Modal Molecule dataset containing over 20 million molecules; most of the data are integrated from existing databases, and part is generated by large language models. Designed to support AI-driven drug design and discovery, M³-20M contains 71 times as many molecules as the largest existing dataset, a scale that can substantially benefit the training or fine-tuning of models, including large language models, for drug design and discovery tasks. The dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M³-20M in drug design and discovery, we conduct extensive experiments on two key tasks, molecule generation and molecular property prediction, using large language models including GLM4, GPT-3.5, GPT-4, and Llama3-8b. Our experimental results show that M³-20M can significantly boost model performance in both tasks. Specifically, it enables the models to generate more diverse and valid molecular structures and to achieve higher property prediction accuracy than they do with existing single-modal datasets, which validates the value and potential of M³-20M in supporting AI-driven drug design and discovery. The dataset is available at https://github.com/bz99bz/M-3.
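To make the five modalities listed in the abstract concrete, the following is a minimal sketch of how one multi-modal molecule entry might be held in memory. It is not the M³-20M schema: the class name `MoleculeRecord`, every field name, the `is_consistent` and `to_prompt` helpers, and the toy coordinates are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MoleculeRecord:
    """Hypothetical container for one multi-modal molecule entry.

    Field names are illustrative assumptions, not the actual dataset schema.
    """
    smiles: str                                # 1D: SMILES string
    atoms: List[str]                           # 2D graph nodes: element symbols
    bonds: List[Tuple[int, int, float]]        # 2D graph edges: (atom i, atom j, bond order)
    coords: List[Tuple[float, float, float]]   # 3D: one (x, y, z) position per atom
    properties: Dict[str, float] = field(default_factory=dict)  # physicochemical properties
    description: str = ""                      # free-text description

    def is_consistent(self) -> bool:
        """Check that the graph and 3D modalities agree with each other."""
        n = len(self.atoms)
        bonds_ok = all(0 <= i < n and 0 <= j < n and i != j for i, j, _ in self.bonds)
        return bonds_ok and len(self.coords) == n

    def to_prompt(self) -> str:
        """Flatten the record into text, e.g. for prompting a language model."""
        props = ", ".join(f"{k}={v:g}" for k, v in sorted(self.properties.items()))
        return f"SMILES: {self.smiles}\nProperties: {props}\nDescription: {self.description}"


# Heavy-atom-only toy entry for ethanol; the coordinates are made-up placeholders.
ethanol = MoleculeRecord(
    smiles="CCO",
    atoms=["C", "C", "O"],
    bonds=[(0, 1, 1.0), (1, 2, 1.0)],
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)],
    properties={"molecular_weight": 46.07},
    description="Ethanol is a simple primary alcohol.",
)

print(ethanol.is_consistent())
print(ethanol.to_prompt())
```

Keeping the modalities side by side in one record is what lets a consistency check like `is_consistent` compare the graph against the 3D structure; a real pipeline would more likely derive the graph and coordinates from the SMILES string with a cheminformatics toolkit than store them by hand.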

Source journal: Journal of Bioinformatics and Computational Biology (Mathematical & Computational Biology)

CiteScore: 2.10

Self-citation rate: 0.00%

Journal description: The Journal of Bioinformatics and Computational Biology aims to publish high-quality, original research articles, expository tutorial papers, and review papers, as well as short, critical comments on technical issues associated with the analysis of cellular information. The research papers will be technical presentations of new assertions, discoveries, and tools, intended for a narrower specialist community. The tutorials, reviews, and critical commentary will be targeted at a broader readership of biologists who are interested in using computers but are not knowledgeable about scientific computing and, equally, computer scientists who have an interest in biology but are familiar with neither the current thrusts nor the language of biology. Such carefully chosen tutorials and articles should greatly accelerate the rate of entry of these new creative scientists into the field.