Yuchen Zhang, Ratish Kumar Chandrakant Jha, Soumya Bharadwaj, Vatsal Sanjaykumar Thakkar, Adrienne Hoarfrost, Jin Sun
A Benchmark Dataset for Multimodal Prediction of Enzymatic Function Coupling DNA Sequences and Natural Language
arXiv - QuanBio - Genomics, published 2024-07-21. DOI: arxiv-2407.15888 (https://doi.org/arxiv-2407.15888)
Abstract
Predicting a gene's function from its DNA sequence is a fundamental challenge in
biology. Many deep learning models have been proposed to embed DNA sequences
and predict their enzymatic function, leveraging information in public
databases linking DNA sequences to an enzymatic function label. However, much
of the scientific community's knowledge of biological function is not
represented in these categorical labels, and is instead captured in
unstructured text descriptions of mechanisms, reactions, and enzyme behavior.
These descriptions are often stored alongside DNA sequences in biological
databases, albeit in an unstructured manner. Deep learning models that predict
enzymatic function are likely to benefit from incorporating this multi-modal
data, which encodes scientific knowledge of biological function. There is, however,
no dataset designed for machine learning algorithms to leverage this
multi-modal information. Here we propose a novel dataset and benchmark suite
that enables the exploration and development of large multi-modal neural
network models on gene DNA sequences and natural language descriptions of gene
function. We present baseline performance on benchmarks for both unsupervised
and supervised tasks. These baselines demonstrate the difficulty of this modeling
objective and show the potential benefit of incorporating multi-modal data
types in function prediction compared to using DNA sequences alone.
Our dataset is available at https://hoarfrost-lab.github.io/BioTalk/.