GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models

Zicheng Liu, Jiahui Li, Siyuan Li, Zelin Zang, Cheng Tan, Yufei Huang, Yajing Bai, Stan Z. Li
Source: arXiv - QuanBio - Genomics
Published: 2024-06-01
Identifier: arxiv-2406.01627 (https://doi.org/arxiv-2406.01627)
Citations: 0

Abstract

The Genomic Foundation Model (GFM) paradigm is expected to facilitate the extraction of generalizable representations from massive genomic data, thereby enabling their use across a spectrum of downstream applications. Despite recent advances, the lack of a common evaluation framework makes equitable assessment difficult, owing to differences in experimental settings, model intricacy, and benchmark datasets, as well as reproducibility challenges. In the absence of standardization, comparative analyses risk becoming biased and unreliable. To surmount this impasse, we introduce GenBench, a comprehensive benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models. GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies. We systematically evaluate datasets spanning diverse biological domains, with a particular emphasis on both short-range and long-range genomic tasks, starting with the three most important DNA tasks, which cover coding regions, non-coding regions, genome structure, and related areas. Moreover, we provide a nuanced analysis of how the interplay between model architecture and dataset characteristics affects task-specific performance. Our findings reveal an interesting observation: independent of the number of parameters, attention-based and convolution-based models show a discernible difference in preference on short- and long-range tasks, which may provide insights into the future design of GFMs.