CMOB: Large-Scale Cancer Multi-Omics Benchmark with Open Datasets, Tasks, and Baselines

Ziwei Yang, Rikuto Kotoge, Zheng Chen, Xihao Piao, Yasuko Matsubara, Yasushi Sakurai
arXiv - QuanBio - Genomics, published 2024-09-02. https://doi.org/arxiv-2409.02143

Abstract

Machine learning has shown great potential in cancer multi-omics studies, offering substantial opportunities for advancing precision medicine. However, dataset curation and task formulation pose significant hurdles, especially for researchers without a biomedical background. Here, we introduce CMOB, the first large-scale cancer multi-omics benchmark, which integrates the TCGA platform and makes its data resources accessible and usable for machine learning researchers without significant preparation or expertise. To date, CMOB includes 20 cancer multi-omics datasets covering 32 cancers, accompanied by a systematic data processing pipeline. CMOB provides well-processed dataset versions that support 20 meaningful tasks across four studies, along with a collection of benchmarks. We also integrate CMOB with two complementary resources and various biological tools to open broader research avenues. All resources are openly accessible, with user-friendly and compatible integration scripts that enable non-experts to easily incorporate this complementary information into various tasks. We conduct extensive experiments on selected datasets to offer recommendations on suitable machine learning baselines for specific applications. Through CMOB, we aim to facilitate algorithmic advances and hasten the development, validation, and clinical translation of machine learning models for personalized cancer treatments. CMOB is available on GitHub (https://github.com/chenzRG/Cancer-Multi-Omics-Benchmark).