CMOB: Large-Scale Cancer Multi-Omics Benchmark with Open Datasets, Tasks, and Baselines

arXiv - QuanBio - Genomics Pub Date: 2024-09-02 DOI: arxiv-2409.02143

Ziwei Yang, Rikuto Kotoge, Zheng Chen, Xihao Piao, Yasuko Matsubara, Yasushi Sakurai



Abstract

Machine learning has shown great potential in cancer multi-omics studies, offering substantial opportunities for advancing precision medicine. However, dataset curation and task formulation pose significant hurdles, especially for researchers lacking a biomedical background. Here, we introduce CMOB, the first large-scale cancer multi-omics benchmark that integrates the TCGA platform, making data resources accessible and usable for machine learning researchers without significant preparation or expertise. To date, CMOB includes a collection of 20 cancer multi-omics datasets covering 32 cancers, accompanied by a systematic data processing pipeline. CMOB provides well-processed dataset versions to support 20 meaningful tasks in four studies, with a collection of benchmarks. We also integrate CMOB with two complementary resources and various biological tools to explore broader research avenues. All resources are openly accessible, with user-friendly and compatible integration scripts that enable non-experts to easily incorporate this complementary information for various tasks. We conduct extensive experiments on selected datasets to offer recommendations on suitable machine learning baselines for specific applications. Through CMOB, we aim to facilitate algorithmic advances and hasten the development, validation, and clinical translation of machine-learning models for personalized cancer treatments. CMOB is available on GitHub (https://github.com/chenzRG/Cancer-Multi-Omics-Benchmark).
