The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository

Impact factor 10.1 | Zone 1 (Biology) | Q1, Biotechnology & Applied Microbiology
Yu-Ning Huang, Pooja Vinod Jaiswal, Anushka Rajes, Anushka Yadav, Dottie Yu, Fangyun Liu, Grace Scheg, Emma Shih, Grigore Boldirev, Irina Nakashidze, Aditya Sarkar, Jay Himanshu Mehta, Ke Wang, Khooshbu Kantibhai Patel, Mustafa Ali Baig Mirza, Kunali Chetan Hapani, Qiushi Peng, Ram Ayyala, Ruiwei Guo, Shaunak Kapur, Tejasvene Ramesh, Dumitru Ciorbă, Viorel Munteanu, Viorel Bostan, Mihai Dimian, Malak S. Abedalthagafi, Serghei Mangul
Journal: Genome Biology
Published: 2025-09-09
DOI: 10.1186/s13059-025-03725-0 (https://doi.org/10.1186/s13059-025-03725-0)

Abstract

Recent advances in high-throughput sequencing technologies have enabled the collection and sharing of massive amounts of omics data along with the associated metadata: descriptive information that contextualizes the data, including phenotypic traits and experimental design. Improving metadata availability is critical for data reusability and reproducibility and for enabling new biomedical discoveries through data reuse. Yet incomplete metadata accompanying public omics data may hinder reproducibility and reusability and limit secondary analyses. Our study assesses the completeness of metadata in over 253 scientific studies, covering more than 164,000 samples from both human and non-human mammalian studies. We find that over 25% of critical metadata are omitted, with only 74.8% of relevant phenotypes available in publications or public repositories. Notably, public repositories alone contain 62% of the phenotypes, surpassing the textual content of publications by 3.5%. Only 11.5% of studies shared all phenotypes, while 37.9% shared less than 40% of the phenotypes. Additionally, studies with non-human samples are more likely than human studies to include complete metadata. Similar trends are observed in an extended dataset comprising 61,000 studies and 2.1 million samples from the Gene Expression Omnibus (GEO) data repository. These findings highlight significant gaps in metadata sharing and underscore the need for standardized practices to improve metadata availability. Better metadata reporting would foster data reusability, support better-informed decision-making, and promote reproducible research across the biomedical field.
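The abstract reports completeness as shares of phenotypes and of studies, but it does not spell out the scoring procedure. The following is only a minimal sketch of how such figures could be computed, assuming each study is described by a set of relevant phenotypes and the subsets reported in the publication and in the repository. The `StudyMetadata` record, its field names, and the toy values are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass, field


@dataclass
class StudyMetadata:
    """Hypothetical record of which phenotypes a study reports, and where."""
    relevant: set[str]
    in_publication: set[str] = field(default_factory=set)
    in_repository: set[str] = field(default_factory=set)


def completeness(study: StudyMetadata) -> float:
    """Fraction of relevant phenotypes found in either source."""
    if not study.relevant:
        return 0.0
    shared = (study.in_publication | study.in_repository) & study.relevant
    return len(shared) / len(study.relevant)


def summarize(studies: list[StudyMetadata]) -> dict[str, float]:
    """Aggregate per-study scores into the kinds of shares quoted above."""
    if not studies:
        raise ValueError("at least one study is required")
    scores = [completeness(s) for s in studies]
    n = len(scores)
    return {
        "mean_completeness": sum(scores) / n,
        "fully_complete_share": sum(s == 1.0 for s in scores) / n,
        "below_40_share": sum(s < 0.4 for s in scores) / n,
    }


# Toy input (assumed, not real data): one complete and one sparse study.
toy_studies = [
    StudyMetadata(
        relevant={"age", "sex", "tissue", "disease"},
        in_publication={"age"},
        in_repository={"sex", "tissue", "disease"},
    ),
    StudyMetadata(
        relevant={"age", "sex", "tissue", "disease", "strain"},
        in_publication={"age"},
    ),
]
summary = summarize(toy_studies)
```

Taking the union of the two sources mirrors the abstract's phrase "available in publications or public repositories"; scoring each source separately would only require swapping the union for a single set.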
Source journal: Genome Biology
Subject area: Biochemistry, Genetics and Molecular Biology (Genetics)
CiteScore: 21.00
Self-citation rate: 3.30%
Articles published: 241
Review time: 2 months
About the journal: Genome Biology is a platform for research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens. With an impact factor of 12.3 (2022), the journal is the 3rd-ranked research journal in the Genetics and Heredity category and the 2nd-ranked research journal in the Biotechnology and Applied Microbiology category by Thomson Reuters. Genome Biology is also the highest-ranked open-access journal in this category. Its in-house Editors work closely with an Editorial Board of international experts to keep the journal at the forefront of scientific advances and community standards, and they engage regularly with researchers at conferences and institute visits to stay abreast of the latest developments in the field.