Yu-Ning Huang, Pooja Vinod Jaiswal, Anushka Rajes, Anushka Yadav, Dottie Yu, Fangyun Liu, Grace Scheg, Emma Shih, Grigore Boldirev, Irina Nakashidze, Aditya Sarkar, Jay Himanshu Mehta, Ke Wang, Khooshbu Kantibhai Patel, Mustafa Ali Baig Mirza, Kunali Chetan Hapani, Qiushi Peng, Ram Ayyala, Ruiwei Guo, Shaunak Kapur, Tejasvene Ramesh, Dumitru Ciorbă, Viorel Munteanu, Viorel Bostan, Mihai Dimian, Malak S. Abedalthagafi, Serghei Mangul
Genome Biology, published 2025-09-09. DOI: 10.1186/s13059-025-03725-0 (https://doi.org/10.1186/s13059-025-03725-0)
The systematic assessment of completeness of public metadata accompanying omics studies in the Gene Expression Omnibus data repository
Recent advances in high-throughput sequencing technologies have enabled the collection and sharing of a massive amount of omics data, along with its associated metadata—descriptive information that contextualizes the data, including phenotypic traits and experimental design. Enhancing metadata availability is critical to ensure data reusability and reproducibility and to facilitate novel biomedical discoveries through effective data reuse. Yet, incomplete metadata accompanying public omics data may hinder reproducibility and reusability and limit secondary analyses. Our study assesses the completeness of metadata in over 253 scientific studies, covering more than 164,000 samples from both human and non-human mammalian studies. We find that over 25% of critical metadata are omitted, with only 74.8% of relevant phenotypes available in publications or public repositories. Notably, public repositories alone contain 62% of the phenotypes, surpassing the textual content of publications by 3.5%. Only 11.5% of studies completely shared all phenotypes, while 37.9% shared less than 40% of the phenotypes. Additionally, studies with non-human samples are more likely to include complete metadata compared to human studies. Similar trends are observed in an extended dataset comprising 61,000 studies and 2.1 million samples from the Gene Expression Omnibus (GEO) data repository. These findings highlight significant gaps in metadata sharing, underscoring the need for standardized practices to improve metadata availability. Enhanced metadata reporting would foster data reusability, support better-informed decision-making, and promote reproducible research across the biomedical field.
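The abstract reports completeness as the share of relevant phenotypes available in the publication text, in public repositories, or in either source. The following is a minimal sketch of how such a per-study score could be computed. It is not the authors' pipeline: the `Study` record, the `completeness` function, and the phenotype names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Study:
    """Hypothetical record of which relevant phenotypes a study shares, and where."""
    relevant: set[str]
    in_publication: set[str] = field(default_factory=set)
    in_repository: set[str] = field(default_factory=set)


def completeness(study: Study) -> dict[str, float]:
    """Return the fraction of relevant phenotypes available per source and overall."""
    n = len(study.relevant)
    if n == 0:
        raise ValueError("study has no relevant phenotypes")
    # Only count phenotypes that are actually in the relevant set.
    pub = study.in_publication & study.relevant
    repo = study.in_repository & study.relevant
    return {
        "publication": len(pub) / n,
        "repository": len(repo) / n,
        # A phenotype counts as available if either source reports it.
        "overall": len(pub | repo) / n,
    }


# Illustrative study: five relevant phenotypes (names are assumptions).
example = Study(
    relevant={"age", "sex", "tissue", "disease", "strain"},
    in_publication={"age", "sex"},
    in_repository={"sex", "tissue", "disease"},
)
scores = completeness(example)
print(scores)
```

In this toy case the overall score is the union of both sources, so it can exceed either source alone; averaging such scores across studies would give an aggregate figure of the kind the abstract quotes, under the assumptions stated above.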
Genome Biology
Biochemistry, Genetics and Molecular Biology - Genetics
CiteScore: 21.00
Self-citation rate: 3.30%
Articles published: 241
Review time: 2 months
About the journal:
Genome Biology stands as a premier platform for exceptional research across all domains of biology and biomedicine, explored through a genomic and post-genomic lens.
With an impact factor of 12.3 (2022), the journal is ranked by Thomson Reuters as the 3rd research journal in the Genetics and Heredity category and the 2nd research journal in the Biotechnology and Applied Microbiology category. Notably, Genome Biology is the highest-ranked open-access journal in the latter category.
Our dedicated team of highly trained in-house Editors collaborates closely with our Editorial Board of international experts, ensuring the journal remains at the forefront of scientific advances and community standards. Regular engagement with researchers at conferences and institute visits underscores our commitment to staying abreast of the latest developments in the field.