Holly C Beale, Katrina Learned, Ellen T Kephart, A Geoffrey Lyle, Anouk van den Bout, Molly McCabe, Kathryn Echandia-Monroe, Mansi J Khare, Elise Y Huang, Sneha Jariwala, Reyna Antilla, Allison Cheney, Alex G Lee, Leanne C Sayles, Stanley G Leung, Yvonne A Vasquez, Lauren Sanders, David Haussler, Sofie R Salama, E Alejandro Sweet-Cordero, Olena M Vaske
Scientific Data 12(1), 1134. Published 2025-07-02. DOI: 10.1038/s41597-025-05376-z (https://doi.org/10.1038/s41597-025-05376-z)
Consistently processed RNA sequencing data from 50 sources enriched for pediatric data.
Larger cohorts improve the power of tumor gene expression analysis, but the signal is muddied if datasets are processed using different methods or have inaccurate metadata. Here we present five compendia containing consistently processed gene expression data derived from 16,446 diverse RNA sequencing datasets. To create the compendia, we obtained access to RNA sequence data from repositories containing public data as well as clinical partners with access to non-published data. We then assessed the quality, quantified gene expression, harmonized clinical metadata, and released the expression values and metadata without access restrictions. These datasets have been used for diverse projects ranging from identifying similarities between tumor types to assessing how well cell lines recapitulate tumors. They have also been used for n-of-1 analysis to identify genes with unusual expression patterns in a single sample and to infer molecular diagnosis. The comparison to new data is enabled by our dockerized, freely available pipeline. The compendia have been cited in at least 20 publications.
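The n-of-1 idea mentioned in the abstract, comparing one sample against a consistently processed cohort to find genes with unusual expression, can be illustrated with a minimal sketch. The abstract does not state which outlier criterion the authors use, so the Tukey-fence rule below is an assumption chosen for illustration, and the function name `outlier_genes`, the gene labels, and the expression values are all hypothetical.

```python
from statistics import quantiles


def outlier_genes(compendium, sample, k=1.5):
    """Flag genes whose expression in one sample falls outside Tukey fences.

    compendium: dict mapping gene name -> list of expression values in the
                reference cohort (hypothetical data layout)
    sample:     dict mapping gene name -> expression value in the new sample
    k:          fence multiplier (1.5 is the conventional Tukey choice;
                an assumption here, not taken from the paper)
    """
    flagged = {}
    for gene, value in sample.items():
        cohort = compendium.get(gene)
        # Skip genes that are missing or have too few reference values.
        if cohort is None or len(cohort) < 4:
            continue
        q1, _, q3 = quantiles(cohort, n=4)  # cohort quartiles
        iqr = q3 - q1
        low, high = q1 - k * iqr, q3 + k * iqr
        if value > high:
            flagged[gene] = "high"
        elif value < low:
            flagged[gene] = "low"
    return flagged


# Toy reference cohort and a single new sample (made-up numbers).
compendium = {
    "GENE_A": [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 1.1],
    "GENE_B": [5.0, 5.5, 4.8, 5.2, 5.1, 4.9, 5.3, 5.0],
}
sample = {"GENE_A": 6.0, "GENE_B": 5.1}

result = outlier_genes(compendium, sample)
print(result)
```

A comparison like this is only meaningful when the new sample and the cohort are quantified the same way, which is the motivation the abstract gives for consistent processing and for releasing the pipeline.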
About the journal:
Scientific Data is an open-access journal focused on data, publishing descriptions of research datasets and articles on data sharing across natural sciences, medicine, engineering, and social sciences. Its goal is to enhance the sharing and reuse of scientific data, encourage broader data sharing, and acknowledge those who share their data.
The journal primarily publishes Data Descriptors, which offer detailed descriptions of research datasets, including data collection methods and technical analyses validating data quality. These descriptors aim to facilitate data reuse rather than testing hypotheses or presenting new interpretations, methods, or in-depth analyses.