Recall and bias of retrieving gene expression microarray datasets through PubMed identifiers.

Heather Piwowar, Wendy Chapman
Journal of biomedical discovery and collaboration. Journal Article. Published 2010-03-28. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2990274/pdf/

Abstract

Background: The ability to locate publicly available gene expression microarray datasets effectively and efficiently facilitates the reuse of these potentially valuable resources. Centralized biomedical databases allow users to query dataset metadata descriptions, but these annotations are often too sparse and diverse to allow complex and accurate queries. In this study we examined the ability of PubMed article identifiers to locate publicly available gene expression microarray datasets, and investigated whether the retrieved datasets were representative of publicly available datasets found through statements of data sharing in the associated research articles.

Results: In a recent article, Ochsner and colleagues identified 397 studies that had generated gene expression microarray data. Their search of the full text of each publication for statements of data sharing revealed 203 publicly available datasets, including 179 in the Gene Expression Omnibus (GEO) or ArrayExpress databases. Our scripted search of GEO and ArrayExpress for the PubMed identifiers of the same 397 studies returned 160 datasets, including six not found by the original search for data sharing statements. As a proportion of datasets found by either method, the search for data sharing statements identified 91.4% of the 209 publicly available datasets, compared to only 76.6% found by our search using PubMed identifiers. Searching GEO or ArrayExpress alone retrieved 63.2% and 46.9% of all available datasets, respectively. The datasets found by PubMed identifier searches did not differ in research theme or in the technology used. However, the studies identified were more likely to have larger sample sizes, to be cited more frequently, and to be published in higher-impact journals.
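The recall figures above are proportions of the pooled set of datasets found by either method. The following is a minimal sketch of that calculation, not the authors' script: the function name `recall_against_union` and the toy identifiers are hypothetical, and only the counts 160 and 209 are taken from the abstract.

```python
def recall_against_union(found_by_method, *all_methods):
    """Percentage of the pooled datasets (union of all methods) that one method retrieved."""
    pooled = set().union(*all_methods)
    if not pooled:
        raise ValueError("no datasets were found by any method")
    return 100.0 * len(set(found_by_method) & pooled) / len(pooled)


# Toy, made-up identifiers (assumption: each dataset is keyed by its article's identifier).
statement_hits = {"pmid_a", "pmid_b", "pmid_c", "pmid_d"}   # found via data sharing statements
pmid_query_hits = {"pmid_c", "pmid_d", "pmid_e"}            # found via PubMed identifier query

# Three of the five pooled datasets were reached by the identifier query.
toy_recall = recall_against_union(pmid_query_hits, statement_hits, pmid_query_hits)

# The same arithmetic with the counts reported in the abstract:
# 160 datasets from the identifier search out of 209 found by either method.
reported_pmid_recall = round(100.0 * 160 / 209, 1)

print(toy_recall, reported_pmid_recall)
```

Using the union as the denominator means the recall of each method is measured against everything either method could find, not against an independent gold standard.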

Conclusions: Searching database entries using PubMed identifiers can identify the majority of publicly available datasets, but caution is required when this method is used to collect data for policy evaluation, since studies in low-impact journals are disproportionately excluded. We urge authors of all datasets to complete the citation fields for their dataset submissions once publication details are known, thereby ensuring their work has maximum visibility and can contribute to subsequent studies.
