Identifying Genomic Data Sources from Biomedical Literature

AMIA ... Annual Symposium proceedings. AMIA Symposium
Pub Date: 2025-05-22; eCollection Date: 2024-01-01
Xu Zuo, Ashley Gilliam, Yan Hu, Kalpana Raja, Kirk Roberts, Hua Xu
Volume: 2024; Pages: 1350-1359
Publication type: Journal Article
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12099396/pdf/
Citations: 0

Abstract

Genomic research is becoming increasingly data-intensive, yet properly referencing data remains a persistent challenge. Despite various efforts to establish and standardize data citation practices, scientists frequently fail to reference data accurately in their papers. This deficiency complicates the attribution of contributions to data providers and impedes the reproducibility of findings in genomic research. This study addresses the gap by introducing a gold standard corpus designed to identify mentions of genomic data sources and their associated attributes, thereby offering insights into data source availability and accessibility. Within this corpus, we categorize entities into six classes: three primary entities (Dataset, Repository, and Contributor) and three attributes (Accession Number, URL, and DOI). We also define and annotate the relations between these main entities and attributes. We analyze the corpus by assessing inter-annotator agreement and implementing an information extraction pipeline using BERT-based models. Our BERT-based models achieve a best F1 score of 0.94 in recognizing mentions of genomic data sources and 0.76 in extracting relations between these mentions and their associated attributes. By introducing this genomic data source mention corpus, we aim to advance data sharing and reuse in future genomic research.
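To make the annotation scheme concrete, the following Python sketch shows one way the six entity classes and the entity-to-attribute relations could be represented, together with an F1 computation over sets of annotations. It is an illustration only, not the authors' code: the names `Mention`, `Relation`, `make_relation`, `find_mention`, and `f1_score` are hypothetical, the example sentence and accession number are invented, and two assumptions go beyond the abstract: that each relation links one primary entity to one attribute, and that predictions are scored by exact match.

```python
from dataclasses import dataclass

# The six classes named in the abstract.
PRIMARY_ENTITIES = ("Dataset", "Repository", "Contributor")
ATTRIBUTES = ("Accession Number", "URL", "DOI")


@dataclass(frozen=True)
class Mention:
    """A labeled text span (hypothetical representation)."""
    label: str
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    text: str


@dataclass(frozen=True)
class Relation:
    """Assumed shape: one primary entity linked to one attribute."""
    head: Mention
    tail: Mention


def find_mention(label: str, surface: str, sentence: str) -> Mention:
    """Build a Mention from the first occurrence of `surface` in `sentence`."""
    start = sentence.index(surface)
    return Mention(label, start, start + len(surface), surface)


def make_relation(head: Mention, tail: Mention) -> Relation:
    """Create a relation, enforcing the assumed entity-to-attribute direction."""
    if head.label not in PRIMARY_ENTITIES:
        raise ValueError(f"head must be a primary entity, got {head.label!r}")
    if tail.label not in ATTRIBUTES:
        raise ValueError(f"tail must be an attribute, got {tail.label!r}")
    return Relation(head, tail)


def f1_score(gold: set, predicted: set) -> float:
    """Exact-match F1 (an assumed criterion) between two annotation sets."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Invented example sentence; the accession number is made up.
sentence = "Data were deposited in GEO under accession GSE12345."
repository = find_mention("Repository", "GEO", sentence)
accession = find_mention("Accession Number", "GSE12345", sentence)
relation = make_relation(repository, accession)

gold_mentions = {repository, accession}
predicted_mentions = {repository}  # a system that missed the accession number
print(f1_score(gold_mentions, predicted_mentions))
```

Frozen dataclasses are hashable, so mentions and relations can be stored in sets and compared directly, which keeps the scoring function identical for both the entity and the relation task.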
