释放PubMed Central补充数据文件的潜力。

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Bioinformatics advances Pub Date : 2025-06-27 eCollection Date: 2025-01-01 DOI:10.1093/bioadv/vbaf155

Julien Gobeill, Déborah Caucheteur, Alexandre Flament, Pierre-André Michel, Anaïs Mottaz, Emilie Pasche, Patrick Ruch

{"title":"释放PubMed Central补充数据文件的潜力。","authors":"Julien Gobeill, Déborah Caucheteur, Alexandre Flament, Pierre-André Michel, Anaïs Mottaz, Emilie Pasche, Patrick Ruch","doi":"10.1093/bioadv/vbaf155","DOIUrl":null,"url":null,"abstract":"Motivation: Biocuration workflows often rely on comprehensive literature searches for specific biological entities. However, standard search engines such as MEDLINE and PubMed Central provide an incomplete picture of the scientific literature because they do not index the increasing amount of valuable information published in supplementary data files. Over two years, we addressed this gap by systematically extracting text from a large proportion (85%) of these files, resulting in 35 million searchable documents. To assess the information gain provided by supplementary data files beyond the manuscripts, we searched both for mentions of dozens of Global Core Biodata Resources (GCBRs), which are fundamental biological databases essential for the life sciences. We searched for mentions of GCBR names and accession numbers, which uniquely identify biological entities within these resources.Results: The recall gain from using the supplementary data files to search for articles mentioning resource names is 6%. In addition, 97% of all accession numbers identified were published in the supplementary data files, highlighting their increasing importance for highly specific topics or curation pipelines. We show that the number of accession numbers published in the supplementary data files is increasing year on year, but that 87% of these are published in Excel files. This format facilitates human readability and accessibility, but severely limits machine reusability and interoperability. We therefore discuss alternative and complementary approaches to the publication of research data.Availability and implementation: All extracted data are accessible and searchable as a collection on the BiodiversityPMC platform (https://biodiversitypmc.sibils.org/).","PeriodicalId":72368,"journal":{"name":"Bioinformatics advances","volume":"5 1","pages":"vbaf155"},"PeriodicalIF":2.8000,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371329/pdf/","citationCount":"0","resultStr":"{\"title\":\"Unlocking the potential of PubMed Central supplementary data files.\",\"authors\":\"Julien Gobeill, Déborah Caucheteur, Alexandre Flament, Pierre-André Michel, Anaïs Mottaz, Emilie Pasche, Patrick Ruch\",\"doi\":\"10.1093/bioadv/vbaf155\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Motivation: Biocuration workflows often rely on comprehensive literature searches for specific biological entities. However, standard search engines such as MEDLINE and PubMed Central provide an incomplete picture of the scientific literature because they do not index the increasing amount of valuable information published in supplementary data files. Over two years, we addressed this gap by systematically extracting text from a large proportion (85%) of these files, resulting in 35 million searchable documents. To assess the information gain provided by supplementary data files beyond the manuscripts, we searched both for mentions of dozens of Global Core Biodata Resources (GCBRs), which are fundamental biological databases essential for the life sciences. We searched for mentions of GCBR names and accession numbers, which uniquely identify biological entities within these resources.Results: The recall gain from using the supplementary data files to search for articles mentioning resource names is 6%. In addition, 97% of all accession numbers identified were published in the supplementary data files, highlighting their increasing importance for highly specific topics or curation pipelines. We show that the number of accession numbers published in the supplementary data files is increasing year on year, but that 87% of these are published in Excel files. This format facilitates human readability and accessibility, but severely limits machine reusability and interoperability. We therefore discuss alternative and complementary approaches to the publication of research data.Availability and implementation: All extracted data are accessible and searchable as a collection on the BiodiversityPMC platform (https://biodiversitypmc.sibils.org/).\",\"PeriodicalId\":72368,\"journal\":{\"name\":\"Bioinformatics advances\",\"volume\":\"5 1\",\"pages\":\"vbaf155\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371329/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Bioinformatics advances\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1093/bioadv/vbaf155\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioinformatics advances","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1093/bioadv/vbaf155","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

摘要

动机：生物定位工作流程通常依赖于对特定生物实体的全面文献搜索。然而，像MEDLINE和PubMed Central这样的标准搜索引擎提供的科学文献的信息并不完整，因为它们没有对补充数据文件中发表的越来越多的有价值的信息进行索引。在两年多的时间里，我们通过系统地从这些文件的很大一部分（85%）中提取文本来解决这一差距，从而产生了3500万个可搜索的文档。为了评估稿件之外的补充数据文件所提供的信息增益，我们检索了数十个全球核心生物数据资源（GCBRs），这是生命科学必不可少的基础生物数据库。我们搜索了提到的GCBR名称和加入号，它们唯一地标识了这些资源中的生物实体。结果：使用补充数据文件检索含有资源名称的文章，查全率为6%。此外，确定的所有加入号中有97%在补充数据文件中发布，突出了它们对高度特定主题或管理管道的重要性日益增加。我们发现，在补充数据文件中发布的加入号数量逐年增加，但其中87%以Excel文件发布。这种格式促进了人类的可读性和可访问性，但严重限制了机器的可重用性和互操作性。因此，我们讨论了发表研究数据的替代和补充方法。可用性和实施：所有提取的数据都可以在BiodiversityPMC平台（https://biodiversitypmc.sibils.org/）上作为一个集合进行访问和搜索。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

Unlocking the potential of PubMed Central supplementary data files.

查看原文本刊更多论文

Unlocking the potential of PubMed Central supplementary data files.

Motivation: Biocuration workflows often rely on comprehensive literature searches for specific biological entities. However, standard search engines such as MEDLINE and PubMed Central provide an incomplete picture of the scientific literature because they do not index the increasing amount of valuable information published in supplementary data files. Over two years, we addressed this gap by systematically extracting text from a large proportion (85%) of these files, resulting in 35 million searchable documents. To assess the information gain provided by supplementary data files beyond the manuscripts, we searched both for mentions of dozens of Global Core Biodata Resources (GCBRs), which are fundamental biological databases essential for the life sciences. We searched for mentions of GCBR names and accession numbers, which uniquely identify biological entities within these resources.

Results: The recall gain from using the supplementary data files to search for articles mentioning resource names is 6%. In addition, 97% of all accession numbers identified were published in the supplementary data files, highlighting their increasing importance for highly specific topics or curation pipelines. We show that the number of accession numbers published in the supplementary data files is increasing year on year, but that 87% of these are published in Excel files. This format facilitates human readability and accessibility, but severely limits machine reusability and interoperability. We therefore discuss alternative and complementary approaches to the publication of research data.

Availability and implementation: All extracted data are accessible and searchable as a collection on the BiodiversityPMC platform (https://biodiversitypmc.sibils.org/).

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Bioinformatics advances

CiteScore

1.60

自引率

0.00%

发文量