A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications.

IF 11.8 2区 生物学 Q1 MULTIDISCIPLINARY SCIENCES
Jerven Bolleman, Vincent Emonet, Adrian Altenhoff, Amos Bairoch, Marie-Claude Blatter, Alan Bridge, Séverine Duvaud, Elisabeth Gasteiger, Dmitry Kuznetsov, Sébastien Moretti, Pierre-Andre Michel, Anne Morgat, Marco Pagni, Nicole Redaschi, Monique Zahn-Zabal, Tarcisio Mendes de Farias, Ana Claudia Sima
{"title":"A large collection of bioinformatics question-query pairs over federated knowledge graphs: methodology and applications.","authors":"Jerven Bolleman, Vincent Emonet, Adrian Altenhoff, Amos Bairoch, Marie-Claude Blatter, Alan Bridge, Séverine Duvaud, Elisabeth Gasteiger, Dmitry Kuznetsov, Sébastien Moretti, Pierre-Andre Michel, Anne Morgat, Marco Pagni, Nicole Redaschi, Monique Zahn-Zabal, Tarcisio Mendes de Farias, Ana Claudia Sima","doi":"10.1093/gigascience/giaf045","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>In recent decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning (for example, machine-learning algorithms for translating natural language questions to SPARQL), if a sufficiently large number of examples are provided and published in a common, machine-readable, and standardized format across different resources.</p><p><strong>Findings: </strong>We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including almost 100 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.</p><p><strong>Conclusions: </strong>We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. URL:  https://github.com/sib-swiss/sparql-examples.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8000,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12083453/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giaf045","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: In recent decades, several life science resources have structured data using the same framework and made these accessible using the same query language to facilitate interoperability. Knowledge graphs have seen increased adoption in bioinformatics due to their advantages for representing data in a generic graph format. For example, yummydata.org catalogs more than 60 knowledge graphs accessible through SPARQL, a technical query language. Although SPARQL allows powerful, expressive queries, even across physically distributed knowledge graphs, formulating such queries is a challenge for most users. Therefore, to guide users in retrieving the relevant data, many of these resources provide representative examples. These examples can also be an important source of information for machine learning (for example, machine-learning algorithms for translating natural language questions to SPARQL), if a sufficiently large number of examples are provided and published in a common, machine-readable, and standardized format across different resources.

Findings: We introduce a large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs) collected for several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including almost 100 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.

Conclusions: We encourage the community to adopt and extend the proposed methodology, towards richer KG metadata and improved Semantic Web services. URL:  https://github.com/sib-swiss/sparql-examples.

联邦知识图上的生物信息学问题-查询对的大集合:方法和应用。
背景:近几十年来,一些生命科学资源使用相同的框架对数据进行结构化,并使用相同的查询语言对这些数据进行访问,以促进互操作性。知识图谱在生物信息学领域的应用越来越广泛,因为它们具有以通用图形格式表示数据的优势。例如,yummydata.org编目了60多个知识图,可以通过SPARQL(一种技术查询语言)访问。尽管SPARQL允许强大的、富有表现力的查询,甚至可以跨物理分布的知识图进行查询,但是对大多数用户来说,制定这样的查询是一个挑战。因此,为了指导用户检索相关数据,这些资源中的许多都提供了具有代表性的示例。这些示例也可以成为机器学习的重要信息源(例如,将自然语言问题翻译为SPARQL的机器学习算法),如果在不同的资源中以通用的、机器可读的和标准化的格式提供和发布足够数量的示例。研究结果:我们在瑞士生物信息学研究所(SIB Swiss Institute of bioinformatics)的不同研究小组收集的联合生物信息学知识图(KGs)上引入了大量人类编写的自然语言问题及其相应的SPARQL查询。该集合包含1000多个示例问题和查询,其中包括近100个联邦查询。我们提出了一种基于现有标准的方法,以最小的元数据统一表示示例。此外,我们还介绍了一组广泛的开源应用程序,包括查询图形可视化和智能查询编辑器,这些应用程序可以被采用所提出方法的KG维护者轻松重用。结论:我们鼓励社区采用并扩展所建议的方法,以实现更丰富的KG元数据和改进的语义Web服务。URL: https://github.com/sib-swiss/sparql-examples。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
GigaScience
GigaScience MULTIDISCIPLINARY SCIENCES-
CiteScore
15.50
自引率
1.10%
发文量
119
审稿时长
1 weeks
期刊介绍: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信