Assembly and reasoning over semantic mappings at scale for biomedical data integration
Authors: Charles Tapley Hoyt, Klas Karis, Benjamin M Gyori
Journal: Bioinformatics (Oxford, England)
Publication date: 2025-09-29
DOI: https://doi.org/10.1093/bioinformatics/btaf542
Citation count: 0
Abstract
Motivation: Hundreds of resources assign identifiers to biomedical concepts including genes, small molecules, biological processes, diseases, and cell types. Often, these resources overlap by assigning identifiers to the same or related concepts. This creates a data interoperability bottleneck, as integrating data sets and knowledge bases that use identifiers for the same concepts from different resources requires such identifiers to be mapped to each other. However, available mappings are incomplete and fragmented across individual resources, motivating their large-scale integration.
Results: We developed SeMRA, a software tool that integrates mappings from multiple sources into a graph data structure. Using graph algorithms, it infers missing mappings implied by available ones while keeping track of provenance and confidence. This allows connecting identifier spaces between which direct mapping was previously not possible. SeMRA implements a customizable workflow that takes as input a declarative specification describing the sources to integrate, together with additional configuration parameters. We used SeMRA to produce the SeMRA Raw Mappings Database, an aggregation of 43.4 million mappings from 127 sources that jointly cover identifiers from 445 ontologies and databases. We also describe benchmarks on specific use cases such as integrating mappings between resources cataloging diseases and cell types.
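The idea of inferring missing mappings from available ones can be illustrated with a minimal sketch. This is not SeMRA's API or its actual algorithm: the `Mapping` class, the `infer_mappings` function, the `ex1:A`-style identifiers, and the source names are all hypothetical. The sketch also assumes that every mapping is a symmetric exact match and that the confidence of an inferred mapping is the product of the confidences along the chain; the abstract does not state how confidence is combined.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """A hypothetical mapping record (not SeMRA's data model)."""

    subject: str
    target: str
    confidence: float
    sources: tuple[str, ...]  # provenance: where the mapping came from


def infer_mappings(direct: list[Mapping], max_hops: int = 2) -> list[Mapping]:
    """Infer mappings implied by chains of at most ``max_hops`` direct mappings."""
    # Assumption: exact matches are symmetric, so the graph is undirected.
    adjacency = defaultdict(list)
    for m in direct:
        adjacency[m.subject].append((m.target, m.confidence, m.sources))
        adjacency[m.target].append((m.subject, m.confidence, m.sources))
    known = {frozenset((m.subject, m.target)) for m in direct}

    best: dict[tuple[str, str], Mapping] = {}
    for start in list(adjacency):
        # Each frontier entry: (node, confidence so far, sources so far, visited)
        frontier = [(start, 1.0, (), frozenset([start]))]
        for _ in range(max_hops):
            next_frontier = []
            for node, conf, sources, visited in frontier:
                for neighbor, edge_conf, edge_sources in adjacency[node]:
                    if neighbor in visited:
                        continue  # never revisit a node on the same path
                    # Assumption: confidence multiplies along the chain.
                    new_conf = conf * edge_conf
                    new_sources = sources + edge_sources
                    next_frontier.append(
                        (neighbor, new_conf, new_sources, visited | {neighbor})
                    )
                    pair = frozenset((start, neighbor))
                    if pair in known:
                        continue  # a direct mapping already exists
                    key = tuple(sorted(pair))
                    if key not in best or new_conf > best[key].confidence:
                        # Keep the highest-confidence chain and its provenance.
                        best[key] = Mapping(
                            key[0], key[1], new_conf, tuple(sorted(set(new_sources)))
                        )
            frontier = next_frontier
    return sorted(best.values(), key=lambda m: (m.subject, m.target))


# Two direct mappings from two hypothetical sources share the identifier ex2:B.
direct = [
    Mapping("ex1:A", "ex2:B", 0.9, ("source_x",)),
    Mapping("ex2:B", "ex3:C", 0.8, ("source_y",)),
]
inferred = infer_mappings(direct)
for m in inferred:
    print(m.subject, m.target, round(m.confidence, 2), m.sources)
```

In this toy example the two direct mappings imply one additional mapping between `ex1:A` and `ex3:C`, whose provenance lists both contributing sources. Bounding the chain length with `max_hops` is a design choice of the sketch that keeps long, low-confidence chains from being reported.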
Availability: The code is available under the MIT license at https://github.com/biopragmatics/semra. The SeMRA Raw Mappings Database assembled by SeMRA is available at https://doi.org/10.5281/zenodo.11082038.
Supplementary information: Supplementary data are available at Bioinformatics online.