Linking common human diseases to their phenotypes; development of a resource for human phenomics.

IF 2 · CAS Zone 3 (Engineering & Technology) · JCR Q3 · Mathematical & Computational Biology

Journal of Biomedical Semantics · Pub Date: 2021-08-23 · DOI: 10.1186/s13326-021-00249-x

Şenay Kafkas, Sara Althubaiti, Georgios V Gkoutos, Robert Hoehndorf, Paul N Schofield


Citations: 0

Abstract


Background: In recent years a large volume of clinical genomics data has become available due to rapid advances in sequencing technologies. Efficient exploitation of this genomics data requires linkage to patient phenotype profiles. Current resources providing disease-phenotype associations are not comprehensive, and they often do not have broad coverage of the disease terminologies, particularly ICD-10, which is still the primary terminology used in clinical settings.

Methods: We developed two approaches to gather disease-phenotype associations. First, we used a text mining method that utilizes semantic relations in phenotype ontologies, and applies statistical methods to extract associations between diseases in ICD-10 and phenotype ontology classes from the literature. Second, we developed a semi-automatic way to collect ICD-10-phenotype associations from existing resources containing known relationships.
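The abstract does not say which statistic the text mining step applies, so the following is only a minimal sketch of one common choice, normalized pointwise mutual information (NPMI) over document-level co-occurrence. It assumes each abstract has already been annotated with ICD-10 codes and phenotype class identifiers. The function name `npmi_scores` and the toy identifiers are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter
from itertools import product


def npmi_scores(documents):
    """Score disease-phenotype pairs by NPMI over document co-occurrence.

    `documents` is an iterable of (disease_ids, phenotype_ids) pairs, one per
    abstract. This is an assumed input shape, not the authors' pipeline.
    """
    docs = [(set(d), set(p)) for d, p in documents]
    n = len(docs)
    disease_counts, phenotype_counts, pair_counts = Counter(), Counter(), Counter()
    for diseases, phenotypes in docs:
        disease_counts.update(diseases)
        phenotype_counts.update(phenotypes)
        pair_counts.update(product(diseases, phenotypes))

    scores = {}
    for (disease, phenotype), joint in pair_counts.items():
        p_joint = joint / n
        if p_joint == 1.0:
            # The pair occurs in every document; NPMI is defined as 1 here.
            scores[(disease, phenotype)] = 1.0
            continue
        p_disease = disease_counts[disease] / n
        p_phenotype = phenotype_counts[phenotype] / n
        pmi = math.log(p_joint / (p_disease * p_phenotype))
        # Dividing by -log p(d, p) maps the score into [-1, 1].
        scores[(disease, phenotype)] = pmi / -math.log(p_joint)
    return scores


if __name__ == "__main__":
    # Toy corpus with placeholder identifiers (hypothetical annotations).
    toy = [
        ({"E11"}, {"HP:0003074"}),
        ({"E11"}, {"HP:0003074"}),
        ({"I10"}, {"HP:0000822"}),
        ({"I10"}, {"HP:0003074"}),
    ]
    for pair, score in sorted(npmi_scores(toy).items()):
        print(pair, round(score, 3))
```

Pairs that co-occur more often than chance score above zero and can be kept above a chosen threshold. Any such threshold would be a tuning decision that the abstract does not specify.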

Results: We generated four datasets. Two of them are independent datasets linking diseases to their phenotypes based on text mining and semi-automatic strategies. The remaining two datasets are generated from these datasets and cover a subset of ICD-10 classes of common diseases contained in UK Biobank. We extensively validated our text-mined and semi-automatically curated datasets by (1) comparing them against an expert-curated validation dataset containing disease-phenotype associations, (2) measuring their similarity to disease-phenotype associations found in public databases, and (3) assessing how well they could be used to recover gene-disease associations using phenotype similarity.

Conclusion: We find that our text mining method can produce phenotype annotations of diseases that are correct but often too general to have significant information content, or too specific to accurately reflect the typical manifestations of the sporadic disease. On the other hand, the datasets generated from integrating multiple knowledge bases are more complete (i.e., cover more of the required phenotype annotations for a given disease). We make all data freely available at https://doi.org/10.5281/zenodo.4726713.
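To make "too general to have significant information content" concrete, here is a hedged sketch of the usual corpus-based definition, IC(c) = -log p(c), where p(c) is the fraction of diseases annotated with class c or any of its descendants. The toy ontology, the annotation set, and the function names are assumptions for illustration. The abstract does not state how the authors measured information content.

```python
import math


def ancestors(term, parents):
    """Return `term` together with all of its ancestors.

    `parents` maps each class to an iterable of its direct superclasses.
    """
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def information_content(annotations, parents):
    """Compute IC(c) = -log p(c) for every class used directly or implied.

    `annotations` maps a disease to its set of phenotype classes. Each
    annotation is propagated to all superclasses before counting.
    """
    closed = {
        disease: set().union(*(ancestors(t, parents) for t in terms))
        for disease, terms in annotations.items()
    }
    n = len(closed)
    counts = {}
    for terms in closed.values():
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
    return {term: -math.log(count / n) for term, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical three-level ontology and annotations.
    toy_parents = {"specific": ["general"], "general": ["root"], "other": ["root"]}
    toy_annotations = {
        "d1": {"specific"},
        "d2": {"general"},
        "d3": {"other"},
        "d4": {"general"},
    }
    for term, ic in sorted(information_content(toy_annotations, toy_parents).items()):
        print(term, round(ic, 3))
```

Under this definition a class near the ontology root applies to almost every disease and so carries an IC close to zero, while a rarely used leaf class carries a high IC.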


Source journal

Journal of Biomedical Semantics (Mathematical & Computational Biology)

CiteScore: 4.20

Self-citation rate: 5.30%

Articles published:

Review time: 30 weeks

About the journal: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas: (1) infrastructure for biomedical semantics, focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability; and (2) semantic mining, annotation, and analysis, focusing on approaches and applications of semantic resources, and on tools for investigation, reasoning, prediction, and discoveries in biomedicine.