Scalable Text Mining Assisted Curation of Post-Translationally Modified Proteoforms in the Protein Ontology.

CEUR Workshop Proceedings. Pub Date: 2016-08-01. Epub Date: 2016-11-29

Karen E Ross, Darren A Natale, Cecilia Arighi, Sheng-Chih Chen, Hongzhan Huang, Gang Li, Jia Ren, Michael Wang, K Vijay-Shanker, Cathy H Wu

Volume 1747. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5504912/pdf/nihms868567.pdf


Abstract

The Protein Ontology (PRO) defines protein classes and their interrelationships, from the family level to the protein form (proteoform) level, within and across species. One of PRO's unique contributions is its representation of post-translationally modified (PTM) proteoforms. However, adding PTM proteoform classes to PRO has been relatively slow because of the extensive manual curation required. Here we report an automated pipeline for creating PTM proteoform classes. It leverages our integrated PTM database, iPTMnet, together with two phosphorylation-focused text mining tools: RLIMS-P, which detects mentions of kinases, substrates, and phosphorylation sites, and eFIP, which detects phosphorylation-dependent protein-protein interactions (PPIs). By applying this pipeline, we obtained a set of ~820 substrate-site pairs that are suitable for automated PRO term generation with literature-based evidence attribution. Including these terms will increase PRO coverage of species-specific PTM proteoforms by 50%. Many of these new proteoforms also have associated kinase and/or PPI information. Finally, we show a phosphorylation network for the human and mouse peptidyl-prolyl cis-trans isomerase (PIN1/Pin1), derived from our dataset, that demonstrates the biological complexity of the extracted information. Our approach addresses scalability in PRO curation and will be further expanded to advance PRO representation of phosphorylated proteoforms.
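
As an illustration only, the aggregation step implied by the abstract (grouping text-mined phosphorylation mentions into substrate-site pairs that carry their literature evidence) might be sketched in Python as follows. The record layout, field names, and identifiers are hypothetical placeholders and are not taken from RLIMS-P, eFIP, iPTMnet, or PRO. The abstract does not state the pipeline's actual filtering criteria, so the only rule applied here, dropping mentions without a site, is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Mention:
    """One text-mined phosphorylation mention (hypothetical record layout)."""
    pmid: str                          # source article identifier
    substrate: str                     # substrate protein identifier
    site: Optional[str]                # e.g. "S16"; None if no site was detected
    kinase: Optional[str] = None       # kinase, if one was reported
    interactant: Optional[str] = None  # partner in a phosphorylation-dependent PPI


@dataclass
class ProteoformCandidate:
    """A substrate-site pair with the evidence collected for it."""
    substrate: str
    site: str
    kinases: Set[str] = field(default_factory=set)
    interactants: Set[str] = field(default_factory=set)
    pmids: Set[str] = field(default_factory=set)


def collect_candidates(
    mentions: Iterable[Mention],
) -> Dict[Tuple[str, str], ProteoformCandidate]:
    """Group mentions by (substrate, site), keeping PMIDs as evidence."""
    candidates: Dict[Tuple[str, str], ProteoformCandidate] = {}
    for m in mentions:
        if m.site is None:
            # Assumed rule: a candidate needs an explicit modification site.
            continue
        key = (m.substrate, m.site)
        cand = candidates.setdefault(key, ProteoformCandidate(m.substrate, m.site))
        cand.pmids.add(m.pmid)
        if m.kinase:
            cand.kinases.add(m.kinase)
        if m.interactant:
            cand.interactants.add(m.interactant)
    return candidates


# Placeholder data, not real text-mining output.
demo = [
    Mention("PMID_1", "SUB_A", "S16", kinase="KIN_X"),
    Mention("PMID_2", "SUB_A", "S16", interactant="PARTNER_Y"),
    Mention("PMID_3", "SUB_A", None, kinase="KIN_X"),
    Mention("PMID_4", "SUB_B", "T45"),
]
result = collect_candidates(demo)
for (substrate, site), cand in sorted(result.items()):
    print(substrate, site, sorted(cand.kinases), sorted(cand.pmids))
```

Keying on the (substrate, site) pair keeps every supporting article attached to the candidate, which is one simple way to preserve the evidence attribution the abstract mentions.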

