Text mining for pharmacogenomics

Data and Text Mining in Bioinformatics. Pub Date: 2008-10-30. DOI: 10.1145/1458449.1458451

R. Altman



Abstract

We are building the Pharmacogenetics & Pharmacogenomics Knowledgebase (PharmGKB, http://www.pharmgkb.org/) with the goal of cataloguing all knowledge about how genetic variation impacts drug response phenotypes. PharmGKB stores primary data (genotype and phenotype data) as well as more distilled knowledge in the form of pathway diagrams, annotated summaries of very important pharmacogenes (VIP genes), and annotated literature. The literature annotation efforts include both manual curation by trained curators and automatic information extraction. In this talk, I will discuss three projects relevant to our efforts in literature curation:

1. The Pharmspresso project is a simple rule-based system for extracting mentions of gene, drug, disease and polymorphism interactions from text. It is based on the Textpresso system developed at Caltech, but adds specific rules about human drugs, genes and phenotypes. The initial version of Pharmspresso performed well overall but suffered from false-positive extractions, so we have been working to improve its performance while maintaining as much generality as possible. Pharmspresso is available at http://pharmspresso.stanford.edu/

2. The PGxPipeline project builds on the gene-drug-disease associations mined both manually and automatically to do scientific discovery. A critical bottleneck in pharmacogenetics is identifying genes that are likely to be important for modifying drug response. Unless the full details of a drug's action and metabolism are understood, any of the ~25,000 human genes could be important for understanding them. PGxPipeline accepts as input a drug and an indication for use (e.g. pain or high cholesterol). It then uses information from the literature together with information about chemical structure to rank all genes in the human genome by the likelihood that they interact with the drug of interest. In this way, we can prioritize the genes that are most likely to be relevant to the drug. We have found that our ranked lists are useful adjuncts to other independent sources of information, and work best in combination with these.

3. Finally, we have been studying the sites in proteins that bind small molecules (such as drugs) or are important as active sites where the proteins' functions occur. We have clustered these sites based on structural similarity to discover new structural motifs associated with protein function. Very often, we have no knowledge of the function of these newly discovered structural motifs, but the literature often has substantial information about the function of the proteins to which these motifs belong. Our final project, then, is focused on gathering the literature associated with proteins that share a common motif, and determining which words and concepts are likely to describe the common functions of these proteins and therefore the likely significance of the shared structural motifs.
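The rule-based extraction described for Pharmspresso can be illustrated with a minimal Python sketch. It is not Pharmspresso's or Textpresso's actual rule set: the toy lexicons, the polymorphism pattern, and the function names (`split_sentences`, `tag_sentence`, `extract_gene_drug_pairs`) are all assumptions made for illustration. The sketch tags each sentence with category terms and reports every gene-drug pair that co-occurs in the same sentence.

```python
import re
from itertools import product

# Hypothetical toy lexicons; a real system would load curated vocabularies.
LEXICON = {
    "gene": {"CYP2D6", "VKORC1", "TPMT"},
    "drug": {"codeine", "warfarin", "azathioprine"},
    "disease": {"pain", "thrombosis"},
}

# Illustrative polymorphism patterns (an assumption): dbSNP-style rs
# identifiers and star alleles such as *4.
POLYMORPHISM_RE = re.compile(r"\brs\d+\b|\*\d+[A-Z]?")


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tag_sentence(sentence):
    """Return the lexicon terms and polymorphism mentions found in a sentence."""
    found = {category: set() for category in LEXICON}
    found["polymorphism"] = set(POLYMORPHISM_RE.findall(sentence))
    for category, terms in LEXICON.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            if re.search(pattern, sentence, re.IGNORECASE):
                found[category].add(term)
    return found


def extract_gene_drug_pairs(text):
    """Report each gene-drug pair that co-occurs within a single sentence."""
    hits = []
    for sentence in split_sentences(text):
        tags = tag_sentence(sentence)
        for gene, drug in product(sorted(tags["gene"]), sorted(tags["drug"])):
            hits.append({
                "gene": gene,
                "drug": drug,
                "polymorphisms": sorted(tags["polymorphism"]),
                "diseases": sorted(tags["disease"]),
                "sentence": sentence,
            })
    return hits


if __name__ == "__main__":
    example = ("CYP2D6*4 carriers show reduced response to codeine for pain. "
               "Warfarin is an anticoagulant.")
    for hit in extract_gene_drug_pairs(example):
        print(hit)
```

Sentence-level co-occurrence of this kind is deliberately permissive: it also fires on sentences that mention a gene and a drug without asserting any interaction, which is one plausible source of the false-positive extractions mentioned in the abstract.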


