Text mining for pharmacogenomics
R. Altman
Data and Text Mining in Bioinformatics, published 2008-10-30
DOI: 10.1145/1458449.1458451 (https://doi.org/10.1145/1458449.1458451)
Citation count: 0
Abstract
We are building the Pharmacogenetics & Pharmacogenomics Knowledgebase (PharmGKB, http://www.pharmgkb.org/) with the goal of cataloguing all knowledge about how genetic variation impacts drug response phenotypes. PharmGKB stores primary data (genotype and phenotype data) as well as more distilled knowledge in the form of pathway diagrams, annotated summaries of very important pharmacogenes (VIP genes), and annotated literature. The literature annotation efforts include both manual curation by trained curators and automatic information extraction. In this talk, I will discuss three projects relevant to our efforts in literature curation:
1. The Pharmspresso project is a simple rule-based system for extracting mentions of gene, drug, disease and polymorphism interactions from text. It is based on the Textpresso system developed at Caltech, but adds specific rules about human drugs, genes and phenotypes. The initial version of Pharmspresso performed well but suffered from false-positive extractions, so we have been working to improve its performance while maintaining as much generality as possible. Pharmspresso is available at http://pharmspresso.stanford.edu/.
2. The PGxPipeline project builds on the gene-drug-disease associations mined both manually and automatically to do scientific discovery. A critical bottleneck in pharmacogenetics is identifying genes that are likely to be important for modifying drug response. Unless the full details of a drug's action and metabolism are understood, any of the ~25,000 human genes could be important for understanding them. PGxPipeline is built to accept as input a drug and an indication for use (e.g. pain or high cholesterol). It then uses information from the literature together with information about chemical structure to rank all genes in the human genome by the likelihood that they interact with the drug of interest. In this way, we can prioritize the genes that are most likely to be relevant to the drug. We have found that our ranked lists are useful adjuncts to other independent sources of information, and work best in combination with these.
3. Finally, we have been studying the sites in proteins that bind small molecules (such as drugs) or are important as active sites where the proteins' functions occur. We have clustered these sites based on structural similarity to discover new structural motifs associated with protein function. Very often, we have no knowledge of the function of these newly discovered structural motifs, but the literature often has substantial information about the function of the proteins to which these motifs belong. Our final project, then, is focused on gathering the literature associated with proteins that share a common motif, and determining which words and concepts are likely to describe the common functions of these proteins and therefore indicate the likely significance of the shared structural motif.
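
The idea behind the third project, finding words that characterize the literature of proteins sharing a motif, can be illustrated with a small sketch. This is not the method used in the talk, which the abstract does not specify. It assumes a simple stand-in: score each word by a smoothed log-odds ratio of its document frequency in the motif-associated texts against a background set. The function names (`tokenize`, `document_frequency`, `enriched_terms`), the stopword list, the smoothing constants, and the toy documents are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the source).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is",
             "that", "with", "for", "this"}


def tokenize(text):
    """Return the set of distinct lowercase non-stopword tokens in a text."""
    tokens = re.findall(r"[a-z][a-z0-9-]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}


def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts


def enriched_terms(motif_docs, background_docs, top_n=5):
    """Rank words by smoothed log-odds of occurring in motif_docs
    relative to background_docs (higher means more motif-specific)."""
    fg = document_frequency(motif_docs)
    bg = document_frequency(background_docs)
    n_fg, n_bg = len(motif_docs), len(background_docs)
    scores = {}
    for word, k in fg.items():
        # Add-0.5 smoothing keeps both proportions strictly inside (0, 1).
        p_fg = (k + 0.5) / (n_fg + 1.0)
        p_bg = (bg.get(word, 0) + 0.5) / (n_bg + 1.0)
        scores[word] = (math.log(p_fg / (1.0 - p_fg))
                        - math.log(p_bg / (1.0 - p_bg)))
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Invented toy texts standing in for literature about motif-sharing proteins.
motif_docs = [
    "This protein binds calcium in a conserved loop.",
    "Calcium binding regulates the activity of this enzyme.",
    "The domain coordinates calcium ions.",
]
# Invented toy texts standing in for literature about unrelated proteins.
background_docs = [
    "This protein is a kinase that phosphorylates serine.",
    "The enzyme hydrolyzes ATP.",
    "A membrane transporter for glucose.",
    "This protein binds DNA.",
]

if __name__ == "__main__":
    for word, score in enriched_terms(motif_docs, background_docs, top_n=3):
        print(f"{word}\t{score:.2f}")
```

In this toy data the word that appears in every motif-associated text and in no background text ranks first. A real system would need a much larger background corpus, proper term normalization, and some correction for multiple testing before such a ranking could be read as evidence of function.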