Classification Systems for Bacterial Protein-Protein Interaction Document Retrieval
Hongfang Liu, Manabu Torii, Guixian Xu, Johannes Goll
Citations: 9
Abstract
Protein-protein interaction (PPI) networks are essential for understanding the fundamental processes governing cell biology. Studying PPI networks has recently become possible thanks to advances in experimental high-throughput genomics and proteomics technologies. Many interactions from such high-throughput studies, and most interactions from small-scale studies, are reported only in the scientific literature and are therefore not accessible in a readily analyzable format. This has led to manual curation initiatives such as the International Molecular Exchange Consortium (IMEx). Manual curation of PPI knowledge can be accelerated by text mining systems that retrieve PPI-relevant articles (article retrieval) and extract PPI-relevant knowledge (information extraction). In this article, the authors focus on article retrieval and define the task as binary classification in which PPI-relevant articles are positives and all others are negatives. Building such a classifier requires an annotated corpus. Obtaining one manually is very expensive, but a noisy and imbalanced annotated corpus can be obtained automatically: a collection of positive documents can be retrieved from existing PPI knowledge bases, and a large number of unlabeled documents (most of them negatives) can be retrieved from PubMed. The authors compared the performance of several machine learning algorithms while varying the ratio of positives to unlabeled documents and the number of features used.
DOI: 10.4018/jcmam.2010072003. This paper appears in the International Journal of Computational Models and Algorithms in Medicine, Volume 1, Issue 1, pp. 34-44, January-March 2010, edited by Aryya Gangopadhyay. © 2010, IGI Global.
[...]ized way and to avoid duplication of effort, IMEx databases such as IntAct (http://www.ebi.ac.uk/intact), DIP (Database of Interacting Proteins; http://dip.doe-mbi.ucla.edu), MINT (Molecular Interactions Database; http://mint.bio.uniroma2.it/mint) and MPIDB (Microbial Protein Interaction Database; http://www.jcvi.org/mpidb) conduct coordinated manual literature curation. A text mining system that prioritizes articles for curators according to their PPI relevance can accelerate such curation significantly. For example, MPIDB curators scan a whole issue (20 to 50 articles) of the Journal of Bacteriology or Molecular Microbiology and find that approximately 10% of the articles report interaction experiments. Thus, the curators spend roughly 90% of their time reading irrelevant articles. Such a prioritization system can be developed using supervised classification algorithms that provide some form of confidence score during classification. Building it requires a class-labeled corpus in which PPI-relevant documents are labeled as positive and irrelevant ones as negative. In many real-world applications, positive instances are explicitly included in a designated database, but negatives are rarely included as well (Elkan & Noto, 2008). In developing a PPI mining application, PPI-relevant documents can be retrieved from existing PPI knowledge bases, and unlabeled documents are available in large literature repositories such as PubMed.
Learning from only positively labeled documents is of great importance in this application. We treat it as learning from a noisy and imbalanced training set in which unlabeled documents are regarded as negatives, some of them mislabeled. We build a document retrieval system to assist the curation of MPIDB (Goll et al., 2008) and investigate the stability of two document classification algorithms with respect to the ratio of positives to unlabeled documents in the training set, as well as the impact of feature selection on classification performance. We also propose training on different subsets of the unlabeled documents to form an ensemble of classifiers. In the following, we first describe the background of the classification algorithms. We then introduce the experimental methods, present the results and discussion, and conclude.
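The idea of treating unlabeled documents as noisy negatives and combining classifiers trained on different unlabeled subsets can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy documents, the whitespace tokenizer, the choice of a multinomial Naive Bayes scorer, and the names `train_nb`, `score_nb`, `train_ensemble`, and `rank` are hypothetical and are not taken from the paper, which does not name its algorithms in this excerpt. The `ratio` parameter stands in for the ratio of unlabeled documents to positives in each training set.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def train_nb(pos_docs, neg_docs):
    """Train a multinomial Naive Bayes model (add-one smoothing applied at scoring time)."""
    counts = {1: Counter(), 0: Counter()}
    for label, docs in ((1, pos_docs), (0, neg_docs)):
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab = set(counts[1]) | set(counts[0])
    totals = {c: sum(counts[c].values()) for c in counts}
    n = len(pos_docs) + len(neg_docs)
    priors = {1: len(pos_docs) / n, 0: len(neg_docs) / n}
    return counts, totals, vocab, priors


def score_nb(model, doc):
    """Log-odds that doc is positive; used as the confidence score."""
    counts, totals, vocab, priors = model
    log_odds = math.log(priors[1]) - math.log(priors[0])
    v = len(vocab)
    for tok in tokenize(doc):
        if tok not in vocab:
            continue  # ignore words never seen in training
        p_pos = (counts[1][tok] + 1) / (totals[1] + v)
        p_neg = (counts[0][tok] + 1) / (totals[0] + v)
        log_odds += math.log(p_pos) - math.log(p_neg)
    return log_odds


def train_ensemble(positives, unlabeled, ratio=1, n_members=5, seed=0):
    """Each member sees all positives plus a random unlabeled subset used as negatives."""
    rng = random.Random(seed)
    k = min(len(unlabeled), ratio * len(positives))
    return [train_nb(positives, rng.sample(unlabeled, k)) for _ in range(n_members)]


def rank(models, docs):
    """Sort docs by mean ensemble score, most PPI-relevant first."""
    scored = [(sum(score_nb(m, d) for m in models) / len(models), d) for d in docs]
    return sorted(scored, reverse=True)


# Toy data (invented): positives as if drawn from a PPI knowledge base,
# unlabeled as if drawn from PubMed, including one hidden positive.
positives = [
    "yeast two-hybrid assay shows interaction between reca and lexa",
    "pull-down assay confirms binding of ftsz to ftsa",
    "co-immunoprecipitation detects interaction of sigma factor with rna polymerase",
    "two-hybrid screen identifies binding partners of chey",
]
unlabeled = [
    "complete genome sequence of a soil bacterium",
    "antibiotic resistance survey of hospital isolates",
    "growth rate of escherichia coli under heat stress",
    "phylogenetic analysis of ribosomal rna genes",
    "two-hybrid assay reveals interaction between mind and mine",
    "biofilm formation on medical device surfaces",
    "metabolic flux analysis of glucose uptake",
    "survey of plasmid diversity in marine isolates",
]
new_docs = [
    "survey of antibiotic resistance in marine isolates",
    "two-hybrid assay detects interaction between dnaa and dnab",
]

models = train_ensemble(positives, unlabeled, ratio=1, n_members=5)
ranked = rank(models, new_docs)
for score, doc in ranked:
    print(f"{score:+.2f}  {doc}")
```

Averaging the scores over several members is one plausible way to dampen the effect of a hidden positive landing in any single negative sample; a curator would then read the articles from the top of the ranked list downward.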