Protein annotators' assistant: A novel application of information retrieval techniques
M. Wise
Journal of the American Society for Information Science and Technology, pp. 1131-1136, published 2000-10-01
DOI: https://doi.org/10.1002/1097-4571(2000)9999:9999%3C::AID-ASI1020%3E3.0.CO;2-F
Citation count: 3
Abstract
The Protein Annotators' Assistant (PAA) (http://www.ebi.ac.uk/paa/) is a software system that assists protein annotators in assigning functions to newly sequenced proteins. Working backward from SwissProt, a database that describes known proteins, and from a prior sequence similarity search that returns a list of known proteins similar to a query, PAA suggests keywords and phrases that may describe functions performed by the query protein. In a preprocessing step, a database is built from the protein names that appear in SwissProt, and against each protein are listed the keywords and phrases extracted from the corresponding text records. Common words, whether from general English usage or from the biological domain, are removed as the phrases are assembled. This process is assisted by a simple stemming algorithm, which extends the list of stop‐words (i.e., reject words), together with a list of accept‐words. At runtime, the search algorithm, invoked by a user via a Web interface, takes a list of protein names and clusters the named proteins around keywords/phrases shared by members of the list. The assumption is that if these proteins have a particular keyword/phrase in common, and they are related to a query protein, then the keyword/phrase may also describe the query. Overall, PAA employs a number of IR techniques in a novel setting and is thus related to text categorization, where multiple categories may be suggested, except that in this case none of the categories are specified in advance.
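The two stages described in the abstract, phrase extraction during preprocessing and clustering at query time, can be illustrated with a minimal Python sketch. This is not PAA's implementation: the word lists, the suffix-stripping stemmer, the function names, and the `PROT_A`/`PROT_B`/`PROT_C` records are all invented for illustration, and real SwissProt records are far richer than these toy strings.

```python
import re
from collections import defaultdict

# Hypothetical word lists; the abstract does not give PAA's actual lists.
STOP_WORDS = {"the", "of", "in", "a", "and", "is", "to",
              "involved", "protein", "binds"}
ACCEPT_WORDS = {"binding"}  # always kept, even though it stems to a stop-word


def stem(word):
    """Naive suffix stripping, standing in for a 'simple stemming algorithm'."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Stemming lets a single stop-word also reject its inflected forms.
STOP_STEMS = {stem(w) for w in STOP_WORDS}


def is_rejected(word):
    """Accept-words override the stop list; otherwise compare stems."""
    if word in ACCEPT_WORDS:
        return False
    return stem(word) in STOP_STEMS


def extract_phrases(text):
    """Return maximal runs of kept words; stop-words and punctuation end a run."""
    phrases = set()
    for chunk in re.split(r"[.,;:()]", text.lower()):
        run = []
        for word in chunk.split():
            if is_rejected(word):
                if run:
                    phrases.add(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.add(" ".join(run))
    return phrases


def build_index(records):
    """Preprocessing step: map each protein name to its extracted phrases."""
    return {name: extract_phrases(text) for name, text in records.items()}


def cluster_by_phrase(index, names, min_shared=2):
    """Runtime step: group the named proteins around phrases they share."""
    clusters = defaultdict(set)
    for name in names:
        for phrase in index.get(name, ()):  # unknown names are skipped
            clusters[phrase].add(name)
    shared = [(p, sorted(m)) for p, m in clusters.items() if len(m) >= min_shared]
    # Largest clusters first, ties broken alphabetically by phrase.
    return sorted(shared, key=lambda item: (-len(item[1]), item[0]))


# Made-up records standing in for SwissProt text entries.
records = {
    "PROT_A": "Zinc finger protein involved in DNA binding.",
    "PROT_B": "Binds to DNA. Zinc finger proteins.",
    "PROT_C": "Transcription factor involved in DNA binding.",
}
index = build_index(records)

# Names as they might come back from a sequence similarity search.
for phrase, members in cluster_by_phrase(index, ["PROT_A", "PROT_B", "PROT_C"]):
    print(phrase, members)
```

In this sketch, a phrase shared by at least `min_shared` of the listed proteins becomes a candidate description of the query. The threshold and the ordering by cluster size are assumptions of the sketch, not details taken from the abstract.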