Generalized Sequence Signatures through Symbolic Clustering
D. Dorr, A. Denton
Sixth International Conference on Machine Learning and Applications (ICMLA 2007), published 2007-12-13
DOI: 10.1109/ICMLA.2007.41 (https://doi.org/10.1109/ICMLA.2007.41)
Citations: 3
Abstract
Traditionally, sequence motifs and domains, also called signatures, are defined such that insertions, deletions, and mismatched regions are small compared with matched regions. We introduce an algorithm for identifying generalized sequence signatures that can be composed of windows distributed throughout the sequence. Our approach is based on cluster analysis of recurring subsequences of a predefined length, which we refer to as symbols. Symbols are not required to be located in close proximity to each other. The clustering algorithm groups sequences so as to maximize the number of symbols shared among them. We evaluate our signatures against those obtained from the InterPro database and show that our approach offers benefits over InterPro's signatures for deriving sequence annotations.
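The idea of symbols and symbol-based grouping can be illustrated with a minimal Python sketch. This is not the authors' algorithm, which the abstract does not specify in detail. It assumes that a symbol is any length-k window occurring in at least `min_support` sequences, and it swaps in a simple greedy agglomerative merge as a stand-in for the clustering step. The function names, the parameters `k`, `min_support`, and `min_shared`, and the toy sequences are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def extract_symbols(sequences, k=4, min_support=2):
    """Per sequence, keep the length-k windows that occur in at least
    min_support sequences (assumed definition of a recurring symbol)."""
    windows = [{seq[i:i + k] for i in range(len(seq) - k + 1)}
               for seq in sequences]
    # Count in how many sequences each window appears.
    support = Counter(w for ws in windows for w in ws)
    return [{w for w in ws if support[w] >= min_support} for ws in windows]


def cluster_by_shared_symbols(symbol_sets, min_shared=2):
    """Greedy stand-in for the clustering step: repeatedly merge the two
    clusters with the most symbols in common. Each cluster keeps only the
    symbols present in all of its members, which acts as its signature."""
    clusters = [({i}, set(s)) for i, s in enumerate(symbol_sets)]
    while len(clusters) > 1:
        a, b = max(combinations(range(len(clusters)), 2),
                   key=lambda p: len(clusters[p[0]][1] & clusters[p[1]][1]))
        shared = clusters[a][1] & clusters[b][1]
        if len(shared) < min_shared:
            break  # no remaining pair shares enough symbols
        merged = (clusters[a][0] | clusters[b][0], shared)
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters


if __name__ == "__main__":
    # Toy sequences (hypothetical): the shared windows appear in a
    # different order and far apart in each pair.
    seqs = ["MKVLAAGGHW", "GGHWTTMKVL", "PPQRSTPPQR", "QRSTYYPPQR"]
    for members, signature in cluster_by_shared_symbols(extract_symbols(seqs)):
        print(sorted(members), sorted(signature))
```

On the toy input, sequences 0 and 1 end up in one group with the signature MKVL and GGHW, even though the two windows appear in opposite order and are separated by unrelated residues. This is the kind of non-contiguous signature the abstract describes. Pairwise greedy merging is quadratic in the number of clusters per step, so this sketch is only suitable for small inputs.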