Segment and Combine Approach for Biological Sequence Classification

2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology. DOI: 10.1109/CIBCB.2005.1594917

P. Geurts, Antia Blanco Cuesta, L. Wehenkel

Cited by: 7

Abstract

This paper presents a new algorithm, based on the segment and combine paradigm, for the automatic classification of biological sequences. A classifier is first derived by machine learning from a random sample of training subsequences; a sequence is then classified by aggregating the predictions this classifier makes for its subsequences. This generic approach is combined with decision tree based ensemble methods, which scale with respect to both sample size and vocabulary size. The method is applied to three families of problems: DNA sequence recognition, splice junction detection, and gene regulon prediction. Compared with standard approaches based on n-grams, it appears competitive in terms of accuracy, flexibility, and scalability. The paper also highlights the possibility of exploiting the resulting models to identify interpretable patterns specific to a given class of biological sequences.
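As an illustration only, the two steps described in the abstract (learn from randomly sampled subsequences, then aggregate subsequence predictions) might be sketched in Python as follows. The window length, the sample size, the toy sequences, and every function and class name are assumptions made for this sketch. The per-position frequency classifier is a simple stand-in: the paper uses decision tree based ensemble methods, which are not reproduced here.

```python
import math
import random
from collections import Counter, defaultdict


def sample_subsequences(sequences, labels, length, n_samples, rng):
    """Segment step: draw random fixed-length windows, each labelled with
    the class of its parent sequence (sequences must have >= length symbols)."""
    samples = []
    for _ in range(n_samples):
        i = rng.randrange(len(sequences))
        start = rng.randrange(len(sequences[i]) - length + 1)
        samples.append((sequences[i][start:start + length], labels[i]))
    return samples


class PositionalFrequencyClassifier:
    """Hypothetical stand-in subsequence classifier (naive Bayes over window
    positions). NOT the tree ensembles used in the paper."""

    def fit(self, samples, alphabet):
        self.alphabet = alphabet
        self.class_counts = Counter(label for _, label in samples)
        self.counts = defaultdict(Counter)  # (label, position) -> symbol counts
        for window, label in samples:
            for pos, sym in enumerate(window):
                self.counts[(label, pos)][sym] += 1
        return self

    def predict_proba(self, window):
        total = sum(self.class_counts.values())
        log_scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for pos, sym in enumerate(window):
                # Laplace smoothing keeps unseen symbols from giving log(0).
                seen = self.counts[(label, pos)][sym]
                score += math.log((seen + 1) / (n + len(self.alphabet)))
            log_scores[label] = score
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        return {c: w / norm for c, w in weights.items()}


def classify_sequence(model, seq, length):
    """Combine step: average class probabilities over all windows of the
    sequence and return the most probable class with the averaged scores."""
    windows = [seq[i:i + length] for i in range(len(seq) - length + 1)]
    totals = Counter()
    for window in windows:
        for label, p in model.predict_proba(window).items():
            totals[label] += p / len(windows)
    return max(totals, key=totals.get), dict(totals)


# Toy data (invented for this sketch): AT-rich versus GC-rich sequences.
train_seqs = ["ATATATTAATATTA", "TTATAATATATAAT",
              "GCGCCGGCGCGGCC", "CCGCGGCGCCGCGG"]
train_labels = ["at", "at", "gc", "gc"]

rng = random.Random(0)
samples = sample_subsequences(train_seqs, train_labels, 4, 200, rng)
model = PositionalFrequencyClassifier().fit(samples, alphabet="ACGT")

pred_at, probs_at = classify_sequence(model, "TTAATATATA", 4)
pred_gc, probs_gc = classify_sequence(model, "CGCGGCCGCG", 4)
```

Averaging probabilities across windows is only one plausible aggregation rule; voting or summing log-scores would fit the same skeleton, and any classifier exposing a `predict_proba`-style method could replace the stand-in.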
