Scalable, updatable predictive models for sequence data

2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) Pub Date: 2010-12-01 DOI: 10.1109/BIBM.2010.5706652

Neeraj Koul, N. Bui, Vasant G Honavar


Citations: 6

Abstract



The emergence of data-rich domains has led to exponential growth in the size and number of data repositories, offering exciting opportunities to learn from the data using machine learning algorithms. In particular, sequence data is becoming available at a rapid rate. In many applications, the learning algorithm may not have direct access to the entire dataset, for reasons such as massive data size or limited bandwidth. Such settings call for techniques that can learn predictive models (e.g., classifiers) from large datasets without direct access to the data. We describe an approach to learning from massive sequence datasets using statistical queries. Specifically, we show how Markov Models and Probabilistic Suffix Trees (PSTs) can be constructed from sequence databases that answer only a class of count queries. We analyze the query complexity (a measure of the number of queries needed) of constructing classifiers in such settings and outline some techniques to minimize it. We also show how some of the models can be updated in response to the addition or deletion of subsets of sequences from the underlying sequence database.
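To make the count-query setting concrete, here is a minimal sketch of how a fixed-order Markov model might be estimated when the learner can only ask a database how often a string occurs. It is an illustration under stated assumptions, not the authors' algorithm: the `CountQueryDB` class, its `count` method, the brute-force enumeration of all (order+1)-grams, and the add-one smoothing are all hypothetical choices made for this example. Probabilistic Suffix Trees are not covered.

```python
from collections import Counter
from itertools import product


class CountQueryDB:
    """Hypothetical sequence database that answers only substring count queries."""

    def __init__(self, sequences):
        self._sequences = list(sequences)
        self.queries_answered = 0  # crude measure of query complexity

    def count(self, pattern):
        """Return the number of (overlapping) occurrences of pattern in all sequences."""
        self.queries_answered += 1
        total = 0
        for seq in self._sequences:
            for i in range(len(seq) - len(pattern) + 1):
                if seq[i:i + len(pattern)] == pattern:
                    total += 1
        return total


def learn_markov_counts(db, alphabet, order):
    """Collect the count of every (order+1)-gram using count queries only.

    This naive strategy issues len(alphabet) ** (order + 1) queries.
    """
    counts = Counter()
    for gram in product(alphabet, repeat=order + 1):
        pattern = "".join(gram)
        counts[pattern] = db.count(pattern)
    return counts


def transition_prob(counts, alphabet, context, symbol, pseudocount=1.0):
    """Estimate P(symbol | context) with additive smoothing (an assumption here)."""
    context_total = sum(counts[context + a] for a in alphabet)
    numerator = counts[context + symbol] + pseudocount
    return numerator / (context_total + pseudocount * len(alphabet))


def update_counts(counts, delta_counts, sign=1):
    """Fold in counts from an added (sign=1) or deleted (sign=-1) subset of sequences."""
    updated = Counter(counts)
    for gram, c in delta_counts.items():
        updated[gram] += sign * c
    return updated


if __name__ == "__main__":
    alphabet = "ACGT"
    db = CountQueryDB(["ACGT", "ACAC"])
    counts = learn_markov_counts(db, alphabet, order=1)
    print(db.queries_answered)                         # number of queries issued
    print(transition_prob(counts, alphabet, "A", "C"))  # smoothed P(C | A)
```

Because occurrences never span two sequences, n-gram counts are additive over disjoint subsets of sequences. In this sketch that means only the added or deleted subset has to be queried, and the resulting counts are added to or subtracted from the stored ones, rather than relearning the model from the whole database.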
