CUBS: Multivariate Sequence Classification Using Bounded Z-score with Sampling

2010 IEEE International Conference on Data Mining Workshops. Pub Date: 2010-12-13. DOI: 10.1109/ICDMW.2010.38

A. Richardson, G. Kaminka, Sarit Kraus


Citations: 5

Abstract

Multivariate temporal sequence classification is an important and challenging task. Several attempts to address this problem exist, but none provides a full solution. In this paper we present CUBS: Classification Using Bounded Z-Score with Sampling. CUBS uses item set mining to produce frequent subsequences, and then selects the statistically significant ones among them to compose a classification model. We introduce an improved item set mining algorithm that solves the short sequence bias present in many item set mining algorithms. Unfortunately, the z-score normalization hinders pruning. We provide a bound on the z-score to address this issue. Calculating the z-score normalization requires some statistical values of the data, which are gathered from a small sample of the database. The sampling distorts these values. We analyze this distortion and correct it. We evaluate CUBS for accuracy and scalability on a synthetic dataset and on two real-world datasets. The results demonstrate how the short subsequence bias is solved in the mining, and show how our bound and sampling technique enable speedup.
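The abstract does not give the normalization formula, so the following is only a minimal sketch of the general idea behind a z-score that counters short sequence bias. It assumes, purely for illustration, that each pattern's raw support is compared with the mean and standard deviation of the supports of other patterns of the same length; the function name `zscores_by_length` and the toy counts are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscores_by_length(support):
    """Normalize each pattern's support against patterns of the same length.

    `support` maps a pattern (tuple of items) to its raw frequency count.
    Grouping by length is an illustrative assumption, not the paper's formula.
    """
    by_length = defaultdict(list)
    for pattern, count in support.items():
        by_length[len(pattern)].append(count)

    # Mean and population standard deviation of support per pattern length.
    stats = {n: (mean(c), pstdev(c)) for n, c in by_length.items()}

    scores = {}
    for pattern, count in support.items():
        mu, sigma = stats[len(pattern)]
        # A length group with no spread carries no signal; score it as 0.
        scores[pattern] = (count - mu) / sigma if sigma > 0 else 0.0
    return scores


# Toy counts (hypothetical): short patterns are frequent simply because
# they are short, while one longer pattern stands out among its peers.
support = {
    ("a",): 90, ("b",): 100, ("c",): 110,
    ("a", "b"): 10, ("b", "c"): 10, ("a", "c"): 40,
}
scores = zscores_by_length(support)
```

In this toy data, ranking by raw support would place every single-item pattern above `("a", "c")`, whereas ranking by the normalized score places `("a", "c")` above `("c",)`, since it is further from the average for its own length. A score of this kind need not decrease as a pattern grows, which is consistent with the abstract's remark that the normalization hinders pruning and motivates a bound.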

