A source coding approach to classification by vector quantization and the principle of minimum description length

Proceedings DCC 2002. Data Compression Conference Pub Date : 2002-04-02 DOI:10.1109/DCC.2002.999978

Jia Li


Citations: 6

Abstract

An algorithm for supervised classification using vector quantization and entropy coding is presented. The classification rule is formed from a set of training data $\{(X_i, Y_i)\}_{i=1}^{n}$, which are independent samples from a joint distribution $P_{XY}$. Based on the principle of minimum description length (MDL), a statistical model that approximates the distribution $P_{XY}$ ought to enable efficient coding of X and Y. On the other hand, we expect a system that encodes (X, Y) efficiently to provide ample information on the distribution $P_{XY}$. This information can then be used to classify X, i.e., to predict the corresponding Y based on X. To encode both X and Y, a two-stage vector quantizer is applied to X and a Huffman code is formed for Y conditioned on each quantized value of X. The optimization of the encoder is equivalent to the design of a vector quantizer with an objective function reflecting the joint penalty of quantization error and misclassification rate. This vector quantizer provides an estimation of the conditional distribution of Y given X, which in turn yields an approximation to the Bayes classification rule. This algorithm, namely discriminant vector quantization (DVQ), is compared with learning vector quantization (LVQ) and CART® on a number of data sets. DVQ outperforms the other two on several data sets. The relation between DVQ, density estimation, and regression is also discussed.
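The idea in the abstract can be illustrated with a rough sketch. The code below is not the paper's algorithm: it assumes a single-stage quantizer instead of the two-stage one, replaces the Huffman code with the ideal code length $-\log_2 \hat{p}(y \mid \text{cell})$, and uses a Lloyd-style loop whose cost is squared quantization error plus a weighted label code length. The function names (`fit_dvq_sketch`, `predict`), the weight `lam`, the Laplace smoothing, and the toy data are all hypothetical choices made for illustration.

```python
import numpy as np

def fit_dvq_sketch(X, y, n_codewords=4, n_classes=2, lam=1.0, n_iter=20, seed=0):
    """Lloyd-style training with a joint cost (assumed form, not the paper's):
    squared quantization error + lam * ideal code length of the label."""
    rng = np.random.default_rng(seed)
    # Initialise codewords from distinct random training points (sketch assumption).
    codebook = X[rng.choice(len(X), size=n_codewords, replace=False)].astype(float)
    # Start from uniform conditional label distributions p(y | cell).
    cond = np.full((n_codewords, n_classes), 1.0 / n_classes)
    for _ in range(n_iter):
        # Squared distance from every sample to every codeword, shape (n, K).
        dist = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        # Bits needed to code each sample's label under each cell, shape (n, K).
        bits = -np.log2(cond[:, y].T)
        assign = np.argmin(dist + lam * bits, axis=1)
        for k in range(n_codewords):
            members = assign == k
            if not members.any():
                continue  # an empty cell keeps its previous codeword and distribution
            codebook[k] = X[members].mean(axis=0)
            counts = np.bincount(y[members], minlength=n_classes)
            # Laplace smoothing keeps every probability positive, so log2 stays finite.
            cond[k] = (counts + 1.0) / (counts.sum() + n_classes)
    return codebook, cond

def predict(X, codebook, cond):
    """Quantize X to the nearest codeword, then pick the most probable label
    in that cell (a plug-in approximation to the Bayes rule)."""
    dist = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return cond[np.argmin(dist, axis=1)].argmax(axis=1)

# Toy data: two well-separated Gaussian blobs, one per class (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[3.0, 0.0], scale=0.5, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

codebook, cond = fit_dvq_sketch(X, y)
accuracy = (predict(X, codebook, cond) == y).mean()
print("training accuracy:", accuracy)
```

During training the label is available, so it enters the assignment cost; at prediction time only X is known, so the sketch falls back to nearest-codeword quantization and reads the label off the estimated conditional distribution of that cell.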

