A source coding approach to classification by vector quantization and the principle of minimum description length

Jia Li
{"title":"A source coding approach to classification by vector quantization and the principle of minimum description length","authors":"Jia Li","doi":"10.1109/DCC.2002.999978","DOIUrl":null,"url":null,"abstract":"An algorithm for supervised classification using vector quantization and entropy coding is presented. The classification rule is formed from a set of training data {(X/sub i/, Y/sub i/)}/sub i=1//sup n/, which are independent samples from a joint distribution P/sub XY/. Based on the principle of minimum description length (MDL), a statistical model that approximates the distribution P/sub XY/ ought to enable efficient coding of X and Y. On the other hand, we expect a system that encodes (X, Y) efficiently to provide ample information on the distribution P/sub XY/. This information can then be used to classify X, i.e., to predict the corresponding Y based on X. To encode both X and Y, a two-stage vector quantizer is applied to X and a Huffman code is formed for Y conditioned on each quantized value of X. The optimization of the encoder is equivalent to the design of a vector quantizer with an objective function reflecting the joint penalty of quantization error and misclassification rate. This vector quantizer provides an estimation of the conditional distribution of Y given X, which in turn yields an approximation to the Bayes classification rule. This algorithm, namely discriminant vector quantization (DVQ), is compared with learning vector quantization (LVQ) and CART/sup R/ on a number of data sets. DVQ outperforms the other two on several data sets. The relation between DVQ, density estimation, and regression is also discussed.","PeriodicalId":420897,"journal":{"name":"Proceedings DCC 2002. Data Compression Conference","volume":"55 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings DCC 2002. Data Compression Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DCC.2002.999978","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 6

Abstract

An algorithm for supervised classification using vector quantization and entropy coding is presented. The classification rule is formed from a set of training data {(X_i, Y_i)}_{i=1}^n, which are independent samples from a joint distribution P_XY. Based on the principle of minimum description length (MDL), a statistical model that approximates the distribution P_XY ought to enable efficient coding of X and Y. On the other hand, we expect a system that encodes (X, Y) efficiently to provide ample information on the distribution P_XY. This information can then be used to classify X, i.e., to predict the corresponding Y based on X. To encode both X and Y, a two-stage vector quantizer is applied to X and a Huffman code is formed for Y conditioned on each quantized value of X. The optimization of the encoder is equivalent to the design of a vector quantizer with an objective function reflecting the joint penalty of quantization error and misclassification rate. This vector quantizer provides an estimate of the conditional distribution of Y given X, which in turn yields an approximation to the Bayes classification rule. This algorithm, namely discriminant vector quantization (DVQ), is compared with learning vector quantization (LVQ) and CART® on a number of data sets. DVQ outperforms the other two on several data sets. The relation between DVQ, density estimation, and regression is also discussed.
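The following is a minimal Python sketch of the idea conveyed by the abstract, not the paper's actual algorithm: it replaces the two-stage quantizer and Huffman coding with a single K-means-style codebook whose assignment cost adds the empirical code length of the label, -log2 P(Y | cell), to the squared quantization error. The trade-off weight `lam`, the smoothing, and the alternating update scheme are all assumptions made for illustration.

```python
# Hedged sketch of a discriminant-style vector quantizer: the assignment step
# penalizes both quantization error and the bits needed to encode Y given the cell.
import numpy as np

def fit_dvq(X, Y, n_cells=8, lam=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    labels = np.unique(Y)                      # sorted class labels
    # initialize codewords from random training points
    codebook = X[rng.choice(len(X), n_cells, replace=False)].copy()
    # start with uniform conditional label distributions per cell
    cond = np.full((n_cells, len(labels)), 1.0 / len(labels))
    for _ in range(n_iter):
        # assignment cost: squared distortion + lam * code length of Y given the cell
        dist = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)      # (n, K)
        ybits = -np.log2(cond[:, np.searchsorted(labels, Y)].T + 1e-12)   # (n, K)
        cell = np.argmin(dist + lam * ybits, axis=1)
        # update step: cell centroids and empirical conditional label distributions
        for k in range(n_cells):
            mask = cell == k
            if mask.any():
                codebook[k] = X[mask].mean(axis=0)
                counts = np.array([(Y[mask] == c).sum() for c in labels], float)
                cond[k] = (counts + 1.0) / (counts.sum() + len(labels))   # Laplace smoothing
    return codebook, cond, labels

def predict_dvq(X, codebook, cond, labels):
    # classify by the most probable label in the nearest cell,
    # i.e. an approximation to the Bayes rule from the estimated P(Y | quantized X)
    dist = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    cell = np.argmin(dist, axis=1)
    return labels[np.argmax(cond[cell], axis=1)]
```

In this sketch, `lam` plays the role of the joint penalty weight between quantization error and misclassification cost: with lam = 0 the quantizer reduces to ordinary K-means, while larger values pull cell boundaries toward class boundaries, mirroring the trade-off the abstract describes.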