基于向量特征表示的文本分类

2014 11th IAPR International Workshop on Document Analysis Systems Pub Date : 2014-04-01 DOI:10.1109/DAS.2014.10

Shengxin Zha, Xujun Peng, Huaigu Cao, Xiaodan Zhuang, P. Natarajan, P. Natarajan

{"title":"基于向量特征表示的文本分类","authors":"Shengxin Zha, Xujun Peng, Huaigu Cao, Xiaodan Zhuang, P. Natarajan, P. Natarajan","doi":"10.1109/DAS.2014.10","DOIUrl":null,"url":null,"abstract":"In this paper, we address the problem of text classification: classifying modern machine-printed text, handwritten text and historical typewritten text from degraded noisy documents. We propose a novel text classification approach based on iVector, a newly developed concept in speaker verification. To a given text line, the iVector is a fixed-length feature vector representation, transformed from a high-dimensional super vector based on means of Gaussian mixture model (GMM), where the text dependent component is separated from a universal background model (UBM) and can be represented by a low dimensional set of factors. We classify the text lines with a discriminative classifier - support vector machine (SVM) in iVector space. A baseline approach of text classification using GMM in feature space is also presented for evaluation purpose. Experimental results on an Arabic document database show accuracy of 92.04% for text line classification using the proposed method. Furthermore, the relative word error rate (WER) of 9.6% is decreased in optical character recognition (OCR) when coupled with the proposed iVector-SVM classifier. The proposed iVector-SVM approach is language independent, thus, can be applied to other scripts as well.","PeriodicalId":220495,"journal":{"name":"2014 11th IAPR International Workshop on Document Analysis Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Text Classification via iVector Based Feature Representation\",\"authors\":\"Shengxin Zha, Xujun Peng, Huaigu Cao, Xiaodan Zhuang, P. Natarajan, P. Natarajan\",\"doi\":\"10.1109/DAS.2014.10\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we address the problem of text classification: classifying modern machine-printed text, handwritten text and historical typewritten text from degraded noisy documents. We propose a novel text classification approach based on iVector, a newly developed concept in speaker verification. To a given text line, the iVector is a fixed-length feature vector representation, transformed from a high-dimensional super vector based on means of Gaussian mixture model (GMM), where the text dependent component is separated from a universal background model (UBM) and can be represented by a low dimensional set of factors. We classify the text lines with a discriminative classifier - support vector machine (SVM) in iVector space. A baseline approach of text classification using GMM in feature space is also presented for evaluation purpose. Experimental results on an Arabic document database show accuracy of 92.04% for text line classification using the proposed method. Furthermore, the relative word error rate (WER) of 9.6% is decreased in optical character recognition (OCR) when coupled with the proposed iVector-SVM classifier. The proposed iVector-SVM approach is language independent, thus, can be applied to other scripts as well.\",\"PeriodicalId\":220495,\"journal\":{\"name\":\"2014 11th IAPR International Workshop on Document Analysis Systems\",\"volume\":\"7 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 11th IAPR International Workshop on Document Analysis Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DAS.2014.10\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 11th IAPR International Workshop on Document Analysis Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DAS.2014.10","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

在本文中，我们解决了文本分类问题:从退化的噪声文档中对现代机器打印文本、手写文本和历史打字文本进行分类。本文提出了一种基于向量的文本分类方法。对于给定的文本行，向量是由基于高斯混合模型(GMM)均值的高维超向量转换而来的定长特征向量表示，其中文本依赖分量与通用背景模型(UBM)分离，可以由低维因子集表示。我们在向量空间中使用判别分类器—支持向量机(SVM)对文本行进行分类。本文还提出了一种在特征空间中使用GMM进行文本分类的基线方法。在一个阿拉伯语文档数据库上的实验结果表明，该方法对文本行分类的准确率达到92.04%。此外，与所提出的向量-支持向量机分类器相结合，光学字符识别(OCR)的相对单词错误率(WER)降低了9.6%。所提出的向量-支持向量机方法与语言无关，因此也可以应用于其他脚本。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Text Classification via iVector Based Feature Representation

In this paper, we address the problem of text classification: classifying modern machine-printed text, handwritten text and historical typewritten text from degraded noisy documents. We propose a novel text classification approach based on iVector, a newly developed concept in speaker verification. To a given text line, the iVector is a fixed-length feature vector representation, transformed from a high-dimensional super vector based on means of Gaussian mixture model (GMM), where the text dependent component is separated from a universal background model (UBM) and can be represented by a low dimensional set of factors. We classify the text lines with a discriminative classifier - support vector machine (SVM) in iVector space. A baseline approach of text classification using GMM in feature space is also presented for evaluation purpose. Experimental results on an Arabic document database show accuracy of 92.04% for text line classification using the proposed method. Furthermore, the relative word error rate (WER) of 9.6% is decreased in optical character recognition (OCR) when coupled with the proposed iVector-SVM classifier. The proposed iVector-SVM approach is language independent, thus, can be applied to other scripts as well.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 11th IAPR International Workshop on Document Analysis Systems

自引率

0.00%

发文量