Shengxin Zha, Xujun Peng, Huaigu Cao, Xiaodan Zhuang, P. Natarajan, P. Natarajan
2014 11th IAPR International Workshop on Document Analysis Systems. Published 2014-04-01. DOI: 10.1109/DAS.2014.10 (https://doi.org/10.1109/DAS.2014.10)
Text Classification via iVector Based Feature Representation
In this paper, we address the problem of text classification: distinguishing modern machine-printed text, handwritten text, and historical typewritten text in degraded, noisy documents. We propose a novel text classification approach based on the iVector, a recently developed concept in speaker verification. For a given text line, the iVector is a fixed-length feature vector obtained by transforming a high-dimensional supervector built from the means of a Gaussian mixture model (GMM); the text-dependent component is separated from a universal background model (UBM) and represented by a low-dimensional set of factors. We classify the text lines in iVector space with a discriminative classifier, a support vector machine (SVM). A baseline approach that classifies text using GMMs in feature space is also presented for evaluation purposes. Experimental results on an Arabic document database show an accuracy of 92.04% for text line classification using the proposed method. Furthermore, coupling optical character recognition (OCR) with the proposed iVector-SVM classifier yields a relative word error rate (WER) reduction of 9.6%. The proposed iVector-SVM approach is language independent and can therefore be applied to other scripts as well.
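The abstract does not give the equations behind the iVector, so the following is only a minimal sketch of the formulation commonly used in speaker verification, where the supervector of a text line is modelled as M = m + Tw (m is the UBM mean supervector, T a low-rank matrix, and w the iVector). It assumes a diagonal-covariance UBM and an already trained T; the function names `baum_welch_stats` and `extract_ivector`, the array shapes, and the ordering of the supervector by component are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def baum_welch_stats(frames, weights, means, variances):
    """Zeroth-order and centered first-order statistics of the frames
    of one text line under a diagonal-covariance UBM.

    frames: (n, D), weights: (C,), means: (C, D), variances: (C, D)
    """
    # Log-density of every frame under every component, shape (n, C).
    diff = frames[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * (
        np.sum(diff ** 2 / variances[None, :, :], axis=2)
        + np.sum(np.log(2.0 * np.pi * variances), axis=1)[None, :]
    )
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)          # component posteriors

    n_c = post.sum(axis=0)                           # (C,) soft frame counts
    f_c = post.T @ frames - n_c[:, None] * means     # (C, D) centered on UBM means
    return n_c, f_c


def extract_ivector(n_c, f_c, T, variances):
    """Posterior mean of w in M = m + T w, given the statistics above.

    T: (C*D, R) matrix, assumed to be trained beforehand (hypothetical input).
    Computes w = (I + T^T S^-1 N T)^-1 T^T S^-1 f, with S the UBM variances.
    """
    C, D = f_c.shape
    R = T.shape[1]
    n_rep = np.repeat(n_c, D)                 # one count per supervector entry
    inv_var = 1.0 / variances.reshape(C * D)
    Tt_inv = T.T * inv_var                    # T^T S^-1, shape (R, C*D)
    precision = np.eye(R) + (Tt_inv * n_rep) @ T
    return np.linalg.solve(precision, Tt_inv @ f_c.reshape(C * D))
```

In a pipeline like the one the abstract describes, each text line would yield one such fixed-length vector w, and those vectors would then be passed to an SVM; training T and the SVM is outside the scope of this sketch.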