{"title":"From sBoW to dCoT marginalized encoders for text representation","authors":"Z. Xu, Minmin Chen, Kilian Q. Weinberger, Fei Sha","doi":"10.1145/2396761.2398536","DOIUrl":null,"url":null,"abstract":"In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g. TF-IDF [1]). Although simple and intuitive, sBoW style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g. in document classification), or the text documents are short (e.g. emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm to learn improved sBoW document features. dCoT explicitly models absent words by removing and reconstructing random sub-sets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved for in closed-form. We demonstrate empirically, on several benchmark datasets, that dCoT features significantly improve the classification accuracy across several document classification tasks.","PeriodicalId":313414,"journal":{"name":"Proceedings of the 21st ACM international conference on Information and knowledge management","volume":"97 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"26","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21st ACM international conference on Information and knowledge management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2396761.2398536","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 26
Abstract
In text mining, information retrieval, and machine learning, text documents are commonly represented through variants of sparse Bag of Words (sBoW) vectors (e.g., TF-IDF [1]). Although simple and intuitive, sBoW-style representations suffer from their inherent over-sparsity and fail to capture word-level synonymy and polysemy. Especially when labeled data is limited (e.g., in document classification), or the text documents are short (e.g., emails or abstracts), many features are rarely observed within the training corpus. This leads to overfitting and reduced generalization accuracy. In this paper we propose Dense Cohort of Terms (dCoT), an unsupervised algorithm that learns improved document features from sBoW inputs. dCoT explicitly models absent words by removing and reconstructing random subsets of words in the unlabeled corpus. With this approach, dCoT learns to reconstruct frequent words from co-occurring infrequent words and maps the high-dimensional sparse sBoW vectors into a low-dimensional dense representation. We show that the feature removal can be marginalized out and that the reconstruction can be solved in closed form. We demonstrate empirically that dCoT features significantly improve classification accuracy across several benchmark document classification tasks.
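The closed-form marginalization the abstract describes is in the family of marginalized denoising encoders: each input word is dropped with some probability p, the expectations over all corruptions are computed analytically, and the linear reconstruction is obtained by solving a single least-squares system. Below is a minimal NumPy sketch of that idea under those assumptions; the function name `marginalized_reconstruction`, the ridge term `lam`, and the `tanh` squashing are illustrative choices on our part, not details taken from the paper.

```python
import numpy as np

def marginalized_reconstruction(X, freq_idx, p=0.5, lam=1e-5):
    """Closed-form marginalized denoising map (sketch, not the authors' code).

    X        : D x n matrix of sBoW document vectors (features x documents)
    freq_idx : indices of the d most frequent terms to reconstruct
    p        : probability that each input word (feature) is removed
    lam      : small ridge term for numerical stability (our addition)

    Returns W (d x D) mapping a sparse sBoW vector to a dense d-dim code.
    """
    D, n = X.shape
    S = X @ X.T                      # D x D scatter matrix of the corpus
    q = np.full(D, 1.0 - p)         # per-feature survival probabilities

    # Q = E[x_tilde x_tilde^T]: an off-diagonal pair survives corruption
    # with probability (1-p)^2, a diagonal entry with probability (1-p).
    Q = S * np.outer(q, q)
    np.fill_diagonal(Q, q * np.diag(S))

    # P = E[x_freq x_tilde^T]: targets are the uncorrupted frequent-word
    # counts, so only the input feature's survival probability appears.
    P = S[freq_idx, :] * q[np.newaxis, :]

    # Solve W Q = P in closed form (Q is symmetric; ridge for stability).
    W = np.linalg.solve(Q + lam * np.eye(D), P.T).T
    return W
```

A new document x (a D-dimensional sBoW vector) would then be encoded as h = np.tanh(W @ x), giving the dense d-dimensional representation; choosing freq_idx as the d most frequent corpus terms mirrors the abstract's goal of reconstructing frequent words from co-occurring infrequent ones.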