Integrated instance- and class-based generative modeling for text classification

Australasian Document Computing Symposium. Pub Date: 2013-12-05. DOI: 10.1145/2537734.2537751

Antti Puurula, Sung-Hyon Myaeng


Citations: 10

Abstract

Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.
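
The kernel density view mentioned in the abstract can be illustrated with a small sketch. It is not the authors' TDM implementation: the interpolation scheme, the uniform background model, the weights `class_weight` and `bg_weight`, and all function names are assumptions chosen for illustration. Each training document acts as a multinomial kernel smoothed toward its class model, and a class is scored by averaging those kernels over its documents. With `class_weight=1.0` every kernel collapses to the class model and the score reduces to a smoothed Multinomial Naive Bayes score.

```python
import math
from collections import Counter, defaultdict


def train(docs, labels):
    """Keep per-document counts grouped by class, plus pooled class counts."""
    class_docs = defaultdict(list)        # label -> list of per-document Counters
    class_counts = defaultdict(Counter)   # label -> word counts pooled over the class
    vocab = set()
    for tokens, label in zip(docs, labels):
        counts = Counter(tokens)
        class_docs[label].append(counts)
        class_counts[label].update(counts)
        vocab.update(tokens)
    return {
        "class_docs": dict(class_docs),
        "class_counts": dict(class_counts),
        "vocab_size": len(vocab),
        "n_docs": len(docs),
    }


def class_word_prob(model, label, word, bg_weight):
    """Class model p(w|c), interpolated with a uniform background (assumption)."""
    counts = model["class_counts"][label]
    total = sum(counts.values())
    ml = counts[word] / total if total else 0.0
    return (1.0 - bg_weight) * ml + bg_weight / model["vocab_size"]


def class_log_score(model, tokens, label, class_weight=0.5, bg_weight=0.1):
    """log p(c) + log of the mean, over documents in c, of the kernel likelihoods.

    class_weight and bg_weight must both be > 0 so no probability is zero.
    """
    test_counts = Counter(tokens)
    doc_lls = []
    for doc in model["class_docs"][label]:
        doc_len = sum(doc.values())
        ll = 0.0
        for word, n in test_counts.items():
            ml = doc[word] / doc_len if doc_len else 0.0
            # Document kernel: document estimate smoothed by the class model.
            p = (1.0 - class_weight) * ml + class_weight * class_word_prob(
                model, label, word, bg_weight
            )
            ll += n * math.log(p)
        doc_lls.append(ll)
    # Log-sum-exp over document kernels, then divide by the number of kernels.
    m = max(doc_lls)
    log_mix = m + math.log(sum(math.exp(l - m) for l in doc_lls)) - math.log(len(doc_lls))
    log_prior = math.log(len(doc_lls) / model["n_docs"])
    return log_prior + log_mix


def classify(model, tokens, **kwargs):
    """Return the label with the highest class score."""
    return max(model["class_docs"], key=lambda c: class_log_score(model, tokens, c, **kwargs))


docs = [["good", "great", "fun"], ["great", "love"], ["bad", "awful"], ["boring", "bad", "dull"]]
labels = ["pos", "pos", "neg", "neg"]
model = train(docs, labels)
print(classify(model, ["great", "fun"]))
```

Scoring a test document this way touches every training document of every class, which is why the stored instances matter here while a purely class-based model can discard them after training.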


