集成了基于实例和类的文本分类生成建模

Australasian Document Computing Symposium Pub Date : 2013-12-05 DOI:10.1145/2537734.2537751

Antti Puurula, Sung-Hyon Myaeng

{"title":"集成了基于实例和类的文本分类生成建模","authors":"Antti Puurula, Sung-Hyon Myaeng","doi":"10.1145/2537734.2537751","DOIUrl":null,"url":null,"abstract":"Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.","PeriodicalId":402985,"journal":{"name":"Australasian Document Computing Symposium","volume":"6 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Integrated instance- and class-based generative modeling for text classification\",\"authors\":\"Antti Puurula, Sung-Hyon Myaeng\",\"doi\":\"10.1145/2537734.2537751\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.\",\"PeriodicalId\":402985,\"journal\":{\"name\":\"Australasian Document Computing Symposium\",\"volume\":\"6 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-12-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Australasian Document Computing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2537734.2537751\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Australasian Document Computing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2537734.2537751","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

文本分类的统计方法主要基于基于类的学习范式，将类变量与特征相关联，在模型训练后丢弃数据实例。这将产生高效的模型，但忽略了单个文档中存在的细粒度信息。基于实例的学习使用这些信息，但在文本数据方面存在数据稀疏性问题。在本文中，我们提出了一种称为绑定文档混合(TDM)的生成模型，用于用层次平滑模型的混合扩展多项朴素贝叶斯(MNB)。另外，TDM可以看作是使用类平滑多项式核的核密度分类器。TDM在14个不同的数据集上对多标签、多类和二类文本分类任务的分类精度进行了评估，并与基于实例和基于类的学习基线进行了比较。与MNB的比较表明，作为每类可用训练文档的函数，准确率有了实质性的提高，情感分类的平均误差减少了26%以上，垃圾邮件分类的平均误差减少了65%。平均而言，TDM与最佳判别分类器一样准确，但保留了基于实例的学习方法的线性时间复杂性，并具有用于模型估计和推理的精确算法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Integrated instance- and class-based generative modeling for text classification

Statistical methods for text classification are predominantly based on the paradigm of class-based learning that associates class variables with features, discarding the instances of data after model training. This results in efficient models, but neglects the fine-grained information present in individual documents. Instance-based learning uses this information, but suffers from data sparsity with text data. In this paper, we propose a generative model called Tied Document Mixture (TDM) for extending Multinomial Naive Bayes (MNB) with mixtures of hierarchically smoothed models for documents. Alternatively, TDM can be viewed as a Kernel Density Classifier using class-smoothed Multinomial kernels. TDM is evaluated for classification accuracy on 14 different datasets for multi-label, multi-class and binary-class text classification tasks and compared to instance- and class-based learning baselines. The comparisons to MNB demonstrate a substantial improvement in accuracy as a function of available training documents per class, ranging up to average error reductions of over 26% in sentiment classification and 65% in spam classification. On average TDM is as accurate as the best discriminative classifiers, but retains the linear time complexities of instance-based learning methods, with exact algorithms for both model estimation and inference.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Australasian Document Computing Symposium

自引率

0.00%

发文量