{"title":"Semi-Supervised Text Classification Using EM","authors":"K. Nigam, A. McCallum, Tom Michael Mitchell","doi":"10.7551/mitpress/9780262033589.003.0003","DOIUrl":null,"url":null,"abstract":"For several decades, statisticians have advocated using a combination of labeled and unlabeled data to train classifiers by estimating parameters of a generative model through iterative Expectation-Maximization (EM) techniques. This chapter explores the effectiveness of this approach when applied to the domain of text classification. Text documents are represented here with a bag-of-words model, which leads to a generative classification model based on a mixture of multinomials. This model is an extremely simplistic representation of the complexities of written text. This chapter explains and illustrates three key points about semi-supervised learning for text classification with generative models. First, despite the simplistic representation, some text domains have a high positive correlation between generative model probability and classification accuracy. In these domains, a straightforward application of EM with the naive Bayes text model works well. Second, some text domains do not have this correlation. Here we can adopt a more expressive and appropriate generative model that does have a positive correlation. In these domains, semi-supervised learning again improves classification accuracy. Finally, EM suffers from the problem of local maxima, especially in high dimension domains such as text classification. We demonstrate that deterministic annealing, a variant of EM, can help overcome the problem of local maxima and increase classification accuracy further when the generative model is appropriate.","PeriodicalId":345393,"journal":{"name":"Semi-Supervised Learning","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"165","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Semi-Supervised Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.7551/mitpress/9780262033589.003.0003","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 165
Abstract
For several decades, statisticians have advocated combining labeled and unlabeled data to train classifiers, estimating the parameters of a generative model with iterative Expectation-Maximization (EM) techniques. This chapter explores the effectiveness of this approach for text classification. Text documents are represented with a bag-of-words model, which leads to a generative classification model based on a mixture of multinomials. This model is a drastically simplified representation of the complexities of written text. The chapter explains and illustrates three key points about semi-supervised learning for text classification with generative models. First, despite the simplistic representation, some text domains exhibit a strong positive correlation between generative model probability and classification accuracy; in these domains, a straightforward application of EM with the naive Bayes text model works well. Second, other text domains lack this correlation; there, adopting a more expressive and appropriate generative model that does exhibit a positive correlation lets semi-supervised learning again improve classification accuracy. Finally, EM suffers from local maxima, especially in high-dimensional domains such as text classification. We demonstrate that deterministic annealing, a variant of EM, can help overcome the problem of local maxima and further increase classification accuracy when the generative model is appropriate.
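To make the procedure concrete, below is a minimal NumPy sketch of the basic algorithm the abstract describes: multinomial naive Bayes trained by EM, where labeled documents keep their hard labels and unlabeled documents receive probabilistic labels that are re-estimated each iteration. It assumes documents are already count-vectorized; the function names, the smoothing constant, and the `beta` annealing parameter (a stand-in for the chapter's deterministic-annealing temperature) are illustrative, not taken from the chapter.

```python
import numpy as np

def m_step(X, R, alpha=1.0):
    # MAP estimates of the multinomial naive Bayes parameters from
    # (possibly fractional) class responsibilities R, with Laplace smoothing.
    class_counts = R.sum(axis=0) + alpha
    log_prior = np.log(class_counts / class_counts.sum())
    word_counts = R.T @ X + alpha                        # shape (C, V)
    log_theta = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return log_prior, log_theta

def e_step(X, log_prior, log_theta, beta=1.0):
    # Posterior class responsibilities P(class | doc). With beta < 1 this
    # becomes a deterministic-annealing flavored E-step: the posterior is
    # flattened, and beta would be raised toward 1 over successive rounds.
    log_joint = beta * (X @ log_theta.T + log_prior)     # shape (N, C)
    log_joint -= log_joint.max(axis=1, keepdims=True)    # numerical stability
    R = np.exp(log_joint)
    return R / R.sum(axis=1, keepdims=True)

def semisupervised_em(X_l, y_l, X_u, n_classes, n_iter=20):
    # Hard responsibilities for labeled documents, held fixed throughout.
    R_l = np.zeros((X_l.shape[0], n_classes))
    R_l[np.arange(X_l.shape[0]), y_l] = 1.0
    params = m_step(X_l, R_l)                            # init from labeled data only
    for _ in range(n_iter):
        R_u = e_step(X_u, *params)                       # E: probabilistically label
        params = m_step(np.vstack([X_l, X_u]),           # M: refit on all documents
                        np.vstack([R_l, R_u]))
    return params
```

Classification of a new count-vectorized document is then `e_step(x, *params).argmax()`. Initializing from the labeled data alone, rather than randomly, mirrors the standard setup in which EM refines a supervised naive Bayes starting point.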