Document classification by topic labeling
Swapnil Hingmire, S. Chougule, Girish Keshav Palshikar, Sutanu Chakraborti
Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2013), July 28, 2013. DOI: 10.1145/2484028.2484140. Cited by 76.
In this paper, we propose a Latent Dirichlet Allocation (LDA) [1] based document classification algorithm which does not require any labeled dataset. In our algorithm, we construct a topic model using LDA, assign each topic to one of the class labels, aggregate all topics with the same class label into a single topic using the aggregation property of the Dirichlet distribution, and then automatically assign a class label to each unlabeled document depending on its "closeness" to one of the aggregated topics. We also present an extension of our algorithm that combines the Expectation-Maximization (EM) algorithm with a naive Bayes classifier. We show the effectiveness of our algorithm on three real-world datasets.
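A minimal sketch of the labeling-by-aggregation idea described above, assuming scikit-learn's LatentDirichletAllocation, a toy corpus, and a hand-picked topic-to-label mapping (all of these are illustrative assumptions, not the authors' implementation):

```python
# Illustrative sketch only: the corpus and the topic_to_label mapping below are
# hypothetical; in the paper the mapping comes from a human inspecting each topic.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy unlabeled corpus (placeholder documents).
docs = [
    "the team won the match after a late goal",
    "parliament passed the new budget bill",
    "the striker scored twice in the final",
    "the senator debated the proposed law",
]

# 1. Build a topic model with LDA over the unlabeled documents.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topic = lda.fit_transform(X)  # per-document topic distributions (rows sum to 1)

# 2. Assign each topic to a class label after inspecting its top words
#    (hypothetical mapping chosen by an annotator).
topic_to_label = {0: "sports", 1: "politics", 2: "sports", 3: "politics"}
labels = sorted(set(topic_to_label.values()))

# 3. Aggregate same-label topics: by the aggregation property of the Dirichlet
#    distribution, summing the probability mass of topics that share a label
#    yields each document's probability of belonging to that label.
label_scores = np.zeros((doc_topic.shape[0], len(labels)))
for topic, label in topic_to_label.items():
    label_scores[:, labels.index(label)] += doc_topic[:, topic]

# 4. Assign each document to the "closest" aggregated topic, i.e. the label
#    with the largest aggregated probability.
predicted = [labels[i] for i in label_scores.argmax(axis=1)]
print(list(zip(docs, predicted)))
```

The EM/naive Bayes extension mentioned in the abstract would then take these automatically assigned labels as an initial (noisy) labeling and iteratively refine a naive Bayes classifier over the unlabeled documents; that step is not shown in the sketch.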