{"title":"上下文熵与文本分类","authors":"Moises Garcia, H. Hidalgo, Edgar Chávez","doi":"10.1109/LA-WEB.2006.11","DOIUrl":null,"url":null,"abstract":"In this paper we describe a new approach to text categorization, our focus is in the amount of information (the entropy) in the text. The entropy is computed with the empirical distribution of words in the text. We provide the system with a manually segmented collection of documents in different categories. For each category a separate empirical distribution of words is computed, we use this empirical distribution for categorization purposes. If we compute the entropy of the test document for each empirical distribution the correct category shows as a maximum. For example, if we compute the entropy of a sports document using the politics or the sports empirical word distributions then the computed entropy is higher in sports than in politics. Our text categorization approach is simple, easy to code and needs no training time (aside from histogram computations). The classification time is linear on the size of the document and the number of document categories. We support our claims with extensive experimentation","PeriodicalId":339667,"journal":{"name":"2006 Fourth Latin American Web Congress","volume":"96 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Contextual Entropy and Text Categorization\",\"authors\":\"Moises Garcia, H. Hidalgo, Edgar Chávez\",\"doi\":\"10.1109/LA-WEB.2006.11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper we describe a new approach to text categorization, our focus is in the amount of information (the entropy) in the text. The entropy is computed with the empirical distribution of words in the text. We provide the system with a manually segmented collection of documents in different categories. For each category a separate empirical distribution of words is computed, we use this empirical distribution for categorization purposes. If we compute the entropy of the test document for each empirical distribution the correct category shows as a maximum. For example, if we compute the entropy of a sports document using the politics or the sports empirical word distributions then the computed entropy is higher in sports than in politics. Our text categorization approach is simple, easy to code and needs no training time (aside from histogram computations). The classification time is linear on the size of the document and the number of document categories. 
We support our claims with extensive experimentation\",\"PeriodicalId\":339667,\"journal\":{\"name\":\"2006 Fourth Latin American Web Congress\",\"volume\":\"96 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2006 Fourth Latin American Web Congress\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/LA-WEB.2006.11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 Fourth Latin American Web Congress","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/LA-WEB.2006.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In this paper we describe a new approach to text categorization; our focus is on the amount of information (the entropy) in the text. The entropy is computed from the empirical distribution of words in the text. We provide the system with a manually segmented collection of documents in different categories. For each category a separate empirical distribution of words is computed, and we use this empirical distribution for categorization. If we compute the entropy of a test document under each empirical distribution, the correct category shows up as a maximum. For example, if we compute the entropy of a sports document using both the politics and the sports empirical word distributions, the computed entropy is higher under the sports distribution than under the politics one. Our text categorization approach is simple, easy to code, and needs no training time aside from histogram computation. The classification time is linear in the size of the document and the number of document categories. We support our claims with extensive experimentation.
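The abstract does not state the exact scoring formula, so the following is a minimal sketch of one plausible reading: each word of the test document contributes its entropy term -p log2(p) under a category's empirical distribution, and the predicted category is the one where this sum is maximal, as the abstract describes. The function names (category_histogram, contextual_entropy, classify), the whitespace tokenization, and the choice to skip unseen words are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def category_histogram(documents):
    """Empirical word distribution for one category, built from its
    manually segmented training documents (the 'histogram computation'
    that replaces training in this approach)."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def contextual_entropy(document, distribution):
    """Sum the per-word entropy contributions -p * log2(p) of the
    document's words under a category's empirical distribution.
    Words unseen in the category contribute nothing here; the paper
    may handle them differently (e.g., smoothing)."""
    score = 0.0
    for word in set(document.lower().split()):
        p = distribution.get(word)
        if p:
            score += -p * math.log2(p)
    return score

def classify(document, category_distributions):
    """Per the abstract, the correct category shows up as a maximum,
    so return the argmax over categories."""
    return max(category_distributions,
               key=lambda c: contextual_entropy(document, category_distributions[c]))

# Toy usage with two hand-made categories and a held-out snippet.
dists = {
    "sports": category_histogram(["the team won the match",
                                  "goal scored in the final"]),
    "politics": category_histogram(["the senate passed the bill",
                                    "votes in the election"]),
}
print(classify("the team scored a goal", dists))  # expected: "sports"
```

Note that the loop structure mirrors the claimed cost: scoring visits each word of the test document once per category, so classification time is linear in the document size times the number of categories.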