{"title":"上下文熵与文本分类","authors":"Moises Garcia, H. Hidalgo, Edgar Chávez","doi":"10.1109/LA-WEB.2006.11","DOIUrl":null,"url":null,"abstract":"In this paper we describe a new approach to text categorization, our focus is in the amount of information (the entropy) in the text. The entropy is computed with the empirical distribution of words in the text. We provide the system with a manually segmented collection of documents in different categories. For each category a separate empirical distribution of words is computed, we use this empirical distribution for categorization purposes. If we compute the entropy of the test document for each empirical distribution the correct category shows as a maximum. For example, if we compute the entropy of a sports document using the politics or the sports empirical word distributions then the computed entropy is higher in sports than in politics. Our text categorization approach is simple, easy to code and needs no training time (aside from histogram computations). The classification time is linear on the size of the document and the number of document categories. We support our claims with extensive experimentation","PeriodicalId":339667,"journal":{"name":"2006 Fourth Latin American Web Congress","volume":"96 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Contextual Entropy and Text Categorization\",\"authors\":\"Moises Garcia, H. Hidalgo, Edgar Chávez\",\"doi\":\"10.1109/LA-WEB.2006.11\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper we describe a new approach to text categorization, our focus is in the amount of information (the entropy) in the text. The entropy is computed with the empirical distribution of words in the text. We provide the system with a manually segmented collection of documents in different categories. For each category a separate empirical distribution of words is computed, we use this empirical distribution for categorization purposes. If we compute the entropy of the test document for each empirical distribution the correct category shows as a maximum. For example, if we compute the entropy of a sports document using the politics or the sports empirical word distributions then the computed entropy is higher in sports than in politics. Our text categorization approach is simple, easy to code and needs no training time (aside from histogram computations). The classification time is linear on the size of the document and the number of document categories. 
We support our claims with extensive experimentation\",\"PeriodicalId\":339667,\"journal\":{\"name\":\"2006 Fourth Latin American Web Congress\",\"volume\":\"96 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2006 Fourth Latin American Web Congress\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/LA-WEB.2006.11\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2006 Fourth Latin American Web Congress","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/LA-WEB.2006.11","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In this paper we describe a new approach to text categorization; our focus is on the amount of information (the entropy) in the text. The entropy is computed from the empirical distribution of words in the text. We provide the system with a manually segmented collection of documents in different categories. For each category a separate empirical distribution of words is computed, and we use this empirical distribution for categorization. If we compute the entropy of a test document under each empirical distribution, the correct category shows up as a maximum. For example, if we compute the entropy of a sports document using both the politics and the sports empirical word distributions, the computed entropy is higher under the sports distribution than under the politics one. Our text categorization approach is simple, easy to code, and needs no training time aside from histogram computation. The classification time is linear in the size of the document and the number of document categories. We support our claims with extensive experimentation.
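The abstract does not state the exact scoring formula, so the following is a minimal sketch of one plausible reading: each word of the test document contributes its entropy term -p log2(p) under a category's empirical distribution, and the predicted category is the one where this sum is maximal, as the abstract describes. The function names (category_histogram, contextual_entropy, classify), the whitespace tokenization, and the choice to skip unseen words are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter

def category_histogram(documents):
    """Empirical word distribution for one category, built from its
    manually segmented training documents (the 'histogram computation'
    that replaces training in this approach)."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

def contextual_entropy(document, distribution):
    """Sum the per-word entropy contributions -p * log2(p) of the
    document's words under a category's empirical distribution.
    Words unseen in the category contribute nothing here; the paper
    may handle them differently (e.g., smoothing)."""
    score = 0.0
    for word in set(document.lower().split()):
        p = distribution.get(word)
        if p:
            score += -p * math.log2(p)
    return score

def classify(document, category_distributions):
    """Per the abstract, the correct category shows up as a maximum,
    so return the argmax over categories."""
    return max(category_distributions,
               key=lambda c: contextual_entropy(document, category_distributions[c]))

# Toy usage with two hand-made categories and a held-out snippet.
dists = {
    "sports": category_histogram(["the team won the match",
                                  "goal scored in the final"]),
    "politics": category_histogram(["the senate passed the bill",
                                    "votes in the election"]),
}
print(classify("the team scored a goal", dists))  # expected: "sports"
```

Note that the loop structure mirrors the claimed cost: scoring visits each word of the test document once per category, so classification time is linear in the document size times the number of categories.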