Classification Based on Specific Vocabulary
J. Savoy, Olena Zubaryeva
2011 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, 2011-08-22
DOI: 10.1109/WI-IAT.2011.19 (https://doi.org/10.1109/WI-IAT.2011.19)
Assuming a binomial distribution for word occurrence, we propose computing a standardized Z score to define the specific vocabulary of a subset compared to that of the entire corpus. This approach is applied to weight the terms characterizing a document (or a sample of texts). We then show how these Z score values can be used to derive an efficient categorization scheme. To evaluate this proposal, we categorize speeches given by B. Obama as either electoral or presidential. Under 10-fold cross-validation, the results tend to show that the suggested classification scheme performs better than both a Support Vector Machine scheme and a Naive Bayes classifier.
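The Z score idea can be illustrated with a minimal Python sketch. It assumes the usual binomial standardization: a term with corpus frequency tf_corpus among n_corpus tokens has estimated probability p = tf_corpus / n_corpus, so in a subset of n_sub tokens its expected count is n_sub * p with variance n_sub * p * (1 - p), giving Z = (tf_sub - n_sub * p) / sqrt(n_sub * p * (1 - p)). The abstract does not spell out the exact estimator or the decision rule, so the function names `z_scores` and `classify`, the "sum of Z scores" rule, and the toy tokens below are all assumptions made for illustration, not the authors' implementation or data.

```python
import math
from collections import Counter


def z_scores(subset_tokens, corpus_tokens):
    """Standardized Z score of every corpus term within a subset.

    Assumes a binomial model: p = tf_corpus / n_corpus, expected count
    n_sub * p, variance n_sub * p * (1 - p).
    """
    sub_tf = Counter(subset_tokens)
    corpus_tf = Counter(corpus_tokens)
    n_sub = len(subset_tokens)
    n_corpus = len(corpus_tokens)
    scores = {}
    for term, tf in corpus_tf.items():
        p = tf / n_corpus
        sd = math.sqrt(n_sub * p * (1.0 - p))
        if sd == 0.0:
            # Degenerate case (empty subset or single-term corpus).
            scores[term] = 0.0
        else:
            scores[term] = (sub_tf[term] - n_sub * p) / sd
    return scores


def classify(doc_tokens, z_by_category):
    """Hypothetical decision rule: pick the category whose Z scores,
    summed over the document's tokens, are highest."""
    totals = {
        category: sum(z.get(token, 0.0) for token in doc_tokens)
        for category, z in z_by_category.items()
    }
    return max(totals, key=totals.get)


# Invented toy tokens, not taken from the speeches used in the paper.
electoral = "change hope vote change campaign hope".split()
presidential = "congress budget policy congress law budget".split()
corpus = electoral + presidential

z_by_category = {
    "electoral": z_scores(electoral, corpus),
    "presidential": z_scores(presidential, corpus),
}

print(z_by_category["electoral"]["change"])
print(classify(["hope", "vote"], z_by_category))
```

Terms with a large positive Z score are over-represented in the subset and can be read as its specific vocabulary; large negative values mark under-used terms. Summing the scores is only one simple way to turn these weights into a category decision.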