Supervised term weighting for sentiment analysis

Proceedings of 2011 IEEE International Conference on Intelligence and Security Informatics. Pub Date: 2011-07-10. DOI: 10.1109/ISI.2011.5984056

Tam T. Nguyen, Kuiyu Chang, S. Hui


Citations: 12

Abstract



Vector space text classification is commonly used in intelligence applications such as email and conversation analysis. In this paper we propose a supervised term weighting scheme called tƒ × KL (term frequency Kullback-Leibler), which weights each word in proportion to the ratio of its document frequencies in the positive and negative classes. We then generalize tƒ × KL to deal effectively with class imbalance, which is very common in real-world intelligence analysis. The generalized tƒ × KL weights each word according to the ratio of its positive and negative class-conditioned word probabilities instead of the raw document frequencies. Results on four classification datasets show that tƒ × KL performs consistently better than the baseline tƒ × idƒ and four other supervised term weighting schemes, including the recently proposed tƒ × rƒ (term frequency relevance frequency). The generalized tƒ × KL was found to be extremely robust in dealing with highly skewed class distributions, beating the second runner-up by more than 20% on a dataset that has only 10% positive training examples. The generalized tƒ × KL is thus an effective and robust term weighting scheme that can significantly improve binary classification performance in sentiment analysis and intelligence applications.
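The abstract does not give the weighting formula, so the following Python sketch only illustrates the contrast it describes: a weight built from the ratio of raw per-class document frequencies versus one built from per-class proportions. The log-ratio form, the additive smoothing, and every name here (`class_doc_freq`, `log_ratio_weights`, `weight_document`, `smooth`, `normalize`) are assumptions made for illustration, not the authors' definition.

```python
import math
from collections import Counter


def class_doc_freq(docs):
    """Count, for each term, how many documents in `docs` contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # each document counts a term at most once
    return df


def log_ratio_weights(pos_docs, neg_docs, normalize=True, smooth=1.0):
    """Log ratio of positive-class to negative-class term statistics.

    normalize=False: ratio of smoothed raw document frequencies.
    normalize=True:  ratio of smoothed per-class proportions, so a class
                     with many more documents does not dominate.
    Both the log form and the smoothing are assumptions of this sketch.
    """
    pos_df = class_doc_freq(pos_docs)
    neg_df = class_doc_freq(neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = {}
    for term in set(pos_df) | set(neg_df):
        p = pos_df[term] + smooth
        q = neg_df[term] + smooth
        if normalize:
            p /= n_pos + 2 * smooth
            q /= n_neg + 2 * smooth
        weights[term] = math.log(p / q)
    return weights


def weight_document(tokens, weights):
    """Multiply each term's in-document frequency by its class weight."""
    tf = Counter(tokens)
    return {term: tf[term] * weights.get(term, 0.0) for term in tf}


# Toy corpus with 20% positive documents.
pos_docs = [["good", "film"], ["good", "plot"]]
neg_docs = [["bad", "film"]] * 4 + [["bad", "plot"]] * 4

raw = log_ratio_weights(pos_docs, neg_docs, normalize=False)
normalized = log_ratio_weights(pos_docs, neg_docs, normalize=True)
vector = weight_document(["good", "good", "film"], normalized)
```

In this toy corpus "film" occurs in half of the documents of each class. The raw-frequency ratio still gives it a negative weight, purely because the negative class is larger, while the normalized variant gives it a weight of zero. That is the kind of imbalance effect the generalized scheme is described as addressing.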
