Entity set expansion in opinion documents

HT ... : the proceedings of the ... ACM Conference on Hypertext and Social Media
Pub Date: 2011-06-06
DOI: 10.1145/1995966.1996002

Lei Zhang, B. Liu

Pages: 281-290

Citations: 32

Abstract

Opinion mining has been an active research area in recent years. The task is to extract opinions expressed on entities and their attributes. For example, the sentence "I love the picture quality of Sony cameras" expresses a positive opinion on the picture quality attribute of Sony cameras; Sony is the entity. This paper focuses on mining entities (e.g., Sony). This is an important problem because, without knowing the entity, the extracted opinion is of little use. The problem is similar to the classic named entity recognition problem, but there is a major difference. In a typical opinion mining application, the user wants to find opinions on some competing entities, e.g., competing or relevant products. However, the user can often provide only a few names because there are too many of them, and the system has to find the rest from a corpus. This implies that the discovered entities must be of the same type/class as the given names. This is the set expansion problem. Classic methods for solving the problem are based on distributional similarity. However, we found these methods to be inaccurate. We then employ a learning-based method called Bayesian Sets. However, directly applying Bayesian Sets produces poor results. We then propose a more sophisticated way to use Bayesian Sets. This method, however, raises two major problems: entity ranking and feature sparseness. For entity ranking, we propose a re-ranking method. For feature sparseness, we propose two methods to re-weight features and to determine the quality of features. These methods improve the mining results substantially. Additionally, like any learning algorithm, Bayesian Sets requires the user to engineer a set of features. We design some generic features based on part-of-speech tags of words, so features do not need to be engineered for each specific domain. Experimental results on 10 real-life datasets from diverse domains demonstrate the effectiveness of the proposed technique.
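To make the Bayesian Sets step concrete, the following is a minimal sketch of the standard Bayesian Sets score for binary features. It illustrates the general algorithm the abstract names, not the authors' refined procedure. The candidate names, the four context features, the `kappa` prior strength, and the function names are all hypothetical choices made for illustration; the paper's part-of-speech features, re-ranking, and feature re-weighting are not reproduced here.

```python
import math


def bayesian_sets_scores(X, seed_idx, kappa=2.0):
    """Return the log Bayesian Sets score of every row of binary matrix X.

    X        : list of rows, each a list of 0/1 feature values (assumed layout)
    seed_idx : row indices of the user-supplied seed entities
    kappa    : prior strength (an assumed default, not taken from the paper)
    """
    n_items, n_feats, n_seeds = len(X), len(X[0]), len(seed_idx)
    const = 0.0
    weights = [0.0] * n_feats
    for j in range(n_feats):
        # The feature mean over all candidates sets the Beta prior.
        mean = sum(row[j] for row in X) / n_items
        if mean <= 0.0 or mean >= 1.0:
            continue  # constant feature: cannot discriminate, so skip it
        alpha, beta = kappa * mean, kappa * (1.0 - mean)
        seed_count = sum(X[i][j] for i in seed_idx)
        # Posterior Beta parameters after observing the seed set.
        alpha_post = alpha + seed_count
        beta_post = beta + n_seeds - seed_count
        const += (math.log(alpha + beta) - math.log(alpha + beta + n_seeds)
                  + math.log(beta_post) - math.log(beta))
        weights[j] = (math.log(alpha_post) - math.log(alpha)
                      - math.log(beta_post) + math.log(beta))
    # The log score is linear in the candidate's feature vector.
    return [const + sum(w * x for w, x in zip(weights, row)) for row in X]


def expand(names, X, seeds):
    """Rank the non-seed candidates by their score, best first."""
    seed_idx = [names.index(s) for s in seeds]
    scores = bayesian_sets_scores(X, seed_idx)
    rest = [i for i in range(len(names)) if i not in seed_idx]
    return [names[i] for i in sorted(rest, key=lambda i: scores[i], reverse=True)]


# Toy data. Hypothetical binary context features, in column order:
# followed by "camera", preceded by "from", followed by "quality", preceded by "the".
names = ["Sony", "Canon", "Nikon", "picture", "battery"]
X = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
ranked = expand(names, X, seeds=["Sony", "Canon"])
print(ranked)
```

With Sony and Canon as seeds, a candidate that shares the seeds' contexts (Nikon) scores higher than attribute words such as "picture" or "battery", which is the behaviour set expansion relies on.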
