Google based name search: Resolving mixed entities on the web

2009 Fourth International Conference on Digital Information Management. Pub Date: 2009-12-18. DOI: 10.1109/ICDIM.2009.5356763

Byung-Won On, Ingyu Lee


Citations: 3

Abstract



Google based name search: Resolving mixed entities on the web

When non-unique values are used as identifiers of entities, homonyms can cause confusion. In particular, when only part of an entity's name is used as its identifier, the problem is often referred to as the mixed entity resolution problem, where the goal is to sort out entities that are wrongly mixed together because of name homonyms (e.g., if only the last name is used as an identifier, one cannot distinguish “Vannevar Bush” from “George Bush”). The mixed entity resolution problem is especially common in Web data. For instance, a Google search for a product name (e.g., Oracle) returns a mixture of web pages because of name homonyms (e.g., Oracle Database, Oracle Audio, Oracle Academy, etc.). In this paper, we present a practical system for resolving such mixed entities on the Web. To develop such a system, we propose a web-service-based interface, an unsupervised clustering scheme, and cluster ranking algorithms. In particular, since the correct number of clusters is often unknown, we study a state-of-the-art unsupervised clustering solution based on propagation of pairwise similarities of entities. Our claim is empirically validated via experimentation, showing that our approach outperforms the main competing solution.
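To make the clustering and ranking steps concrete, here is a minimal sketch of the general idea and not the authors' algorithm, which the abstract does not specify. It swaps in a much simpler technique: Jaccard similarity between result titles, connected components over a similarity-threshold graph (so similarity "chains" from pair to pair and the number of clusters is not fixed in advance), and ranking of clusters by size. The function names, the sample titles, and the threshold of 0.2 are all hypothetical choices made for illustration.

```python
from itertools import combinations


def tokens(text):
    """Return the set of lowercase words in a piece of text."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_snippets(snippets, query, threshold=0.2):
    """Group snippets linked by chains of pairwise similarity >= threshold.

    Returns a list of clusters (lists of snippet indices), largest first.
    The threshold is an illustrative assumption, not a tuned value.
    """
    # Drop the query words: every result contains them, so they carry no signal.
    query_words = tokens(query)
    bags = [tokens(s) - query_words for s in snippets]

    # Union-find over snippet indices.
    parent = list(range(len(snippets)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Merge any two snippets that are similar enough; merges are transitive.
    for i, j in combinations(range(len(snippets)), 2):
        if jaccard(bags[i], bags[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(snippets)):
        groups.setdefault(find(i), []).append(i)

    # Rank clusters: largest first, ties broken by earliest result position.
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))


# Hypothetical result titles for the query "Oracle".
snippets = [
    "Oracle Database SQL tuning guide",
    "Oracle Audio turntable review",
    "Oracle Database backup and SQL recovery",
    "Oracle Academy student courses",
    "Oracle Audio turntable cartridge setup",
    "Oracle Database SQL release notes",
]

clusters = cluster_snippets(snippets, "Oracle")
for rank, members in enumerate(clusters, start=1):
    print(rank, [snippets[i] for i in members])
```

A threshold graph is used here only because it needs no preset number of clusters, which mirrors the constraint the abstract mentions; a real system would likely need a more robust similarity measure and a better-founded propagation scheme.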

