Applying site information to information retrieval from the Web

Proceedings of the Third International Conference on Web Information Systems Engineering, 2002. WISE 2002. Pub Date: 2002-12-12. DOI: 10.1109/WISE.2002.1181646

Yasuhito Asano, H. Imai, Masashi Toyoda, M. Kitsuregawa

Citations: 10

Abstract

In recent years, several information retrieval methods that use Web-link information, such as HITS and trawling, have been developed. To analyze Web-links for information retrieval by dividing them into links inside a Web site (local-links) and links between Web sites (global-links), a proper model of the Web site is required. Existing research uses the Web server as the model of a Web site. This works relatively well when a Web site corresponds to a server, as is the case for public Web sites, but poorly when multiple Web sites share a server, as is the case for private Web sites on rental Web servers. We propose a new model of the Web site, the "directory-based site", to handle typical private sites, and a method to identify such sites using information about URLs and Web-links. In computational experiments on jp-domain URL and Web-link data containing over 23 million URLs and 100 million Web-links, collected from July to August 2000 by Toyoda and Kitsuregawa, we verify that the method can approximately identify, for 66% of over 110,000 servers, whether each server has multiple directory-based sites, and we extract over 500,000 directory-based sites and 4 million global-links. We also propose a new framework for Web-link based information retrieval that uses directory-based sites and global-links instead of Web pages and all Web-links, respectively, and examine its effectiveness by comparing the result of trawling on our framework with that on the existing framework.
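The local-link/global-link distinction can be illustrated with a small sketch. It is not the authors' identification method, which the abstract does not spell out. It assumes a simplified rule in which, on a host already known to hold several sites, the first directory of the URL path names the site. The function names, the example URLs, and the `multi_site_hosts` set are all hypothetical.

```python
from typing import Callable, Iterable, List, Set, Tuple
from urllib.parse import urlsplit

Link = Tuple[str, str]  # (source URL, destination URL)


def server_site(url: str) -> str:
    """Server-based model: the site is identified by the host name."""
    return urlsplit(url).netloc.lower()


def directory_site(url: str, multi_site_hosts: Set[str]) -> str:
    """Simplified directory-based model (an assumption for illustration):
    on hosts flagged as holding several sites, the site is the host plus
    the first directory of the path; elsewhere it is the host alone."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in multi_site_hosts:
        return host
    segments = [s for s in parts.path.split("/") if s]
    # Treat the last segment as a file name unless the path ends with "/".
    dirs = segments if parts.path.endswith("/") else segments[:-1]
    return f"{host}/{dirs[0]}" if dirs else host


def classify_links(
    links: Iterable[Link], site_of: Callable[[str], str]
) -> Tuple[List[Link], List[Link]]:
    """Split links into local (same site) and global (different sites)."""
    local: List[Link] = []
    global_: List[Link] = []
    for src, dst in links:
        (local if site_of(src) == site_of(dst) else global_).append((src, dst))
    return local, global_


# Hypothetical example data: two private sites on one rental server.
links = [
    ("http://rental.example.jp/~alice/index.html",
     "http://rental.example.jp/~bob/index.html"),
    ("http://rental.example.jp/~alice/index.html",
     "http://rental.example.jp/~alice/photos/a.html"),
    ("http://www.example.co.jp/index.html",
     "http://rental.example.jp/~bob/index.html"),
]
multi_site_hosts = {"rental.example.jp"}

server_local, server_global = classify_links(links, server_site)
dir_local, dir_global = classify_links(
    links, lambda u: directory_site(u, multi_site_hosts)
)
print(len(server_local), len(server_global))  # 2 1
print(len(dir_local), len(dir_global))        # 1 2
```

Under the server-based model the link from `~alice` to `~bob` counts as local because both pages share a host. Under the simplified directory-based rule it becomes a global-link, which is the kind of link the proposed framework retains.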
