A new study on using HTML structures to improve retrieval

Proceedings 11th International Conference on Tools with Artificial Intelligence. Pub Date: 1999-11-08. DOI: 10.1109/TAI.1999.809831

M. Cutler, H. Deng, S. Maniccam, W. Meng


Citations: 44

Abstract

Locating useful information effectively from the World Wide Web (WWW) is of wide interest. This paper presents new results on a methodology that uses the structures and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. The methodology partitions the occurrences of terms in a document collection into classes according to the tags in which a particular term appears (such as Title, H1-H6, and Anchor). The rationale is that terms appearing in different structures of a document may have different significance in identifying the document. The weighting schemes of traditional information retrieval were extended to include class importance values. We implemented a genetic algorithm to determine a "best so far" combination of class importance factors. Our experiments indicate that with this technique retrieval effectiveness can be improved by 39.6% or more.
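The class-based weighting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tag-to-class mapping `TAG_CLASS`, the class names, the helper functions, and the example importance values are all assumptions made for illustration; the abstract does not give the actual classes or values. The sketch also counts anchor text in the document that contains the link, a simplification that may differ from the paper's treatment of the Anchor class.

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to term classes (an assumption,
# not the paper's exact class definitions).
TAG_CLASS = {"title": "title", "a": "anchor"}
TAG_CLASS.update({f"h{i}": "header" for i in range(1, 7)})


class ClassTermCounter(HTMLParser):
    """Counts term occurrences per class, based on the enclosing tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                      # open tags that map to a class
        self.counts = defaultdict(Counter)   # class -> term -> occurrences

    def handle_starttag(self, tag, attrs):
        if tag in TAG_CLASS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Close the most recent matching tag and anything nested in it.
            idx = len(self.stack) - 1 - self.stack[::-1].index(tag)
            del self.stack[idx:]

    def handle_data(self, data):
        # Text outside any mapped tag falls into the "plain" class.
        cls = TAG_CLASS[self.stack[-1]] if self.stack else "plain"
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.counts[cls][term] += 1


def class_term_counts(html):
    """Return per-class term counts for one HTML document."""
    parser = ClassTermCounter()
    parser.feed(html)
    parser.close()
    return parser.counts


def weighted_tf(counts, civ):
    """Combine per-class counts into one term frequency, scaling each
    class by its importance value (classes missing from civ weigh 1.0)."""
    tf = Counter()
    for cls, terms in counts.items():
        weight = civ.get(cls, 1.0)
        for term, n in terms.items():
            tf[term] += weight * n
    return tf
```

With illustrative importance values such as `{"plain": 1.0, "title": 4.0, "header": 2.0, "anchor": 3.0}`, a term that occurs once in the title contributes four times as much to the document's term frequency as the same term in plain body text. A search procedure such as the genetic algorithm mentioned in the abstract would then vary these values and keep the combination that gives the best retrieval effectiveness on a set of test queries.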
