{"title":"使用HTML结构改进检索的新研究","authors":"M. Cutler, H. Deng, S. Maniccam, W. Meng","doi":"10.1109/TAI.1999.809831","DOIUrl":null,"url":null,"abstract":"Locating useful information effectively form the World Wide Web (WWW) is of wide interest. This paper presents new results on a methodology of using the structures and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. This methodology partitions the occurrences of terms in a document collection into classes according to the tags in which a particular term appears (such as Title, H1-H6, and Anchor). The rationale is that terms appearing in different structures of a document may have different significance in identifying the document. The weighting schemes of traditional information retrieval were extended to include class importance values. We implemented a genetic algorithm to determine a \"best so far\" class importance factor combination. Our experiments indicate that using this technique the retrieval effectiveness can be improved by 39.6% or higher.","PeriodicalId":194023,"journal":{"name":"Proceedings 11th International Conference on Tools with Artificial Intelligence","volume":"203 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1999-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"44","resultStr":"{\"title\":\"A new study on using HTML structures to improve retrieval\",\"authors\":\"M. Cutler, H. Deng, S. Maniccam, W. Meng\",\"doi\":\"10.1109/TAI.1999.809831\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Locating useful information effectively form the World Wide Web (WWW) is of wide interest. This paper presents new results on a methodology of using the structures and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. This methodology partitions the occurrences of terms in a document collection into classes according to the tags in which a particular term appears (such as Title, H1-H6, and Anchor). The rationale is that terms appearing in different structures of a document may have different significance in identifying the document. The weighting schemes of traditional information retrieval were extended to include class importance values. We implemented a genetic algorithm to determine a \\\"best so far\\\" class importance factor combination. Our experiments indicate that using this technique the retrieval effectiveness can be improved by 39.6% or higher.\",\"PeriodicalId\":194023,\"journal\":{\"name\":\"Proceedings 11th International Conference on Tools with Artificial Intelligence\",\"volume\":\"203 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1999-11-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"44\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings 11th International Conference on Tools with Artificial Intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TAI.1999.809831\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 11th International Conference on Tools with Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TAI.1999.809831","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A new study on using HTML structures to improve retrieval
Locating useful information effectively form the World Wide Web (WWW) is of wide interest. This paper presents new results on a methodology of using the structures and hyperlinks of HTML documents to improve the effectiveness of retrieving HTML documents. This methodology partitions the occurrences of terms in a document collection into classes according to the tags in which a particular term appears (such as Title, H1-H6, and Anchor). The rationale is that terms appearing in different structures of a document may have different significance in identifying the document. The weighting schemes of traditional information retrieval were extended to include class importance values. We implemented a genetic algorithm to determine a "best so far" class importance factor combination. Our experiments indicate that using this technique the retrieval effectiveness can be improved by 39.6% or higher.