Heading-Aware Proximity Measure and Its Application to Web Search

Q4 Computer Science
Tomohiro Manabe, Keishi Tajima
{"title":"Heading-Aware Proximity Measure and Its Application to Web Search","authors":"Tomohiro Manabe, Keishi Tajima","doi":"10.11185/IMT.11.154","DOIUrl":null,"url":null,"abstract":"Proximity of query keyword occurrences is one important evidence which is useful for effective querybiased document scoring. If a query keyword occurs close to another in a document, it suggests high relevance of the document to the query. The simplest way to measure proximity between keyword occurrences is to use distance between them, i.e., difference of their positions. However, most web pages contain hierarchical structure composed of nested logical blocks with their headings, and it affects logical proximity. For example, if a keyword occurs in a block and another occurs in the heading of the block, we should not simply measure their proximity by their distance. This is because a heading describes the topic of the entire corresponding block, and term occurrences in a heading are strongly connected with any term occurrences in its associated block with less regard for the distance between them. Based on these observations, we developed a heading-aware proximity measure and applied it to three existing proximity-aware document scoring methods: MinDist, P6, and Span. We evaluated these existing methods and our modified methods on the data sets from TREC web tracks. The results indicate that our heading-aware proximity measure is better than the simple distance in all cases, and the method combining it with the Span method achieved the best performance.","PeriodicalId":16243,"journal":{"name":"Journal of Information Processing","volume":"11 1","pages":"154-159"},"PeriodicalIF":0.0000,"publicationDate":"2016-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.11185/IMT.11.154","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"Computer Science","Score":null,"Total":0}
引用次数: 0

Abstract

Proximity of query keyword occurrences is one important evidence which is useful for effective querybiased document scoring. If a query keyword occurs close to another in a document, it suggests high relevance of the document to the query. The simplest way to measure proximity between keyword occurrences is to use distance between them, i.e., difference of their positions. However, most web pages contain hierarchical structure composed of nested logical blocks with their headings, and it affects logical proximity. For example, if a keyword occurs in a block and another occurs in the heading of the block, we should not simply measure their proximity by their distance. This is because a heading describes the topic of the entire corresponding block, and term occurrences in a heading are strongly connected with any term occurrences in its associated block with less regard for the distance between them. Based on these observations, we developed a heading-aware proximity measure and applied it to three existing proximity-aware document scoring methods: MinDist, P6, and Span. We evaluated these existing methods and our modified methods on the data sets from TREC web tracks. The results indicate that our heading-aware proximity measure is better than the simple distance in all cases, and the method combining it with the Span method achieved the best performance.
标题感知接近测度及其在网络搜索中的应用
查询关键字出现的接近度是一个重要的证据,它对有效的查询偏向文档评分很有用。如果查询关键字在文档中与另一个关键字出现得很近,则表明该文档与查询具有很高的相关性。衡量关键字出现之间的接近程度的最简单方法是使用它们之间的距离,即它们位置的差异。然而,大多数网页包含由嵌套的逻辑块和标题组成的层次结构,这影响了逻辑接近性。例如,如果一个关键字出现在一个块中,另一个出现在块的标题中,我们不应该简单地通过它们的距离来衡量它们的接近程度。这是因为标题描述了整个相应块的主题,标题中的术语出现与相关块中的任何术语出现紧密相连,而不太关心它们之间的距离。基于这些观察,我们开发了一个标题感知接近度量,并将其应用于三种现有的接近感知文档评分方法:MinDist、P6和Span。我们在TREC网络轨道数据集上对这些现有方法和改进后的方法进行了评价。结果表明,在所有情况下,我们的标题感知接近度量都优于简单的距离度量,并且将其与Span方法相结合的方法获得了最好的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Journal of Information Processing
Journal of Information Processing Computer Science-Computer Science (all)
CiteScore
1.20
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信