New document scoring model based on interval tree

Q3 Computer Science
Zeyu Xiong, Yijie Wang
Journal of Visual Languages and Computing, vol. 45, pp. 39-43. Published 2018-04-01. DOI: 10.1016/j.jvlc.2018.01.003
https://www.sciencedirect.com/science/article/pii/S1045926X17302811
Citations: 1

Abstract


Classical BM25 scoring is designed for unstructured documents. In recent years, researchers have tried to adapt the BM25 ranking formula to structured documents. Most work on structured document retrieval combines per-field scores, but it is hard to determine the field weights before the document score has been formed. We aim to establish a new method for determining the field weights. The motivation comes from two observations. First, the construction of an interval tree reflects retrieval results with higher-order proximity for a text field. As a matter of writing style, the sentences or phrases that carry the main idea frequently appear in the front or rear part of a text field, so proximity should be scored differently in different parts of the field. We therefore apply a higher factor when calculating proximity scores in the front and rear parts than in the middle part. Second, the longer an interval containing the query terms is, the lower its proximity score; the tf value of a term appearing in an interval should therefore also affect the computation of the proximity score. On this basis, we develop a new method for calculating the field weights from the ranking score. The ranking score of each field can be calculated with an interval tree built on term relevance. The interval tree can also be viewed as a tool for representing higher-order term proximity in text visualization. The new field weights reflect term proximity and can be used to calculate document scores for term retrieval. Experimental results show that the new document scoring model reflects term proximity well, and that the new document scoring scheme ScoreComp, combined with interval scoring, is more sensitive than the scheme FreqComp combined with interval scoring.
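
The abstract gives no formulas, so the following Python sketch is only an illustration of the ideas it names, not the authors' method. It finds the minimal intervals that cover all query terms in a field, scores each interval so that longer intervals score lower and query-term occurrences (tf) inside the interval raise the score, applies a higher factor near the front and rear of the field, and normalizes the per-field scores into field weights. The function names, the `edge` and `boost` parameters, and the scoring expression `factor * tf / length` are all assumptions made for this example. A plain sliding window stands in for the interval tree.

```python
from collections import Counter


def position_factor(start, end, field_len, edge=0.2, boost=1.5):
    """Hypothetical factor: boost intervals centred in the first or last `edge` share of the field."""
    center = (start + end) / 2 / max(field_len - 1, 1)
    return boost if center <= edge or center >= 1 - edge else 1.0


def covering_intervals(tokens, query_terms):
    """Return minimal (start, end) windows that contain every query term (sliding window)."""
    need = set(query_terms)
    if not need:
        return []
    counts = Counter()
    intervals = []
    left = 0
    for right, tok in enumerate(tokens):
        if tok in need:
            counts[tok] += 1
        if len(counts) < len(need):
            continue
        # Shrink from the left while the window still covers every query term.
        while tokens[left] not in need or counts[tokens[left]] > 1:
            if tokens[left] in need:
                counts[tokens[left]] -= 1
            left += 1
        intervals.append((left, right))
        # Drop the leftmost term so the next window found is a different one.
        counts[tokens[left]] -= 1
        if counts[tokens[left]] == 0:
            del counts[tokens[left]]
        left += 1
    return intervals


def field_score(tokens, query_terms):
    """Assumed scoring: position factor * query-term count in interval / interval length."""
    need = set(query_terms)
    total = 0.0
    for start, end in covering_intervals(tokens, need):
        length = end - start + 1
        tf_in_interval = sum(1 for t in tokens[start:end + 1] if t in need)
        total += position_factor(start, end, len(tokens)) * tf_in_interval / length
    return total


def field_weights(fields, query_terms):
    """Normalize per-field scores so the weights sum to 1 (all zeros if nothing matches)."""
    scores = {name: field_score(toks, query_terms) for name, toks in fields.items()}
    norm = sum(scores.values())
    if norm == 0:
        return {name: 0.0 for name in fields}
    return {name: s / norm for name, s in scores.items()}


if __name__ == "__main__":
    fields = {"title": ["a", "b"], "body": "a x b x x a b".split()}
    print(field_weights(fields, ["a", "b"]))
```

In this toy example the body field receives the larger weight because it contains three covering intervals, two of them near the edges of the field. How the paper's ScoreComp and FreqComp schemes combine such weights into a document score is not stated in the abstract and is not modelled here.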
