SE-MSLC: Semantic Entropy-Driven Keyword Analysis and Multi-Stage Logical Combination Recall for Search Engine.

IF 2.0, Region 3 (Physics and Astrophysics), JCR Q2, PHYSICS, MULTIDISCIPLINARY

Entropy Pub Date : 2025-09-16 DOI:10.3390/e27090961

Haihua Lu, Liang Yu, Yantao He, Liwei Tian

Entropy, vol. 27, no. 9. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12468705/pdf/

Citations: 0

Abstract

Information retrieval is a critical means of obtaining required information accurately and efficiently from massive amounts of data. In this paper, we propose an information retrieval framework (SE-MSLC) that uses information theory to improve the effectiveness of inverted-index retrieval and thereby deliver higher-quality results in intelligent vertical-domain search engines. First, in the query understanding module, we propose a semantic entropy-driven keyword importance analysis method (SE-KIA). It combines search query logs, the search engine's corpus, and the theory of semantic entropy so that the search engine can dynamically adjust the weights of query keywords, improving its ability to recognize user intent. Second, in the recall module, we propose a hybrid recall strategy (HRS-MSLC) that combines a multi-stage strategy with a logical combination strategy. It applies multi-granularity word segmentation to the query, recalls the resulting keywords separately through multi-queue recall, and considers both the "AND" and "OR" logical relationships between the keywords. By systematically managing retrieval uncertainty and prioritizing keywords with high information content, it balances the quantity of retrieval results against their relevance to the query. Finally, we evaluate our methods experimentally using Hit Rate@K and case analysis. The results show that the proposed method improves Hit Rate@1 by 7.3% and Hit Rate@3 by 6.6% while resolving the bad cases observed in our vertical-domain search engine.
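The abstract does not give formulas, so the following is only a minimal sketch of the general idea, under stated assumptions. It builds a toy inverted index, weights each keyword by its self-information (a plain information-content proxy, not the paper's SE-KIA formula), recalls with "AND" over progressively fewer high-weight keywords, falls back to "OR", and computes Hit Rate@K. All function names, the `min_results` threshold, and the toy documents are hypothetical.

```python
import math
from collections import defaultdict


def build_inverted_index(docs):
    """Map each whitespace-separated term to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def keyword_weights(keywords, index, n_docs):
    """Assumed weight: self-information -log2 of the smoothed document frequency.

    This is a stand-in for keyword importance, NOT the paper's method.
    """
    weights = {}
    for kw in keywords:
        df = len(index.get(kw, ()))
        # Add-one smoothing keeps the weight finite for unseen keywords.
        weights[kw] = -math.log2((df + 1) / (n_docs + 1))
    return weights


def staged_recall(keywords, index, n_docs, min_results=2):
    """AND over all keywords first, then drop the lowest-weight keyword and retry.

    If the AND stages yield fewer than min_results documents, an OR stage
    appends every document matching any keyword.
    """
    weights = keyword_weights(keywords, index, n_docs)
    ranked = sorted(keywords, key=lambda k: weights[k], reverse=True)
    results = []
    for size in range(len(ranked), 0, -1):
        postings = [index.get(k, set()) for k in ranked[:size]]
        for doc_id in sorted(set.intersection(*postings)):
            if doc_id not in results:
                results.append(doc_id)
        if len(results) >= min_results:
            return results
    # OR fallback: union of all posting lists, appended after the AND hits.
    any_match = set().union(*(index.get(k, set()) for k in ranked))
    for doc_id in sorted(any_match):
        if doc_id not in results:
            results.append(doc_id)
    return results


def hit_rate_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant doc appears in the top k results."""
    hits = sum(1 for q, docs in ranked_lists.items() if relevant[q] in docs[:k])
    return hits / len(ranked_lists)


# Toy corpus (hypothetical data for illustration only).
docs = {
    "d1": "entropy keyword analysis search engine",
    "d2": "search engine recall strategy",
    "d3": "keyword weights for search",
    "d4": "cooking recipes",
}
index = build_inverted_index(docs)

print(staged_recall(["search", "engine"], index, len(docs)))
print(staged_recall(["search", "entropy", "recall"], index, len(docs)))
```

In this sketch the strict "AND" stages favor relevance and the "OR" stage favors quantity, which mirrors the trade-off the abstract describes; the real system's staging and weighting may differ.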

Source journal

Entropy (PHYSICS, MULTIDISCIPLINARY)

CiteScore: 4.90

Self-citation rate: 11.10%

Articles published: 1580

Review time: 21.05 days

Journal description: Entropy (ISSN 1099-4300), an international and interdisciplinary journal of entropy and information studies, publishes reviews, regular research papers, and short notes. Our aim is to encourage scientists to publish their theoretical and experimental details as fully as possible. There is no restriction on the length of papers. Where computations or experiments are involved, the details must be provided so that the results can be reproduced.