Reduce++: Unsupervised Content-Based Approach for Duplicate Result Detection in Search Engines

Signal Processing and Vision. Pub Date: 2022-12-17. DOI: 10.5121/csit.2022.122211

Zahraa Chreim, Hussein Hazimeh, Hassan Harb, Fouad Hannoun, Karl Daher, E. Mugellini, Omar Abou Khaled


Abstract

Search engines are among the most popular web services on the World Wide Web. They facilitate the process of finding information through a query-result mechanism. However, the results returned by search engines contain many duplicates. In this paper, we introduce a new content-type-based similarity computation method to address this problem. Our approach divides each webpage into different types of content, such as title, subtitles, and body. For each type, we then select a suitable similarity measure. Next, we combine the per-type similarity scores with a weighted formula to obtain the final similarity score between two documents. Finally, we propose a new graph-based algorithm that clusters search results according to their similarity. We empirically evaluated our results with Agglomerative Clustering and achieved about a 61% reduction in web pages, a Silhouette coefficient of 0.2757, a Davies-Bouldin score of 0.1269, and a Calinski-Harabasz score of 85.
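The pipeline described in the abstract (per-content-type similarity, a weighted sum, then graph-based grouping) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the weights in `WEIGHTS`, the use of token-set Jaccard similarity for every content type, the `threshold` value, and the use of connected components as a stand-in for the authors' graph-based clustering algorithm, which the abstract does not specify.

```python
from itertools import combinations

# Hypothetical weights per content type; the abstract does not give the actual values.
WEIGHTS = {"title": 0.3, "subtitles": 0.2, "body": 0.5}


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity (assumed measure, used for every content type)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty fields are treated as identical
    return len(ta & tb) / len(ta | tb)


def page_similarity(p: dict, q: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of the per-content-type similarity scores."""
    return sum(w * jaccard(p.get(k, ""), q.get(k, "")) for k, w in weights.items())


def cluster(pages: list, threshold: float = 0.8) -> list:
    """Connected components of the graph whose edges join pages with
    similarity >= threshold (a stand-in, not the paper's algorithm)."""
    parent = list(range(len(pages)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(pages)), 2):
        if page_similarity(pages[i], pages[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Toy search results: the first two pages are exact duplicates.
pages = [
    {"title": "Python list sorting", "subtitles": "sort and sorted",
     "body": "use sorted to sort a list in python"},
    {"title": "Python list sorting", "subtitles": "sort and sorted",
     "body": "use sorted to sort a list in python"},
    {"title": "Rust ownership rules", "subtitles": "borrowing",
     "body": "each value has a single owner"},
]
clusters = cluster(pages)
print(clusters)
```

Keeping one representative per cluster would reduce these three toy results to two. In a fuller version, each content type would likely get its own similarity measure, as the abstract suggests, rather than sharing one.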


