Reduce++: Unsupervised Content-Based Approach for Duplicate Result Detection in Search Engines

Zahraa Chreim, Hussein Hazimeh, Hassan Harb, Fouad Hannoun, Karl Daher, E. Mugellini, Omar Abou Khaled
Journal: Signal Processing and Vision
Published: 2022-12-17
DOI: 10.5121/csit.2022.122211 (https://doi.org/10.5121/csit.2022.122211)

Abstract

Search engines are among the most popular services on the World Wide Web. They help users find information through a query-result mechanism. However, the results they return often contain many duplicates. In this paper, we introduce a new content-type-based similarity computation method to address this problem. Our approach divides each web page into different types of content, such as title, subtitles, and body. For each type, we select a suitable similarity measure. We then combine the per-type similarity scores with a weighted formula to obtain the final similarity score between two documents. Finally, we propose a new graph-based algorithm that clusters search results according to their similarity. We empirically evaluated our results with Agglomerative Clustering and achieved about a 61% reduction in web pages, a Silhouette coefficient of 0.2757, a Davies-Bouldin score of 0.1269, and a Calinski-Harabasz score of 85.
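
The pipeline described in the abstract (per-field similarity, weighted combination, graph-based grouping) can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the field weights in `WEIGHTS`, the use of token-set Jaccard similarity for every field, the `threshold` value, and the use of connected components as the graph clustering step. The paper's actual measures, weights, and algorithm may differ.

```python
from itertools import combinations

# Hypothetical per-field weights; the abstract does not state the actual values.
WEIGHTS = {"title": 0.3, "subtitles": 0.2, "body": 0.5}


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity, used here as a stand-in measure for every field."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty fields are treated as identical
    return len(ta & tb) / len(ta | tb)


def page_similarity(p: dict, q: dict) -> float:
    """Weighted sum of the per-field similarity scores of two pages."""
    return sum(w * jaccard(p.get(f, ""), q.get(f, "")) for f, w in WEIGHTS.items())


def cluster(pages: list, threshold: float = 0.8) -> list:
    """Link pages whose similarity reaches the threshold (an assumed value),
    then return the connected components of that graph as lists of indices."""
    parent = list(range(len(pages)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(pages)), 2):
        if page_similarity(pages[i], pages[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


# Toy search results: the first two pages are exact duplicates.
pages = [
    {"title": "Python tutorial", "subtitles": "Basics", "body": "learn python step by step"},
    {"title": "Python tutorial", "subtitles": "Basics", "body": "learn python step by step"},
    {"title": "Cooking pasta", "subtitles": "Recipes", "body": "boil water and add salt"},
]
clusters = cluster(pages)
# Fraction of results removed if each cluster is shown as a single result.
reduction = 1 - len(clusters) / len(pages)
```

In this toy input the two identical pages fall into one cluster and the third page stays alone, so keeping one representative per cluster removes one of the three results. A reduction figure such as the 61% reported in the abstract would be computed the same way over real search results.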