Similarity Detection of Time-Sensitive Online News Articles Based on RSS Feeds and Contextual Data

Q2 Computer Science

Annals of Emerging Technologies in Computing. Pub Date: 2023-01-01. DOI: 10.33166/aetic.2023.01.006

Mohammad Daoud


Citations: 0

Abstract

This article addresses the challenging problem of detecting similarity between time-sensitive online news articles. The problem is approached with a novel methodology that uses supervised learning algorithms with carefully selected semantic, lexical, and temporal features, covering both content and context. The proposed approach considers not only the textual content, which is well studied but may yield misleading results on its own, but also the context, community engagement, and community-deduced importance of each news article. This paper details the major procedures of title-pair pre-processing, analysis of lexical units, feature engineering, and similarity measurement. Thousands of web articles are published every second, so it is essential to determine the similarity of these articles efficiently, without spending time on unnecessary processing of the article bodies. Hence, the proposed approach focuses on short content (titles) and context. The experiment showed high precision and accuracy on a Really Simple Syndication (RSS) dataset of 8000 Arabic news article pairs collected automatically from 10 different news sources. The proposed approach achieved an accuracy of 0.81, and contextual features increased both accuracy and precision. The algorithm's output achieved a Pearson correlation coefficient of 0.89 with the evaluations of two human judges. The results outperform state-of-the-art systems on Arabic news articles.
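As a rough illustration of the kind of title-pair features and evaluation the abstract describes, the following Python sketch computes one lexical feature and one temporal feature for a pair of titles, plus a Pearson correlation between system scores and human ratings. The abstract does not specify the actual features, so everything here is an assumption: token-level Jaccard overlap stands in for the lexical feature, the gap in publication time stands in for the temporal feature, whitespace tokenization stands in for real Arabic pre-processing, and all function names are hypothetical. The supervised classifier itself is not shown.

```python
from datetime import datetime
from math import sqrt


def tokenize(title: str) -> set[str]:
    # Assumption: lowercase + whitespace split as a placeholder for
    # proper Arabic normalization and stemming.
    return set(title.lower().split())


def jaccard(title_a: str, title_b: str) -> float:
    # Hypothetical lexical feature: shared tokens over all tokens.
    ta, tb = tokenize(title_a), tokenize(title_b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def hours_apart(time_a: datetime, time_b: datetime) -> float:
    # Hypothetical temporal feature: absolute publication gap in hours.
    return abs((time_a - time_b).total_seconds()) / 3600.0


def pair_features(title_a: str, time_a: datetime,
                  title_b: str, time_b: datetime) -> list[float]:
    # Feature vector that a supervised classifier could consume.
    return [jaccard(title_a, title_b), hours_apart(time_a, time_b)]


def pearson(xs: list[float], ys: list[float]) -> float:
    # Pearson's correlation coefficient between two equal-length lists.
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)
```

A feature vector built this way would be paired with a similar/not-similar label for training, and `pearson` could compare predicted similarity scores against averaged human judgments, in the spirit of the evaluation reported above.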




Source journal

Annals of Emerging Technologies in Computing Computer Science-Computer Science (all)

CiteScore

3.50

Self-citation rate

0.00%

Publication volume