A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts

Proceedings of the 14th International Conference on Management of Digital EcoSystems. Pub Date: 2022-10-19. DOI: 10.1145/3508397.3564848

B. Al-Shboul, Duha Al-Darras, D. A. Qudah


Citations: 0

Abstract

A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts

An enormous number of microblogs are created and posted on the web each day. Many of them repeat the same content or address the same topic. Detecting repetitive content can support applications such as question answering and trending topic detection. In this research, we propose a model that detects paraphrasing among Arabic tweets and also identifies tweets belonging to the same topic. The proposed model is based on Latent Dirichlet Allocation (LDA) topic modeling combined with semantic text expansion that draws on external resources, namely BabelNet and Wikipedia. Tweets from multiple Arabic news agencies were collected, preprocessed, and divided into two groups. The first group was used to build the topic model; the tweets in the second group were paired and classified based on their topic distributions. The results are promising in terms of precision on tweet pairs with a certain time overlap. The best reported precision is 80.1%, achieved using Wikipedia-embedded content on stemmed text with a large number of LDA topics.
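The pairing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the similarity measure, the decision threshold, or the size of the time window, so Jensen-Shannon divergence, `max_divergence`, and `window` are assumptions. `RELATED_TERMS` is a hypothetical lookup table that only stands in for BabelNet or Wikipedia, and the English tokens are used for readability in place of Arabic text. The topic distributions are assumed to come from an already trained LDA model.

```python
import math
from datetime import datetime, timedelta

# Hypothetical lookup table standing in for an external resource such as
# BabelNet or Wikipedia; the real resources are queried differently.
RELATED_TERMS = {
    "election": ["vote", "ballot"],
    "storm": ["weather", "rain"],
}


def expand_text(tokens, related=RELATED_TERMS):
    """Append related terms for every token that has an entry in the table."""
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(related.get(tok, []))
    return expanded


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def is_candidate_pair(dist_a, time_a, dist_b, time_b,
                      max_divergence=0.1, window=timedelta(hours=24)):
    """Flag a tweet pair when it is close in time and in topic distribution.

    Both max_divergence and window are assumed values, not figures
    reported in the abstract.
    """
    if abs(time_a - time_b) > window:
        return False
    return js_divergence(dist_a, dist_b) <= max_divergence


if __name__ == "__main__":
    t0 = datetime(2022, 1, 1, 12, 0)
    print(expand_text(["storm", "hits"]))
    print(is_candidate_pair([0.7, 0.2, 0.1], t0,
                            [0.65, 0.25, 0.1], t0 + timedelta(hours=2)))
```

Filtering on the time window first keeps the divergence computation off pairs that are too far apart to be compared, which matters when every tweet in the second group is paired with every other.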
