A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts

B. Al-Shboul, Duha Al-Darras, D. A. Qudah
Published in: Proceedings of the 14th International Conference on Management of Digital EcoSystems
Publication date: 2022-10-19
DOI: 10.1145/3508397.3564848 (https://doi.org/10.1145/3508397.3564848)
Citations: 0

Abstract

An enormous number of microblogs are created and posted on the web each day. Many of them repeat the same content or address the same topic. Detecting repetitive content can support applications such as question answering and trending-topic detection. In this research, we propose a model that detects paraphrasing among Arabic tweets and also identifies tweets that belong to the same topic. The model is based on Latent Dirichlet Allocation (LDA) topic modeling and on semantic text expansion that draws on external resources, namely BabelNet and Wikipedia. Tweets from multiple Arabic news agencies were collected, preprocessed, and divided into two groups. The first group was used to build the topic model; the tweets in the second group were paired and classified based on their topic distributions. The results are promising in terms of precision on tweet pairs with a certain time overlap. The best reported precision, 80.1%, was achieved using Wikipedia embedded content in the stemmed text mode with a large number of LDA topics.
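
The abstract does not say how two topic distributions are compared or what counts as a "time overlap", so the following is only a rough sketch of one way the pairing step could look. It assumes each tweet already has an LDA topic distribution (a list of probabilities over the same topics), uses Jensen-Shannon divergence as the comparison measure, and reads "time overlap" as a maximum gap between posting times. The function names, the 0.1 divergence threshold, and the 24-hour window are all hypothetical placeholders, not values from the paper.

```python
import math
from datetime import datetime, timedelta


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for disjoint ones.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    # Mixture distribution used by the Jensen-Shannon definition.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def is_candidate_paraphrase(theta_a, theta_b, time_a, time_b,
                            max_divergence=0.1,
                            max_gap=timedelta(hours=24)):
    """Flag a tweet pair as a paraphrase candidate.

    ASSUMPTIONS (not from the paper): the divergence measure, both
    thresholds, and the interpretation of "time overlap" as a maximum
    gap between posting times.
    """
    if abs(time_a - time_b) > max_gap:
        return False
    return js_divergence(theta_a, theta_b) <= max_divergence


if __name__ == "__main__":
    posted = datetime(2022, 10, 19, 12, 0)
    # Two made-up topic distributions over three topics.
    print(is_candidate_paraphrase([0.8, 0.1, 0.1], [0.75, 0.15, 0.1],
                                  posted, posted + timedelta(hours=2)))
```

In practice the thresholds would have to be tuned on labeled tweet pairs, and the topic distributions would come from an LDA model trained on the first group of tweets, optionally after expanding each tweet with related terms from BabelNet or Wikipedia.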