A Semantic Text Expansion for Paraphrasing Identification in Arabic Microblog Posts
B. Al-Shboul, Duha Al-Darras, D. A. Qudah
Proceedings of the 14th International Conference on Management of Digital EcoSystems, 2022-10-19
https://doi.org/10.1145/3508397.3564848
An enormous number of microblogs are created and posted on the web each day. Many of them repeat the same content or address the same topic. Detecting repetitive content can support applications such as question answering and trending topic detection. In this research, we propose a model that detects paraphrasing among Arabic tweets and identifies tweets belonging to the same topic. The model is based on Latent Dirichlet Allocation (LDA) topic modeling and on semantic text expansion using external resources, namely BabelNet and Wikipedia. Tweets from multiple Arabic news agencies were collected, preprocessed, and divided into two groups. The first group was used to build the topic model; the tweets in the second group were paired and classified based on their topic distributions. The results are promising in terms of precision on tweet pairs with a certain time overlap. The best reported precision is 80.1%, achieved using Wikipedia-embedded content in stemmed-text mode with a large number of LDA topics.
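The pair-classification step can be illustrated with a minimal sketch. The abstract does not state how topic distributions are compared, so the Jensen-Shannon divergence, the threshold value, and the function names `js_divergence` and `classify_pairs` are all assumptions made for illustration. The three topic distributions are invented toy values, not data from the study.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Assumption: the paper does not name its comparison measure; this is
    one common choice for comparing LDA topic distributions.
    """
    def kl(a, b):
        # Terms with a == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def classify_pairs(topic_dists, threshold=0.1):
    """Flag each tweet pair as a paraphrase candidate when the divergence
    between their topic distributions is at or below `threshold`.

    `threshold` is a hypothetical value chosen for this toy example.
    """
    labels = {}
    for (id_a, p), (id_b, q) in combinations(topic_dists.items(), 2):
        labels[(id_a, id_b)] = js_divergence(p, q) <= threshold
    return labels


# Toy per-tweet topic distributions over three LDA topics (invented values).
topic_dists = {
    "t1": [0.70, 0.20, 0.10],
    "t2": [0.65, 0.25, 0.10],
    "t3": [0.05, 0.15, 0.80],
}

if __name__ == "__main__":
    for pair, is_candidate in classify_pairs(topic_dists).items():
        print(pair, "paraphrase candidate" if is_candidate else "different")
```

In a full pipeline the distributions would come from an LDA model trained on the first group of tweets, after each tweet has been expanded with BabelNet or Wikipedia content. Both the threshold and the number of topics would need tuning on labeled pairs.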