Classification of colloquial Arabic tweets in real-time to detect high-risk floods

2017 International Conference On Social Media, Wearable And Web Analytics (Social Media). Pub Date: 2017-06-01. DOI: 10.1109/SOCIALMEDIA.2017.8057358

Waleed Alabbas, Haider M. Al-Khateeb, Ali Mansour, G. Epiphaniou, Ingo Frommholz


Citations: 24

Abstract

Twitter has eased real-time information flow for decision makers and is one of the key enablers of Open-source Intelligence (OSINT). Tweet mining has recently been used in incident response to estimate the location of, and damage caused by, hurricanes and earthquakes. We aim to detect a specific type of high-risk natural disaster that occurs frequently and causes casualties in the Arabian Peninsula, namely floods. We investigate how to achieve accurate classification of short, informal (colloquial) Arabic text of the kind usually used on Twitter, which is highly inconsistent and has received very little attention in this field. First, we provide a thorough technical demonstration consisting of the following stages: data collection (Twitter REST API), labelling, text pre-processing, data division and representation, and model training. We implemented this pipeline in R for our experiment. We then evaluate classifier performance in four experiments that measure the impact of different stemming techniques on the following classifiers: SVM, J48, C5.0, NNET, NB and k-NN. The dataset consisted of 1434 tweets in total. Our findings show that the Support Vector Machine (SVM) stood out in terms of accuracy (F1 = 0.933). Furthermore, McNemar's test shows that using SVM without stemming on colloquial Arabic is significantly better than using it with stemming techniques.
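As a rough illustration of the two evaluation tools named in the abstract, the sketch below computes an F1 score and runs McNemar's test on paired predictions from two classifiers. It is not the authors' code: the paper reports using R, whereas this is a Python sketch using only the standard library. The function names `f1_score` and `mcnemar`, the toy labels, and the choice of the continuity-corrected chi-square form of McNemar's test are all assumptions made for illustration; the abstract does not say which variant of the test was applied.

```python
from math import erfc, sqrt


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mcnemar(y_true, pred_a, pred_b):
    """McNemar's test (chi-square form with continuity correction).

    Only the discordant tweets matter:
      b = classifier A correct, classifier B wrong
      c = classifier A wrong, classifier B correct
    Returns (b, c, chi2, p_value).
    """
    rows = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, o in rows if a == t and o != t)
    c = sum(1 for t, a, o in rows if a != t and o == t)
    if b + c == 0:
        # The classifiers never disagree on correctness: no evidence of a difference.
        return b, c, 0.0, 1.0
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return b, c, chi2, p_value


if __name__ == "__main__":
    # Hypothetical toy labels (1 = flood-related, 0 = not); not data from the paper.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    no_stem = [1, 1, 1, 0, 0, 0, 0, 1]   # e.g. a model trained without stemming
    stemmed = [1, 0, 1, 0, 0, 1, 0, 1]   # e.g. the same model with stemming
    print("F1 without stemming:", f1_score(y_true, no_stem))
    print("F1 with stemming:", f1_score(y_true, stemmed))
    print("McNemar (b, c, chi2, p):", mcnemar(y_true, no_stem, stemmed))
```

A small p-value suggests that the two classifiers' error patterns differ by more than chance would explain, which is the kind of comparison the abstract describes between SVM with and without stemming. With few discordant pairs, an exact binomial version of the test is generally preferable to the chi-square approximation used here.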
