WANLP@ACL 2019, Pub Date: 2019-10-16, DOI: 10.18653/v1/W19-4602
Ahmed Y. Tawfik, M. Emam, Khaled Essam, Robert Nabil, Hany Hassan
{"title":"Morphology-aware Word-Segmentation in Dialectal Arabic Adaptation of Neural Machine Translation","authors":"Ahmed Y. Tawfik, M. Emam, Khaled Essam, Robert Nabil, Hany Hassan","doi":"10.18653/v1/W19-4602","DOIUrl":"https://doi.org/10.18653/v1/W19-4602","url":null,"abstract":"Parallel corpora available for building machine translation (MT) models for dialectal Arabic (DA) are rather limited. The scarcity of resources has prompted the use of Modern Standard Arabic (MSA) abundant resources to complement the limited dialectal resource. However, dialectal clitics often differ between MSA and DA. This paper compares morphology-aware DA word segmentation to other word segmentation approaches like Byte Pair Encoding (BPE) and Sub-word Regularization (SR). A set of experiments conducted on Egyptian Arabic (EA), Levantine Arabic (LA), and Gulf Arabic (GA) show that a sufficiently accurate morphology-aware segmentation used in conjunction with BPE outperforms the other word segmentation approaches.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116744239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-02, DOI: 10.18653/v1/W19-4620
Bushra Algotiml, AbdelRahim Elmadany, Walid Magdy
{"title":"Arabic Tweet-Act: Speech Act Recognition for Arabic Asynchronous Conversations","authors":"Bushra Algotiml, AbdelRahim Elmadany, Walid Magdy","doi":"10.18653/v1/W19-4620","DOIUrl":"https://doi.org/10.18653/v1/W19-4620","url":null,"abstract":"Speech acts are the actions that a speaker intends when performing an utterance within conversations. In this paper, we propose speech act classification for asynchronous conversations on Twitter using multiple machine learning methods including SVM and deep neural networks. We applied the proposed methods on the ArSAS tweets dataset. The obtained results show the superiority of deep learning methods compared to SVMs, where Bi-LSTM managed to achieve an accuracy of 87.5% and a macro-averaged F1 score of 61.5%. We believe that our results are the first to be reported on the task of speech-act recognition for asynchronous conversations on Arabic Twitter.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130301945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4619
Imad Zeroual, Dirk Goldhahn, Thomas Eckart, A. Lakhouaja
{"title":"OSIAN: Open Source International Arabic News Corpus - Preparation and Integration into the CLARIN-infrastructure","authors":"Imad Zeroual, Dirk Goldhahn, Thomas Eckart, A. Lakhouaja","doi":"10.18653/v1/W19-4619","DOIUrl":"https://doi.org/10.18653/v1/W19-4619","url":null,"abstract":"The World Wide Web has become a fundamental resource for building large text corpora. Broadcasting platforms such as news websites are rich sources of data regarding diverse topics and form a valuable foundation for research. The Arabic language is extensively utilized on the Web. Still, Arabic is a relatively under-resourced language in terms of availability of freely annotated corpora. This paper presents the first version of the Open Source International Arabic News (OSIAN) corpus. The corpus data was collected from international Arabic news websites, all being freely available on the Web. The corpus consists of about 3.5 million articles comprising more than 37 million sentences and roughly 1 billion tokens. It is encoded in XML; each article is annotated with metadata information. Moreover, each word is annotated with lemma and part-of-speech. The described corpus is processed, archived and published into the CLARIN infrastructure. This publication includes descriptive metadata via OAI-PMH, direct access to the plain text material (available under Creative Commons Attribution-Non-Commercial 4.0 International License - CC BY-NC 4.0), and integration into the WebLicht annotation platform and CLARIN’s Federated Content Search (FCS).","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117304691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4626
Youssef Fares, Zeyad El-Zanaty, K. Abdel-Salam, Muhammed Ezzeldin, Aliaa Mohamed, Karim El-Awaad, Marwan Torki
{"title":"Arabic Dialect Identification with Deep Learning and Hybrid Frequency Based Features","authors":"Youssef Fares, Zeyad El-Zanaty, K. Abdel-Salam, Muhammed Ezzeldin, Aliaa Mohamed, Karim El-Awaad, Marwan Torki","doi":"10.18653/v1/W19-4626","DOIUrl":"https://doi.org/10.18653/v1/W19-4626","url":null,"abstract":"Studies on Dialectal Arabic are growing more important by the day as it becomes the primary written and spoken form of Arabic online in informal settings. Among the important problems that should be explored is that of dialect identification. This paper reports different techniques that can be applied towards this goal and reports their performance on the Multi Arabic Dialect Applications and Resources (MADAR) Arabic Dialect Corpora. Our results show that improving on traditional systems using frequency-based features and non-deep-learning classifiers is a challenging task. We propose different models based on different word and document representations. Our top model is able to achieve a macro-averaged F1 score of 65.66 on MADAR’s small-scale parallel corpus of 25 dialects and Modern Standard Arabic (MSA).","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130236348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4631
Gael de Francony, Victor Guichard, Praveen Joshi, Haithem Afli, Abdessalam Bouchekif
{"title":"Hierarchical Deep Learning for Arabic Dialect Identification","authors":"Gael de Francony, Victor Guichard, Praveen Joshi, Haithem Afli, Abdessalam Bouchekif","doi":"10.18653/v1/W19-4631","DOIUrl":"https://doi.org/10.18653/v1/W19-4631","url":null,"abstract":"In this paper, we present two approaches for Arabic Fine-Grained Dialect Identification. The first approach is based on Recurrent Neural Networks (BLSTM, BGRU) using hierarchical classification. The main idea is to separate the classification process for a sentence from a given text into two stages. We start with a higher level of classification (8 classes) and then proceed to the finer-grained classification (26 classes). The second approach is given by a voting system based on Naive Bayes and Random Forest. Our system achieves an F1 score of 63.02% on the subtask evaluation dataset.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133580752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4629
Bashar Talafha, Wael Farhan, Ahmed Altakrouri, Hussein T. Al-Natsheh
{"title":"Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification","authors":"Bashar Talafha, Wael Farhan, Ahmed Altakrouri, Hussein T. Al-Natsheh","doi":"10.18653/v1/W19-4629","DOIUrl":"https://doi.org/10.18653/v1/W19-4629","url":null,"abstract":"Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods, most of which showed improvement in the final model, such as word embedding, TF-IDF, and other tweet features. Our results show that a simple LinearSVC can outperform any complex deep learning model given a set of curated features. With a relatively complex user voting mechanism, we were able to achieve a Macro-Averaged F1-score of 71.84% on MADAR shared subtask-2. Our best submitted model ranked second out of all participating teams.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120842874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4610
Ahmed El-Kishky, Xingyu Fu, Aseel Addawood, N. Sobh, Clare R. Voss, Jiawei Han
{"title":"Constrained Sequence-to-sequence Semitic Root Extraction for Enriching Word Embeddings","authors":"Ahmed El-Kishky, Xingyu Fu, Aseel Addawood, N. Sobh, Clare R. Voss, Jiawei Han","doi":"10.18653/v1/W19-4610","DOIUrl":"https://doi.org/10.18653/v1/W19-4610","url":null,"abstract":"In this paper, we tackle the problem of “root extraction” from words in the Semitic language family. A challenge in applying natural language processing techniques to these languages is the data sparsity problem that arises from their rich internal morphology, where the substructure is inherently non-concatenative and morphemes are interdigitated in word formation. While previous automated methods have relied on human-curated rules or multiclass classification, they have not fully leveraged the various combinations of regular, sequential concatenative morphology within the words and the internal interleaving within templatic stems of roots and patterns. To address this, we propose a constrained sequence-to-sequence root extraction method. Experimental results show our constrained model outperforms a variety of methods at root extraction. Furthermore, by enriching word embeddings with resulting decompositions, we show improved results on word analogy, word similarity, and language modeling tasks.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127335818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4604
Hala Mulki, Hatem Haddad, Mourad Gridach, Ismail Babaoglu
{"title":"Syntax-Ignorant N-gram Embeddings for Sentiment Analysis of Arabic Dialects","authors":"Hala Mulki, Hatem Haddad, Mourad Gridach, Ismail Babaoglu","doi":"10.18653/v1/W19-4604","DOIUrl":"https://doi.org/10.18653/v1/W19-4604","url":null,"abstract":"Arabic sentiment analysis models have employed compositional embedding features to represent the Arabic dialectal content. These embeddings are usually composed via ordered, syntax-aware composition functions and learned within deep neural frameworks. With the free word order and the varying syntax nature across the different Arabic dialects, a sentiment analysis system developed for one dialect might not be efficient for the others. Here we present syntax-ignorant n-gram embeddings to be used in sentiment analysis of several Arabic dialects. The proposed embeddings were composed and learned using an unordered composition function and a shallow neural model. Five datasets of different dialects were used to evaluate the produced embeddings in the sentiment analysis task. The obtained results revealed that our syntax-ignorant embeddings could outperform the word2vec model and both doc2vec variants, in addition to hand-crafted system baselines, while a competent performance was observed against baseline systems that adopted more complicated neural architectures.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127843305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4638
Bashar Talafha, A. Fadel, M. Al-Ayyoub, Y. Jararweh, Mohammad Al-Smadi, P. Juola
{"title":"Team JUST at the MADAR Shared Task on Arabic Fine-Grained Dialect Identification","authors":"Bashar Talafha, A. Fadel, M. Al-Ayyoub, Y. Jararweh, Mohammad Al-Smadi, P. Juola","doi":"10.18653/v1/W19-4638","DOIUrl":"https://doi.org/10.18653/v1/W19-4638","url":null,"abstract":"In this paper, we describe our team’s effort on the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. The task requires building a system capable of differentiating between 25 different Arabic dialects in addition to MSA. Our approach is simple. After preprocessing the data, we use Data Augmentation (DA) to enlarge the training data six times. We then build a language model and extract n-gram word-level and character-level TF-IDF features and feed them into an MNB classifier. Despite its simplicity, the resulting model performs really well producing the 4th highest F-measure and region-level accuracy and the 5th highest precision, recall, city-level accuracy and country-level accuracy among the participating teams.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123330389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
WANLP@ACL 2019, Pub Date: 2019-08-01, DOI: 10.18653/v1/W19-4608
Obeida ElJundi, Wissam Antoun, Nour El Droubi, Hazem M. Hajj, W. El-Hajj, K. Shaban
{"title":"hULMonA: The Universal Language Model in Arabic","authors":"Obeida ElJundi, Wissam Antoun, Nour El Droubi, Hazem M. Hajj, W. El-Hajj, K. Shaban","doi":"10.18653/v1/W19-4608","DOIUrl":"https://doi.org/10.18653/v1/W19-4608","url":null,"abstract":"Arabic is a complex language with limited resources, which makes it challenging to perform text classification tasks such as sentiment analysis accurately. The utilization of transfer learning (TL) has recently shown promising results for advancing accuracy of text classification in English. TL models are pre-trained on large corpora, and then fine-tuned on task-specific datasets. In particular, universal language models (ULMs), such as the recently developed BERT, have achieved state-of-the-art results in various NLP tasks in English. In this paper, we hypothesize that similar success can be achieved for Arabic. The work aims at supporting the hypothesis by developing the first Universal Language Model in Arabic (hULMonA - حلمنا meaning our dream), demonstrating its use for Arabic classification tasks, and demonstrating how a pre-trained multi-lingual BERT can also be used for Arabic. We then conduct a benchmark study to evaluate the success of both ULMs on Arabic sentiment analysis. Experiment results show that the developed hULMonA and multi-lingual ULM are able to generalize well to multiple Arabic data sets and achieve new state-of-the-art results in Arabic Sentiment Analysis for some of the tested sets.","PeriodicalId":268163,"journal":{"name":"WANLP@ACL 2019","volume":"206 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122598439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}