WANLP@ACL 2019 | Pub Date: 2019-08-01 | DOI: 10.18653/v1/W19-4628
Pruthwik Mishra, Vandan Mujadia
"Arabic Dialect Identification for Travel and Twitter Text"
Abstract: This paper presents the results of our experiments for the MADAR Shared Task on Arabic Fine-Grained Dialect Identification at WANLP 2019. Dialect identification is a prominent task in natural language processing, since downstream language modules can be improved based on its output. We explored features such as character and word n-grams and language model probabilities across several classifiers. Results show that these features improve dialect classification accuracy, and that traditional machine learning classifiers tend to outperform neural network models on this task in a low-resource setting.
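As a rough illustration of the character n-gram features paired with a traditional classifier that the abstract mentions, the sketch below implements a tiny multinomial Naive Bayes over character trigrams in pure Python. This is not the authors' system, and the dialect labels and sentences in the usage example below are invented placeholders.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Extract overlapping character n-grams, padded with spaces."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNB:
    """Multinomial Naive Bayes over character n-grams with add-one
    smoothing. Class priors are omitted for simplicity, which is
    equivalent to assuming balanced training classes."""

    def __init__(self, n=3):
        self.n = n
        self.class_counts = {}   # label -> Counter of n-grams
        self.class_totals = {}   # label -> total n-gram count
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.class_counts.setdefault(label, Counter()).update(grams)
            self.vocab.update(grams)
        self.class_totals = {c: sum(cnt.values())
                             for c, cnt in self.class_counts.items()}

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        V = len(self.vocab)
        best, best_lp = None, float("-inf")
        for c, counts in self.class_counts.items():
            # Smoothed log-likelihood of the text under class c.
            lp = sum(math.log((counts[g] + 1) / (self.class_totals[c] + V))
                     for g in grams)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Usage follows the usual fit/predict pattern: train on labelled sentences per dialect, then call `predict` on an unseen sentence.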
WANLP@ACL 2019 | Pub Date: 2019-08-01 | DOI: 10.18653/v1/W19-4634
Thomas Lippincott, Pamela Shapiro, Kevin Duh, Paul McNamee
"JHU System Description for the MADAR Arabic Dialect Identification Shared Task"
Abstract: Our submission to the MADAR shared task on Arabic dialect identification employed a language modeling technique called Prediction by Partial Matching (PPM), an ensemble of neural architectures, and additional data sources for training word embeddings and auxiliary language models. Several of these techniques provided small boosts in performance, though a simple character-level language model was a strong baseline, and a lower-order LM achieved the best performance on Subtask 2. Interestingly, word embeddings provided no consistent benefit, and ensembling struggled to outperform the best component submodel. This suggests the different architectures learn redundant information, and future work may focus on encouraging decorrelated learning.
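The character-level language model baseline described above can be sketched in miniature: train one character n-gram LM per dialect, then assign a test sentence to the dialect whose model gives it the highest probability. The bigram model below, with add-one smoothing, is a deliberately simplified stand-in for PPM (which uses variable-order contexts with an escape mechanism); the training strings and labels are toy placeholders, not shared-task data.

```python
from collections import Counter
import math


class CharBigramLM:
    """Character bigram language model with add-one smoothing;
    a toy stand-in for the higher-order PPM models in the paper."""

    def __init__(self):
        self.bigrams = Counter()
        self.unigrams = Counter()
        self.vocab = set()

    def train(self, text):
        padded = "^" + text + "$"          # sentence boundary markers
        for a, b in zip(padded, padded[1:]):
            self.bigrams[(a, b)] += 1
            self.unigrams[a] += 1
            self.vocab.update((a, b))

    def logprob(self, text):
        """Smoothed log P(text) under this model."""
        padded = "^" + text + "$"
        V = len(self.vocab) + 1            # +1 for unseen characters
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + V))
            for a, b in zip(padded, padded[1:])
        )


def identify(text, lms):
    """Pick the dialect whose LM assigns the highest log-probability."""
    return max(lms, key=lambda label: lms[label].logprob(text))
```

The same selection rule generalises directly to higher-order models: only `logprob` changes, the arg-max over per-dialect models stays the same.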
WANLP@ACL 2019 | Pub Date: 2019-08-01 | DOI: 10.18653/v1/W19-4606
Sawsan Alqahtani, Hanan Aldarmaki, Mona T. Diab
"Homograph Disambiguation through Selective Diacritic Restoration"
Abstract: Lexical ambiguity, a challenging phenomenon in all natural languages, is particularly prevalent in languages whose diacritics tend to be omitted in writing, such as Arabic. Omitting diacritics increases the number of homographs: different words with the same spelling. Diacritic restoration could theoretically help disambiguate these words, but in practice the increase in overall sparsity leads to performance degradation in NLP applications. In this paper, we propose approaches for automatically marking a subset of words for diacritic restoration, which leads to selective homograph disambiguation. Compared to full or no diacritic restoration, these approaches yield selectively-diacritized datasets that balance sparsity and lexical disambiguation. We evaluate the various selection strategies extrinsically on several downstream applications: neural machine translation, part-of-speech tagging, and semantic textual similarity. Our experiments on Arabic show promising results, where our devised selective diacritization strategies lead to more balanced and consistent performance in downstream applications.
WANLP@ACL 2019 | Pub Date: 2019-08-01 | DOI: 10.18653/v1/W19-4637
Chiyu Zhang, Muhammad Abdul-Mageed
"No Army, No Navy: BERT Semi-Supervised Learning of Arabic Dialects"
Abstract: We present our deep learning system submitted to MADAR Shared Task 2, which focused on Twitter user dialect identification. We develop tweet-level identification models based on GRUs and BERT in supervised and semi-supervised settings. We then introduce a simple yet effective method of porting tweet-level labels to the level of users. Our system ranked first in the competition, with a 71.70% macro F1 score and 77.40% accuracy.
WANLP@ACL 2019 | Pub Date: 2019-07-28 | DOI: 10.18653/v1/W19-4605
Raki Lachraf, El Moatez Billah Nagoudi, Youcef Ayachi, Ahmed Abdelali, D. Schwab
"ArbEngVec: Arabic-English Cross-Lingual Word Embedding Model"
Abstract: Word embeddings (WE) are increasingly popular and widely applied in many natural language processing (NLP) applications due to their effectiveness in capturing the semantic properties of words; machine translation (MT), information retrieval (IR), and information extraction (IE) are among such areas. In this paper, we present ArbEngVec, an open-source release of several Arabic-English cross-lingual word embedding models. To train our bilingual models, we use a large dataset of more than 93 million Arabic-English parallel sentence pairs. In addition, we perform both extrinsic and intrinsic evaluations of the different word embedding model variants. The extrinsic evaluation assesses model performance on cross-language Semantic Textual Similarity (STS), while the intrinsic evaluation is based on the Word Translation (WT) task.
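The Word Translation evaluation referred to above is typically a nearest-neighbour search in the shared embedding space: a source word translates to whichever target word has the most similar vector. The sketch below shows that retrieval step with hand-made 2-dimensional vectors; the words and vectors are invented for illustration and do not come from ArbEngVec.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def translate(word, src_vecs, tgt_vecs):
    """Word Translation by nearest-neighbour search: return the target
    word whose vector is closest (by cosine) to the source word's vector
    in the shared cross-lingual space."""
    src = src_vecs[word]
    return max(tgt_vecs, key=lambda w: cosine(src, tgt_vecs[w]))
```

Precision@1 on a bilingual test dictionary is then just the fraction of source words for which `translate` returns the gold target.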
WANLP@ACL 2019 | Pub Date: 2019-05-24 | DOI: 10.18653/v1/W19-4621
Ibrahim Abu Farha, Walid Magdy
"Mazajak: An Online Arabic Sentiment Analyser"
Abstract: Sentiment analysis (SA) is one of the most useful natural language processing applications. The literature is flooded with papers and systems addressing this task, but most of the work focuses on English. In this paper, we present "Mazajak", an online system for Arabic SA. The system is based on a deep learning model that achieves state-of-the-art results on many Arabic dialect datasets, including SemEval 2017 and ASTD. The availability of such a system should assist various applications and research that rely on sentiment analysis as a tool.
WANLP@ACL 2019 | DOI: 10.18653/v1/W19-4636
Mohamed S. Elaraby, A. Zahran
"A Character Level Convolutional BiLSTM for Arabic Dialect Identification"
Abstract: In this paper, we describe the CU-RAISA team contribution to the 2019 MADAR Shared Task 2, which focused on Twitter user fine-grained dialect identification. Among participating teams, our system ranked 4th, with a 61.54% macro F1 measure. Our system is a character-level convolutional bidirectional long short-term memory network trained on data from 2,000 users. We show that training on a user's concatenated tweets as input is superior to training on the user's tweets separately and assigning the user's label as the mode of the per-tweet predictions.
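The per-tweet-then-mode aggregation that this abstract compares against is simple to state precisely; a minimal sketch (the dialect labels in the test are invented placeholders):

```python
from collections import Counter


def user_label(tweet_predictions):
    """Aggregate per-tweet dialect predictions into one user-level label
    by majority vote (the mode). Counter.most_common breaks ties by
    first-encountered order, so ties go to the earliest-seen label."""
    if not tweet_predictions:
        raise ValueError("need at least one tweet prediction")
    return Counter(tweet_predictions).most_common(1)[0][0]
```

The paper's finding is that classifying the concatenation of a user's tweets in one pass outperforms this vote over independent per-tweet predictions.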
WANLP@ACL 2019 | DOI: 10.18653/v1/W19-4623
P. Pribán, Stephen Eugene Taylor
"ZCU-NLP at MADAR 2019: Recognizing Arabic Dialects"
Abstract: In this paper, we present our systems for the MADAR Shared Task: Arabic Fine-Grained Dialect Identification. The shared task consists of two subtasks. The goal of Subtask 1 (S-1) is to detect the Arabic city dialect of a given text, and the goal of Subtask 2 (S-2) is to predict the country of origin of a Twitter user from the tweets they posted. In S-1, our proposed systems are based on language modelling: we use language models to extract features that are then used as input to other machine learning algorithms. We also experimented with recurrent neural networks (RNNs), but these experiments showed that simpler machine learning algorithms are more successful. Our system achieves a 0.658 macro F1 score, ranking 6th out of 19 teams in S-1, and ranks 7th in S-2 with a 0.475 macro F1 score.