Word-level Language Identification and Localization in Code-Mixed Urdu-English Text
Authors: Eysha Raazia, Amina Bibi, Muhammad Umair Arshad
Published in: 2022 16th International Conference on Open Source Systems and Technologies (ICOSST), 2022-12-14
DOI: 10.1109/ICOSST57195.2022.10016848 (https://doi.org/10.1109/ICOSST57195.2022.10016848)
Citation count: 0
Abstract
Language Identification (LID) is important for most Natural Language Processing (NLP) tasks to work accurately. It remains challenging because of the range of dialects, and a major obstacle is the lack of tools for understanding context across multiple languages. In this paper, we propose a deep learning network, Bi-LSTM CNN, for word-level language identification and localization of Roman Urdu and English in code-switched text. We use a dataset of code-switched text collected from different social media platforms, which are a rich source of code-switched language; the dataset contains variant spellings of the same Roman Urdu words. Word embeddings come from the GoogleNews Word2Vec vectors. The embedding layer is followed by bidirectional long short-term memory (Bi-LSTM) layers together with a convolutional neural network (CNN). We experimented with different variations of LSTM and CNN on this dataset to achieve the best possible results, reaching 90.40% accuracy and a 90.39% F1 score.
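
To make the word-level framing concrete, the following is a minimal Python sketch of what the output of such a tagger might be used for. It is not the authors' implementation: the tag names "ur" and "en", the reading of "localization" as grouping consecutive same-language words into spans, and the macro-averaged F1 are all assumptions, since the abstract does not specify them. The example sentence and its tags are written by hand for illustration and are not model output or data from the paper.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

# (start index, end index exclusive, language tag); the tag names are hypothetical.
Span = Tuple[int, int, str]


def localize(labels: Sequence[str]) -> List[Span]:
    """Merge runs of identical word-level tags into contiguous language spans."""
    spans: List[Span] = []
    start = 0
    for tag, run in groupby(labels):
        end = start + sum(1 for _ in run)
        spans.append((start, end, tag))
        start = end
    return spans


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of words whose predicted tag matches the gold tag."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-tag F1 (the averaging scheme is an assumption)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tags = sorted(set(gold) | set(pred))
    if not tags:
        return 0.0
    scores = []
    for tag in tags:
        tp = sum(g == tag and p == tag for g, p in zip(gold, pred))
        fp = sum(g != tag and p == tag for g, p in zip(gold, pred))
        fn = sum(g == tag and p != tag for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hand-written illustrative sentence; not taken from the paper's dataset.
tokens = ["mujhe", "yeh", "movie", "bohat", "pasand", "hai",
          "but", "ending", "was", "weak"]
gold = ["ur", "ur", "en", "ur", "ur", "ur", "en", "en", "en", "en"]

# Simulate a tagger that mislabels the embedded English word "movie".
pred = list(gold)
pred[2] = "ur"

for start, end, tag in localize(gold):
    print(tag, " ".join(tokens[start:end]))

print("accuracy:", accuracy(gold, pred))
print("macro F1:", macro_f1(gold, pred))
```

In this toy case a single mislabeled word merges three gold spans into one predicted span, which suggests why word-level accuracy and span-level localization can diverge on short embedded switches.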