Sequence-driven Neural Network models for NER Tagging in Roman Urdu

2022 International Conference on Frontiers of Information Technology (FIT), Pub Date: 2022-12-01, DOI: 10.1109/FIT57066.2022.00040

Maaz Ali Nadeem, Khadija Irfan, Khaula Atiq, M. O. Beg, Muhammad Umair Arshad


Abstract

Natural Language Processing research has advanced rapidly as it begins to address contextual sequence labeling for low-resource languages. Named-Entity Recognition (NER) is one such labeling task, in which text is read in context and its named entities are labeled. NER for Roman Urdu supports tasks such as Information Extraction, Machine Translation, and even big data operations on live digital content. Research on such NLP applications in Roman Urdu has been limited; however, work on Urdu and other languages of the family encourages active research. This paper compares several deep learning-based models that learn to classify each word by mapping it to a specific context based on its placement. Our model is trained on a hand-annotated corpus covering several domains. After a detailed comparison and evaluation, Bi-LSTM yields a high F1-score of 82.7%. Our work demonstrates the possibility of long-range contextual understanding for processing morphologically rich low-resource languages.
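To make the reported metric concrete, the following is a minimal sketch of how an F1-score can be computed for NER output. It assumes BIO-style tags and exact-match scoring over entity spans; the abstract does not state the paper's tagging scheme, tag set, or whether its F1 is span-level or token-level, so the `PER`/`LOC` labels, the example sentence, and the function names `extract_spans` and `span_f1` are all illustrative assumptions.

```python
from typing import List, Set, Tuple

# (start index, end index exclusive, entity type); a hypothetical representation
Span = Tuple[int, int, str]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence.

    An I- tag that does not continue a span of the same type is
    treated like O (an assumption; other conventions exist).
    """
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside:
            if label is not None:
                spans.add((start, i, label))  # close the open span
            if tag.startswith("B-"):
                start, label = i, tag[2:]     # open a new span
            else:
                start, label = None, None
    return spans


def span_f1(gold: List[str], pred: List[str]) -> float:
    """Exact-match F1 over entity spans for one tagged sentence."""
    g, p = extract_spans(gold), extract_spans(pred)
    tp = len(g & p)  # spans with identical boundaries and type
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical Roman Urdu sentence, not taken from the paper's corpus.
tokens = ["Ali", "Lahore", "mein", "rehta", "hai"]
gold = ["B-PER", "B-LOC", "O", "O", "O"]
pred = ["B-PER", "O", "O", "O", "O"]  # the tagger misses the location

print(round(span_f1(gold, pred), 4))
```

Here the prediction recovers one of the two gold entities, so precision is 1.0 and recall is 0.5. A corpus-level score would normally pool the span counts over all sentences before computing precision and recall, rather than averaging per-sentence values.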
