{"title":"利用 BERT 嵌入、数据增强和混合编码器-CNN 架构丰富乌尔都语 NER","authors":"Anil Ahmed, Degen Huang, Syed Yasser Arafat, Imran Hameed","doi":"10.1145/3648362","DOIUrl":null,"url":null,"abstract":"<p>Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages like English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages like Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from language models (ELMo). In particular, fine-tuning is performed on BERT<sub>BASE</sub> and ELMo using Urdu Wikipedia and news articles. Secondly, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens, employing pre-trained masked language models to predict masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-Score of 93.99%, highlighting its efficacy for the U-NER task.</p>","PeriodicalId":54312,"journal":{"name":"ACM Transactions on Asian and Low-Resource Language Information Processing","volume":"223 1","pages":""},"PeriodicalIF":1.8000,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Enriching Urdu NER with BERT Embedding, Data Augmentation, and Hybrid Encoder-CNN Architecture\",\"authors\":\"Anil Ahmed, Degen Huang, Syed Yasser Arafat, Imran Hameed\",\"doi\":\"10.1145/3648362\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Named Entity Recognition (NER) is an indispensable component of Natural Language Processing (NLP), which aims to identify and classify entities within text data. While Deep Learning (DL) models have excelled in NER for well-resourced languages like English, Spanish, and Chinese, they face significant hurdles when dealing with low-resource languages like Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity, context-dependent lexicon, and the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and introducing three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from language models (ELMo). In particular, fine-tuning is performed on BERT<sub>BASE</sub> and ELMo using Urdu Wikipedia and news articles. 
Secondly, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens, employing pre-trained masked language models to predict masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model combining a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and an innovative DA approach, attains the highest F1-Score of 93.99%, highlighting its efficacy for the U-NER task.</p>\",\"PeriodicalId\":54312,\"journal\":{\"name\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"volume\":\"223 1\",\"pages\":\"\"},\"PeriodicalIF\":1.8000,\"publicationDate\":\"2024-02-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Asian and Low-Resource Language Information Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1145/3648362\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Asian and Low-Resource Language Information Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3648362","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Named Entity Recognition (NER), the task of identifying and classifying entities in text data, is an indispensable component of Natural Language Processing (NLP). While Deep Learning (DL) models excel at NER for well-resourced languages such as English, Spanish, and Chinese, they face significant hurdles with low-resource languages such as Urdu. These challenges stem from the intricate linguistic characteristics of Urdu, including morphological diversity and a context-dependent lexicon, as well as the scarcity of training data. This study addresses these issues by focusing on Urdu Named Entity Recognition (U-NER) and making three key contributions. First, various pre-trained embedding methods are employed, encompassing Word2vec (W2V), GloVe, FastText, Bidirectional Encoder Representations from Transformers (BERT), and Embeddings from Language Models (ELMo); in particular, BERT-Base and ELMo are fine-tuned on Urdu Wikipedia and news articles. Second, a novel generative Data Augmentation (DA) technique replaces Named Entities (NEs) with mask tokens and employs pre-trained masked language models to predict the masked tokens, effectively expanding the training dataset. Finally, the study introduces a novel hybrid model that combines a Transformer Encoder with a Convolutional Neural Network (CNN) to capture the intricate morphology of Urdu. These modules enable the model to handle polysemy, extract short- and long-range dependencies, and enhance learning capacity. Empirical experiments demonstrate that the proposed model, incorporating BERT embeddings and the novel DA approach, attains the highest F1-score, 93.99%, highlighting its efficacy for the U-NER task.
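To make the first contribution concrete, the following is a minimal sketch of extracting contextual BERT embeddings for an Urdu sentence with the Hugging Face transformers library. The paper fine-tunes BERT-Base on Urdu Wikipedia and news text; since that checkpoint is not named in the abstract, the multilingual `bert-base-multilingual-cased` model is used here purely as a stand-in assumption.

```python
# Minimal sketch: contextual BERT embeddings for an Urdu sentence.
# "bert-base-multilingual-cased" is a hypothetical stand-in for the
# paper's Urdu fine-tuned BERT-Base checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentence = "انیل احمد لاہور میں رہتا ہے"  # "Anil Ahmed lives in Lahore"
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per WordPiece token: (batch, seq_len, 768) for BERT-Base.
token_embeddings = outputs.last_hidden_state
print(token_embeddings.shape)
```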
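The mask-and-predict augmentation can be sketched in the same setting: entity tokens in a BIO-tagged sentence are replaced with [MASK], a masked language model fills them in, and the original tag sequence is reused for the generated sentence. The checkpoint, the example tags, and top-1 decoding are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of mask-based data augmentation for NER, assuming
# BIO-tagged data and one mask per entity word. The checkpoint is a
# stand-in for the paper's Urdu masked language model.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

name = "bert-base-multilingual-cased"  # assumed stand-in checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
mlm = AutoModelForMaskedLM.from_pretrained(name).eval()

tokens = ["انیل", "لاہور", "میں", "رہتا", "ہے"]
tags = ["B-PER", "B-LOC", "O", "O", "O"]  # illustrative BIO labels

# Replace entity words with the mask token; non-entities are kept as-is.
masked = [tokenizer.mask_token if t != "O" else w for w, t in zip(tokens, tags)]
inputs = tokenizer(" ".join(masked), return_tensors="pt")

with torch.no_grad():
    logits = mlm(**inputs).logits

# Fill each mask with the MLM's top-1 prediction; the augmented sentence
# reuses the original tag sequence, so no re-annotation is needed.
ids = inputs["input_ids"][0].tolist()
for pos, tid in enumerate(ids):
    if tid == tokenizer.mask_token_id:
        ids[pos] = int(logits[0, pos].argmax())
print(tokenizer.decode(ids, skip_special_tokens=True))
```

In practice one would likely sample from the top-k predictions to generate several augmented variants per sentence and filter out predictions that merely copy the original entity; those refinements are omitted here for brevity.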
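Finally, a minimal PyTorch sketch of the hybrid idea: a Transformer encoder models long-range context while a 1-D CNN captures local morphology, and the two feature streams are fused for per-token tag prediction. The layer sizes, depth, and concatenation-based fusion are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch of a hybrid Transformer-Encoder + CNN tagger.
# All hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class HybridEncoderCNN(nn.Module):
    def __init__(self, emb_dim=768, num_tags=9, kernel=3, channels=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.cnn = nn.Conv1d(emb_dim, channels, kernel, padding=kernel // 2)
        self.classifier = nn.Linear(emb_dim + channels, num_tags)

    def forward(self, embeddings):       # (batch, seq_len, emb_dim)
        ctx = self.encoder(embeddings)   # global, long-range features
        # Conv1d wants (batch, channels, seq): local n-gram/morphology features.
        loc = self.cnn(embeddings.transpose(1, 2)).transpose(1, 2)
        fused = torch.cat([ctx, torch.relu(loc)], dim=-1)
        return self.classifier(fused)    # (batch, seq_len, num_tags)

# Usage: BERT embeddings (e.g., from the first sketch) in, BIO tag scores out.
model = HybridEncoderCNN()
scores = model(torch.randn(2, 12, 768))
print(scores.shape)  # torch.Size([2, 12, 9])
```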
Journal introduction:
The ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP) publishes high-quality original archival papers and technical notes in the areas of computation and processing of information in Asian languages, low-resource languages of Africa, Australasia, Oceania, and the Americas, as well as related disciplines. The subject areas covered by TALLIP include, but are not limited to:
-Computational Linguistics: including computational phonology, computational morphology, computational syntax (e.g. parsing), computational semantics, computational pragmatics, etc.
-Linguistic Resources: including computational lexicography, terminology, electronic dictionaries, cross-lingual dictionaries, electronic thesauri, etc.
-Hardware and software algorithms and tools for Asian or low-resource language processing, e.g., handwritten character recognition.
-Information Understanding: including text understanding, speech understanding, character recognition, discourse processing, dialogue systems, etc.
-Machine Translation involving Asian or low-resource languages.
-Information Retrieval: including natural language processing (NLP) for concept-based indexing, natural language query interfaces, semantic relevance judgments, etc.
-Information Extraction and Filtering: including automatic abstraction, user profiling, etc.
-Speech processing: including text-to-speech synthesis and automatic speech recognition.
-Multimedia Asian Information Processing: including speech, image, video, image/text translation, etc.
-Cross-lingual information processing involving Asian or low-resource languages.
-Papers that deal with theory, systems design, evaluation, and applications in the aforesaid subjects are appropriate for TALLIP. Emphasis will be placed on the originality and the practical significance of the reported research.