DanfeNER - Named Entity Recognition in Nepali Tweets

Nobal B. Niraula, Jeevan Chapagain
Published in The International FLAIRS Conference Proceedings, 2023-05-08. DOI: 10.32473/flairs.36.133384 (https://doi.org/10.32473/flairs.36.133384)

Abstract

Twitter allows users to easily post tweets on any subject or event anytime, generating massive amounts of rich text content on diverse topics. Automated methods such as Named Entity Recognition (NER) are required to process the massive tweet data. Processing tweets, however, poses a special challenge as they are informal posts with incomplete context and often contain acronyms, hashtags, misspellings, abbreviations, and URLs due to length constraints. This paper presents the first systematic study of NER in Nepali tweets corresponding to five different entity types: Person Name (PER), Location (LOC), Organization (ORG), Date (DAT), and Event (EVT). We develop DanfeNER, the first human-labeled high-quality NER benchmark data sets for the low-resource language Nepali. DanfeNER contains 5,366 records and 3,463 entities in its train set and 2,301 records and 1,503 entities in its test set. Using this data set, we benchmark several state-of-the-art Nepali monolingual and multilingual transformer models, obtaining micro-averaged F1 scores up to 81%.
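The micro-averaged F1 reported above pools true positives, false positives, and false negatives across all five entity types before computing precision and recall. The following is a minimal sketch of that calculation, not the authors' evaluation code. It assumes BIO-style tags (for example `B-PER`, `I-PER`, `O`) and exact span matching, neither of which the abstract specifies; the function names `extract_spans` and `micro_f1` are hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence.

    Assumption: tags look like "B-PER", "I-PER", "O". An I- tag that does not
    continue a span of the same type is ignored.
    """
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def micro_f1(gold_sentences, pred_sentences):
    """Micro-averaged F1: counts are pooled over all sentences and entity types."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        g, p = extract_spans(gold), extract_spans(pred)
        tp += len(g & p)   # spans predicted with exact boundaries and type
        fp += len(p - g)   # predicted spans not in the gold annotation
        fn += len(g - p)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy data for illustration only, not taken from DanfeNER.
    gold = [["B-PER", "I-PER", "O", "B-LOC"], ["B-ORG", "O", "B-DAT"]]
    pred = [["B-PER", "I-PER", "O", "O"], ["B-ORG", "O", "B-EVT"]]
    print(f"micro-F1 = {micro_f1(gold, pred):.4f}")
```

Because the counts are pooled, frequent entity types weigh more heavily in the score than rare ones; a macro average would instead weight each type equally.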