Conditional Random Fields based Named Entity Recognition for Sinhala

K. Senevirathne, N. Attanayake, A. W. M. H. Dhananjanie, W. A. S. U. Weragoda, A. Nugaliyadde, S. Thelijjagoda
2015 IEEE 10th International Conference on Industrial and Information Systems (ICIIS), published 2015-12-01
DOI: 10.1109/ICIINFS.2015.7399028 (https://doi.org/10.1109/ICIINFS.2015.7399028)
Citations: 10

Abstract

Named Entity Recognition (NER) plays an important role in Natural Language Processing (NLP). Named Entities (NEs) are atomic elements of natural language that belong to predefined categories such as persons, organizations, locations, expressions of time, quantities, monetary values and percentages. They refer to specific things and are not listed in grammars or lexicons. NER is the task of identifying such NEs, and it comes with a number of challenges. Entities may be difficult to find at first and, once found, difficult to classify. For instance, a location and a person can share the same name and follow similar formatting. The task is harder still for South and South East Asian languages, mainly because of the nature of these languages. Even though Latin languages have accurate NER solutions, those solutions cannot be applied directly to Indic languages, because the features found in Indic languages differ from those of English. This research therefore focused on producing a mathematical model that acts as the integral part of a Sinhala NER system. The researchers used a Sinhala news corpus as the data set to train the Conditional Random Fields (CRFs) algorithm: 90% of the corpus was used to train the model and 10% to test the resulting model. The research makes use of orthographic word-level features along with contextual information, which are helpful in predicting three NE classes, namely Persons, Locations and Organizations. The findings were applied in developing the NE Annotator, which identifies NE classes in unstructured Sinhala text. This contribution could benefit Sinhala NLP application developers and NLP researchers in the near future.
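
The abstract does not list the exact features or tooling used, so the following Python sketch is only an illustration of the general setup it describes: word-level orthographic features combined with contextual (neighbouring-word) information, and a 90%/10% train/test split. The function names (word_features, sentence_features, split_corpus), the specific features (prefixes, suffixes, length, digit check) and the window size are all assumptions made for this example, not the authors' implementation. The per-token dictionary layout is the kind of input that CRF toolkits commonly accept.

```python
import random


def word_features(tokens, i, window=1):
    """Build a feature dict for tokens[i].

    Hypothetical feature set: orthographic word-level cues plus the
    neighbouring words inside `window`. Not taken from the paper.
    """
    word = tokens[i]
    feats = {
        "bias": 1.0,
        "word": word,
        # Orthographic cues. Slicing works on code points, so for Sinhala
        # text a prefix/suffix may cut through a base letter + vowel sign.
        "prefix2": word[:2],
        "suffix2": word[-2:],
        "suffix3": word[-3:],
        "length": len(word),
        "has_digit": any(ch.isdigit() for ch in word),
    }
    # Contextual information: surrounding tokens, or sentence-boundary flags.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j]
        elif j < 0:
            feats["BOS"] = True  # beginning of sentence
        else:
            feats["EOS"] = True  # end of sentence
    return feats


def sentence_features(tokens, window=1):
    """Feature dicts for every token in one sentence."""
    return [word_features(tokens, i, window) for i in range(len(tokens))]


def split_corpus(sentences, train_fraction=0.9, seed=0):
    """Shuffle a copy of the corpus and split it into train and test parts."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy placeholder sentence; real input would be tokenised Sinhala news text
    # paired with per-token labels for Persons, Locations and Organizations.
    toy_tokens = ["tok1", "tok22", "tok333"]
    for feats in sentence_features(toy_tokens):
        print(feats)
```

In such a setup the feature dictionaries for the training sentences, together with their per-token labels, would be passed to a CRF trainer, and the held-out 10% would be used to measure how well the three NE classes are predicted.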