Enhanced Tokenizer for Sinhala Language

S. Y. Senanayake, K. Kariyawasam, P. Haddela
Published in: 2019 National Information Technology Conference (NITC), 2019-10-01
DOI: 10.1109/NITC48475.2019.9114420 (https://doi.org/10.1109/NITC48475.2019.9114420)
Citations: 3

Abstract

Tokenization plays a prominent role in natural language processing (NLP) applications: it splits content into the smallest meaningful units. However, few tokenization approaches exist for the Sinhala language. The standard analyzer in the Apache software library and the Natural Language Toolkit (NLTK) are the main existing approaches for tokenizing Sinhala content. Because these tools are language independent, they have limitations when applied to Sinhala. Our proposed Sinhala tokenizer focuses mainly on punctuation-based tokenization. It tokenizes content precisely by identifying the use case of each punctuation mark. In our research, we show that this punctuation-based approach outperforms the word tokenization of the existing approaches.
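The abstract does not list the tokenizer's rules, so the following is only a minimal sketch of what deciding the "use case" of a punctuation mark might look like. The rule set (keep a mark inside numbers and dotted abbreviations, split it off everywhere else) and the names `SINHALA`, `WORD`, `TOKEN_RE`, and `tokenize` are illustrative assumptions, not the authors' implementation.

```python
import re

# Sinhala Unicode block plus the zero-width (non-)joiners used inside conjuncts.
# The range is spelled out because Python's \w does not match the combining
# vowel signs, which would split Sinhala words in the middle.
SINHALA = r"\u0D80-\u0DFF\u200C\u200D"
WORD = rf"[A-Za-z{SINHALA}]+"

# Hypothetical rules: the alternatives are tried in order at each position.
TOKEN_RE = re.compile(
    rf"""
      \d+(?:[.,:/-]\d+)*        # mark inside a number, time or date stays: 3.14, 10:30
    | {WORD}(?:\.{WORD})+\.?    # dots inside a dotted abbreviation stay: U.S.A.
    | {WORD}                    # plain word
    | \S                        # any other mark becomes its own token
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the tokens of `text` under the assumed punctuation rules."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Pi is 3.14."))   # ['Pi', 'is', '3.14', '.']
    print(tokenize("මම ගෙදර යමි."))
```

In this sketch the same full stop is treated three ways: kept inside `3.14`, kept inside `U.S.A.`, and split off as a separate token at the end of a sentence. A real rule set for Sinhala would need more cases than these.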