Enhanced Tokenizer for Sinhala Language
S. Y. Senanayake, K. Kariyawasam, P. Haddela
2019 National Information Technology Conference (NITC), published 2019-10-01
DOI: 10.1109/NITC48475.2019.9114420

Abstract: The tokenization process plays a prominent role in natural language processing (NLP) applications: it chops content into the smallest meaningful units. However, only a limited number of tokenization approaches exist for the Sinhala language. The standard analyzer in the Apache software library and the Natural Language Toolkit (NLTK) are the main existing tools for tokenizing Sinhala content. Since these tools are language independent, they have limitations when applied to Sinhala. Our proposed Sinhala tokenizer focuses on punctuation-based tokenization: it tokenizes content precisely by identifying the role each punctuation mark plays. In our research, we show that our punctuation-based tokenization approach outperforms the word tokenization of existing approaches.
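The core idea of punctuation-based tokenization can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: the abbreviation list, the punctuation set, and the regex are assumptions introduced only to show how a tokenizer might decide whether a period ends a sentence or belongs to the preceding word.

```python
import re

# Hypothetical abbreviation list -- a real Sinhala tokenizer would use a
# language-specific lexicon of abbreviations instead.
ABBREVIATIONS = {"Dr.", "Mr.", "St."}

def tokenize(text):
    """Split text on whitespace, then peel trailing punctuation into
    separate tokens unless the chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        # A period that is part of a known abbreviation stays attached.
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Otherwise, separate any trailing punctuation from the word.
        m = re.match(r"^(.*?)([.,!?;:]*)$", chunk)
        word, punct = m.group(1), m.group(2)
        if word:
            tokens.append(word)
        tokens.extend(punct)  # each trailing punctuation char becomes a token
    return tokens
```

For example, `tokenize("Dr. Silva arrived.")` keeps the abbreviation period attached ("Dr.") while splitting the sentence-final period into its own token. The same use-case distinction would apply to Sinhala punctuation and abbreviation conventions.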