Dynamic Stopword Removal for Sinhala Language

A.A.V.A Jayaweera, Y.N Senanayake, P. Haddela
Published in: 2019 National Information Technology Conference (NITC), October 2019
DOI: 10.1109/NITC48475.2019.9114476
Citations: 4

Abstract

In the modern era of information retrieval, text summarization, and text analytics, redundant (noise) words that carry little information and low or no semantic meaning must be filtered out. Such words are known as stopwords. More than 40 languages have identified their language-specific stopwords, and most researchers use various techniques to build these stopword lists. However, most of them define an arbitrary cut-off point for the list without any justification. This research aims to show that the cut-off point depends on the source data and the machine learning algorithm, which is demonstrated using Newton's iteration method, a root-finding algorithm. To achieve this, the research focuses on creating a stopword list for the Sinhala language using a term frequency-based method, processing more than 90,000 Sinhala documents. This paper presents the results obtained and the new datasets prepared for text preprocessing.
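The two building blocks named in the abstract — ranking words by corpus frequency to get stopword candidates, and Newton's root-finding iteration for locating a cut-off point — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tokenizer, the example corpus, and the stand-in function passed to the Newton step are all hypothetical, since the actual cut-off equation depends on the source data and the classifier used.

```python
from collections import Counter

def top_frequency_words(docs, k):
    """Rank words by total corpus frequency; the most frequent
    words are the stopword candidates (whitespace tokenization
    is a simplification here)."""
    counts = Counter(w for doc in docs for w in doc.split())
    return [w for w, _ in counts.most_common(k)]

def newton_root(f, df, x0, tol=1e-6, max_iter=50):
    """Generic Newton iteration: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy corpus (hypothetical); in the paper this would be 90,000+ Sinhala documents.
docs = ["a a a b b c", "a b d", "a c c e"]
candidates = top_frequency_words(docs, 3)   # -> ['a', 'b', 'c']

# Stand-in cut-off equation: f(x) = x^2 - 2. The paper's actual f would be
# derived from the frequency curve and classifier performance.
cutoff = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

The key point the sketch mirrors is that the cut-off is not a fixed magic number: it is the root of a data-dependent function, so it changes whenever the corpus or the learning algorithm changes.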