Dynamic Stopword Removal for Sinhala Language

A.A.V.A Jayaweera, Y.N Senanayake, P. Haddela
Published in: 2019 National Information Technology Conference (NITC), October 2019
DOI: 10.1109/NITC48475.2019.9114476
Citations: 4

Abstract

In the modern era of information retrieval, text summarization, and text analytics, redundant (noise) words that carry little information and low or no semantic meaning must be filtered out. Such words are known as stopwords. More than 40 languages have identified their language-specific stopwords, and most researchers use various techniques to build these stopword lists. However, most of them define an arbitrary cut-off point for the list without any justification. This research aims to show that the cut-off point depends on the source data and the machine learning algorithm, which is demonstrated using Newton's iteration method, a root-finding algorithm. To achieve this, the research focuses on creating a stopword list for the Sinhala language using a term frequency-based method, processing more than 90,000 Sinhala documents. This paper presents the results obtained and the new datasets prepared for text preprocessing.
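The two building blocks named in the abstract — ranking words by corpus frequency to get stopword candidates, and Newton's root-finding iteration for locating a cut-off point — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tokenizer, the example corpus, and the stand-in function passed to the Newton step are all hypothetical, since the actual cut-off equation depends on the source data and the classifier used.

```python
from collections import Counter

def top_frequency_words(docs, k):
    """Rank words by total corpus frequency; the most frequent
    words are the stopword candidates (whitespace tokenization
    is a simplification here)."""
    counts = Counter(w for doc in docs for w in doc.split())
    return [w for w, _ in counts.most_common(k)]

def newton_root(f, df, x0, tol=1e-6, max_iter=50):
    """Generic Newton iteration: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        step = f(x) / df(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy corpus (hypothetical); in the paper this would be 90,000+ Sinhala documents.
docs = ["a a a b b c", "a b d", "a c c e"]
candidates = top_frequency_words(docs, 3)   # -> ['a', 'b', 'c']

# Stand-in cut-off equation: f(x) = x^2 - 2. The paper's actual f would be
# derived from the frequency curve and classifier performance.
cutoff = newton_root(lambda x: x * x - 2, lambda x: 2 * x, 1.0)
```

The key point the sketch mirrors is that the cut-off is not a fixed magic number: it is the root of a data-dependent function, so it changes whenever the corpus or the learning algorithm changes.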