Automatic Stopword Identification Technique for Gujarati Text

2021 International Conference on Artificial Intelligence and Machine Vision (AIMV). Pub Date: 2021-09-24. DOI: 10.1109/aimv53313.2021.9670968

Dhara J. Ladani, Nikita Paritosh Desai


Citations: 0

Abstract



Natural Language Processing (NLP) is an Artificial Intelligence (AI) technique that allows computers to analyze, comprehend, and derive meaning from human language. In natural language text processing, common words such as ‘a’, ‘the’, ‘is’, and ‘an’ are known as stopwords and are typically considered to carry no informative value. One of the major reported benefits of removing stopwords in text-based NLP is a 35 to 45% reduction in the size of the corpus text without compromising the performance of the target application. Many stopword lists exist for non-Indian languages such as English, Arabic, French, and German, and substantial lists are available even for a few Indian languages such as Hindi, Sanskrit, and Tamil. To date, however, very little research has been reported for Gujarati, one of the widely used Indian languages. According to our survey, two major approaches have been suggested for stopword identification in Gujarati: a static, generic stopword list and a rule-based approach. The major drawback of these methods is their inability to handle neologisms. In this paper, we propose a domain-specific, robust, and dynamic stopword list identification mechanism for documents written in Gujarati. In the proposed approach, we take the top "N" words by frequency as seed words and then add "M" further words that occur in similar contexts, identified using word embeddings. We evaluated the effectiveness of removing these (N+M) stopwords by applying stopword removal as a preprocessing phase in Text Classification (TC) and Information Retrieval (IR) applications. In the TC model, the feature vector shrank by approximately 16% while the accuracy of the model increased by nearly 3%. In the IR application, removing these stopwords increased the Mean Average Precision (MAP) of the system by nearly 31%. Thus, overall time and space requirements decreased without compromising the end results of the system.
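
The seed-and-expand idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which embedding model was used or how the "M" words are ranked, so the sketch makes several assumptions. Simple co-occurrence count vectors stand in for trained word embeddings, a candidate is scored by its highest cosine similarity to any seed word, and the function names (`cooccurrence_vectors`, `cosine`, `build_stopword_list`), the `window` parameter, and the English toy corpus are all hypothetical.

```python
import math
from collections import Counter


def cooccurrence_vectors(docs, window=2):
    """Assumed stand-in for word embeddings: count each word's neighbours."""
    vectors = {}
    for tokens in docs:
        for i, word in enumerate(tokens):
            context = vectors.setdefault(word, Counter())
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    context[tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_stopword_list(docs, n_seeds, m_extra):
    """Pick the top-N frequent words as seeds, then add M context-similar words."""
    freq = Counter(token for doc in docs for token in doc)
    seeds = [word for word, _ in freq.most_common(n_seeds)]
    seed_set = set(seeds)
    vectors = cooccurrence_vectors(docs)
    # Assumed ranking rule: best similarity to any seed word.
    scores = {
        word: max((cosine(vec, vectors[s]) for s in seeds), default=0.0)
        for word, vec in vectors.items()
        if word not in seed_set
    }
    # Sort by descending score; break ties alphabetically for determinism.
    extra = sorted(scores, key=lambda w: (-scores[w], w))[:m_extra]
    return seeds, extra


# Toy English corpus used only for illustration (already tokenized).
docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "a cat and a dog".split(),
]
seeds, extra = build_stopword_list(docs, n_seeds=1, m_extra=2)
stopwords = set(seeds) | set(extra)
# Stopword removal as a preprocessing step before TC or IR.
filtered = [[t for t in doc if t not in stopwords] for doc in docs]
print(seeds, extra)
```

Because the list is derived from the corpus itself, rerunning the procedure on a new domain yields a different (N+M) list, which is the sense in which such a list is dynamic and domain-specific. In practice the count vectors would be replaced by embeddings trained on a Gujarati corpus.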
