Better Pretrained Embedding with Convolutional Neural Networks for Morphological Stemming

Y. Oo, K. Soe
International Conference on Artificial Intelligence and Virtual Reality, published 2019-07-27
DOI: 10.1145/3348488.3348499 (https://doi.org/10.1145/3348488.3348499)
Citations: 1

Abstract

Words are usually treated as independent entities, with no direct relationship among morphologically related words. As a result, some rare words are poorly estimated and unknown words are represented by only a few vectors. Stemming reduces different word forms to a common morphological root. Word embeddings generalize well to unseen words and can capture general syntactic as well as semantic properties of words. Furthermore, deep learning approaches have become increasingly prominent in NLP tasks, and pre-trained embedding layers have been applied to improve the performance of neural network architectures in many NLP applications. Word segmentation for the Myanmar language, as for most Asian languages, is a vital task and a widely studied sequence labeling problem. Normally, stemming is treated as a process separate from segmentation. In this paper, the proposed approach indicates segmentation boundaries while it performs stemming. The paper proposes several word representations at the character and syllable levels and incorporates them into a convolutional neural network (CNN-based) model that jointly learns stemming and segmentation boundaries in parallel. It also evaluates the performance of the convolutional neural network with different pre-trained embeddings. According to the experimental results, the choice of pre-trained embedding has a large effect on performance.
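To make the joint-labeling idea concrete, the following is a minimal sketch of how a single convolutional feature extractor over embedded units can feed two per-position classifiers, one for segmentation boundaries and one for stem membership. It is not the authors' implementation: the abstract does not give the architecture, label sets, or hyperparameters, so the two-label schemes (boundary / inside, stem / affix), the filter width of 3, the randomly initialized "pre-trained" table, and all function names here are assumptions chosen for illustration. The weights are untrained, so the predicted labels are meaningless; only the data flow is the point.

```python
import random

def make_embedding(vocab, dim, rng):
    """Hypothetical stand-in for a pre-trained table: unit -> vector."""
    return {u: [rng.uniform(-1, 1) for _ in range(dim)] for u in sorted(vocab)}

def conv1d(seq, kernels, width=3):
    """Same-padded 1D convolution with ReLU over a sequence of vectors."""
    dim = len(seq[0])
    pad = width // 2
    zero = [0.0] * dim
    padded = [zero] * pad + seq + [zero] * pad
    out = []
    for i in range(len(seq)):
        # Flatten the window of `width` neighbouring unit vectors.
        window = [x for vec in padded[i:i + width] for x in vec]
        out.append([max(0.0, sum(w * x for w, x in zip(k, window)))
                    for k in kernels])
    return out

def head(features, weights):
    """Per-position linear classifier; returns the argmax label index."""
    labels = []
    for f in features:
        scores = [sum(w * x for w, x in zip(row, f)) for row in weights]
        labels.append(scores.index(max(scores)))
    return labels

def joint_tagger(units, emb, kernels, seg_w, stem_w):
    """Shared CNN features feed two heads: segmentation and stemming."""
    feats = conv1d([emb[u] for u in units], kernels)
    return head(feats, seg_w), head(feats, stem_w)

# Demo with placeholder units (assumed; real input would be Myanmar
# characters or syllables) and random, untrained weights.
rng = random.Random(0)
units = ["u1", "u2", "u3", "u1"]
dim, n_filters, width = 4, 5, 3
emb = make_embedding(set(units), dim, rng)
kernels = [[rng.uniform(-1, 1) for _ in range(dim * width)]
           for _ in range(n_filters)]
seg_w = [[rng.uniform(-1, 1) for _ in range(n_filters)] for _ in range(2)]
stem_w = [[rng.uniform(-1, 1) for _ in range(n_filters)] for _ in range(2)]
seg_labels, stem_labels = joint_tagger(units, emb, kernels, seg_w, stem_w)
print(seg_labels, stem_labels)
```

In this sketch, comparing pre-trained embeddings would amount to swapping the table returned by `make_embedding` for vectors loaded from different sources while keeping the rest of the model fixed.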