Word-like character n-gram embedding

NUT@EMNLP | Pub Date: 2018-11-01 | DOI: 10.18653/v1/W18-6120
Geewook Kim, Kazuki Fukui, Hidetoshi Shimodaira
{"title":"Word-like character n-gram embedding","authors":"Geewook Kim, Kazuki Fukui, Hidetoshi Shimodaira","doi":"10.18653/v1/W18-6120","DOIUrl":null,"url":null,"abstract":"We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solved this problem by introducing an idea of expected word frequency. Compared to the previously proposed methods, our method can embed more words, along with the words that are not included in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the text in the corpus is in unsegmented language and contains many neologisms and informal words (e.g., Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method can embed more words and improve the performance of downstream tasks.","PeriodicalId":207795,"journal":{"name":"NUT@EMNLP","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"NUT@EMNLP","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/W18-6120","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5

Abstract

We propose a new word embedding method called word-like character n-gram embedding, which learns distributed representations of words by embedding word-like character n-grams. Our method is an extension of the recently proposed segmentation-free word embedding, which directly embeds frequent character n-grams from a raw corpus. However, its n-gram vocabulary tends to contain too many non-word n-grams. We solve this problem by introducing the idea of expected word frequency. Compared with previously proposed methods, our method can embed more words, including words that are not contained in a given basic word dictionary. Since our method does not rely on word segmentation with rich word dictionaries, it is especially effective when the corpus text is in an unsegmented language and contains many neologisms and informal words (e.g., a Chinese SNS dataset). Our experimental results on Sina Weibo (a Chinese microblog service) and Twitter show that the proposed method can embed more words and improve the performance of downstream tasks.
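The abstract names the expected-word-frequency (EWF) idea but does not define the estimator, so the following is only a minimal sketch of the vocabulary-construction step it describes: enumerate frequent character n-grams from a raw, unsegmented corpus, then keep those whose occurrences are likely to be segmented out as words. The boundary model here (a toy independent-boundary estimate driven by a small seed lexicon), along with the names SEED_WORDS, CORPUS, and boundary_prob and all thresholds, are illustrative assumptions, not the paper's actual method.

```python
from collections import Counter

SEED_WORDS = {"微博", "晚饭", "今天"}        # assumed tiny seed lexicon
CORPUS = "今天晚饭吃什么今天微博上很热闹"     # toy unsegmented corpus

def char_ngrams(text, n_max=3):
    """All character n-grams of length 1..n_max with raw corpus frequencies."""
    counts = Counter()
    for i in range(len(text)):
        for n in range(1, n_max + 1):
            if i + n <= len(text):
                counts[text[i:i + n]] += 1
    return counts

def boundary_prob(text, pos, seed=SEED_WORDS):
    """Toy estimate of P(word boundary at pos): certain at the corpus edges,
    high where a seed word ends or starts, a low base rate elsewhere."""
    if pos == 0 or pos == len(text):
        return 1.0
    for w in seed:
        ends_here = pos >= len(w) and text[pos - len(w):pos] == w
        starts_here = text[pos:pos + len(w)] == w
        if ends_here or starts_here:
            return 0.9
    return 0.3  # assumed base rate for an arbitrary position

def expected_word_frequency(text, ngram):
    """Sum, over all occurrences of `ngram`, of the probability that the
    occurrence is segmented out as a word (independent-boundary toy model)."""
    ewf, start = 0.0, text.find(ngram)
    while start != -1:
        ewf += boundary_prob(text, start) * boundary_prob(text, start + len(ngram))
        start = text.find(ngram, start + 1)
    return ewf

counts = char_ngrams(CORPUS)
# Keep n-grams that are both frequent AND word-like (high EWF). Frequency
# alone would also keep the single characters 今 and 天 (each occurs twice);
# the EWF filter keeps only the word-like bigram 今天 in this toy corpus.
vocab = [g for g, c in counts.items()
         if c >= 2 and expected_word_frequency(CORPUS, g) >= 1.0]
print(sorted(vocab))
```

The surviving n-grams would then be embedded directly (as in the segmentation-free baseline the abstract extends), so that words absent from the basic dictionary, such as neologisms on Weibo, can still receive vectors.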