Semantic representations for under-resourced languages

Jocelyn Mazarura, A. de Waal, J. de Villiers
{"title":"Semantic representations for under-resourced languages","authors":"Jocelyn Mazarura, A. D. Waal, J. D. Villiers","doi":"10.1145/3351108.3351133","DOIUrl":null,"url":null,"abstract":"Distributional semantics studies methods for learning semantic representation of natural text. The semantic similarity between words and documents can be derived from this presentation which leads to other practical NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots and machine translation. Under-resourced language data is small in size. Small data implies not only small corpora, but also short documents within the corpus. In this paper we investigate the performance of word embedding techniques on two under-resourced languages. We investigate two topic models, LDA and DMM as well as a word embedding word2vec. We find DMM to perform better than LDA as a topic model embedding. DMM and word2vec perform similar in a semantic evaluation task of aligned corpora.","PeriodicalId":269578,"journal":{"name":"Research Conference of the South African Institute of Computer Scientists and Information Technologists","volume":"136 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Conference of the South African Institute of Computer Scientists and Information Technologists","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3351108.3351133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Distributional semantics studies methods for learning semantic representations of natural text. The semantic similarity between words and documents can be derived from these representations, which underpins practical NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots, and machine translation. Under-resourced language data is small in size: small data implies not only small corpora, but also short documents within each corpus. In this paper we investigate the performance of semantic representation techniques on two under-resourced languages. We compare two topic models, LDA and DMM, as well as a word embedding model, word2vec. We find that DMM performs better than LDA as a topic-model representation, and that DMM and word2vec perform similarly in a semantic evaluation task on aligned corpora.
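To make the two families of representations concrete, here is a minimal sketch (not the authors' code) that builds both kinds on a toy corpus using gensim. The toy documents merely stand in for a small, short-document under-resourced-language corpus, and DMM is omitted because gensim ships no implementation of it.

```python
# Minimal sketch: a topic-model representation (LDA) versus a
# word-embedding representation (word2vec) on a toy corpus of
# short documents, using gensim. Not the authors' code.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, Word2Vec

# Toy stand-in for a small corpus of short documents.
docs = [
    ["water", "river", "flow"],
    ["river", "bank", "water"],
    ["bank", "money", "loan"],
    ["money", "loan", "interest"],
]

# Topic-model view: each document becomes a distribution over topics.
dictionary = Dictionary(docs)
bow = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)
print(lda.get_document_topics(bow[0]))  # e.g. [(0, 0.9), (1, 0.1)]

# Word-embedding view: each word becomes a dense vector.
w2v = Word2Vec(sentences=docs, vector_size=50, window=2, min_count=1, seed=0)
print(w2v.wv.most_similar("river"))  # nearest neighbours by cosine similarity
```

In the paper's setting, document similarity would be derived from the topic distributions, while word-level similarity comes from the word2vec vectors; the semantic evaluation on aligned corpora compares how well each representation preserves similarity across languages.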