{"title":"Semantic representations for under-resourced languages","authors":"Jocelyn Mazarura, A. D. Waal, J. D. Villiers","doi":"10.1145/3351108.3351133","DOIUrl":null,"url":null,"abstract":"Distributional semantics studies methods for learning semantic representation of natural text. The semantic similarity between words and documents can be derived from this presentation which leads to other practical NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots and machine translation. Under-resourced language data is small in size. Small data implies not only small corpora, but also short documents within the corpus. In this paper we investigate the performance of word embedding techniques on two under-resourced languages. We investigate two topic models, LDA and DMM as well as a word embedding word2vec. We find DMM to perform better than LDA as a topic model embedding. DMM and word2vec perform similar in a semantic evaluation task of aligned corpora.","PeriodicalId":269578,"journal":{"name":"Research Conference of the South African Institute of Computer Scientists and Information Technologists","volume":"136 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Conference of the South African Institute of Computer Scientists and Information Technologists","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3351108.3351133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Distributional semantics studies methods for learning semantic representations of natural text. The semantic similarity between words and documents can be derived from these representations, which in turn supports practical NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots, and machine translation. Under-resourced language data is small in size: small data implies not only small corpora, but also short documents within the corpus. In this paper we investigate the performance of word embedding techniques on two under-resourced languages. We investigate two topic models, LDA and DMM, as well as the word embedding model word2vec. We find that DMM performs better than LDA as a topic model embedding, and that DMM and word2vec perform similarly on a semantic evaluation task over aligned corpora.
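The abstract names word2vec and LDA as the embedding techniques compared; as a rough illustration of how such models can be trained on a small corpus and queried for semantic similarity, the sketch below uses the gensim library. The toy corpus, parameter values, and the word pair used in the similarity check are illustrative assumptions, not the paper's actual data or settings, and DMM is omitted because it is not part of gensim.

```python
# Illustrative sketch only (assumes gensim >= 4.0); the corpus, parameters
# and query words are made up for demonstration, not taken from the paper.
from gensim.models import Word2Vec, LdaModel
from gensim.corpora import Dictionary

# A stand-in "small corpus" of short, already-tokenised documents.
texts = [
    ["language", "model", "text"],
    ["topic", "model", "document"],
    ["word", "embedding", "text"],
    ["document", "word", "topic"],
]

# word2vec: dense word vectors; small vector_size/min_count because the corpus is tiny.
w2v = Word2Vec(sentences=texts, vector_size=50, window=3, min_count=1, workers=1, seed=1)
print(w2v.wv.similarity("text", "document"))  # cosine similarity between two word vectors

# LDA: per-document topic mixtures over a bag-of-words corpus.
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)
print(lda.get_document_topics(bow_corpus[0]))  # topic distribution of the first document
```

Because each toy document here has only a few tokens, LDA's per-document topic mixture has little evidence to estimate from; DMM instead assumes a single topic per document, which is the usual motivation for preferring it on short texts and is consistent with the abstract's finding that DMM outperforms LDA in this setting.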