{"title":"Semantic representations for under-resourced languages","authors":"Jocelyn Mazarura, A. D. Waal, J. D. Villiers","doi":"10.1145/3351108.3351133","DOIUrl":null,"url":null,"abstract":"Distributional semantics studies methods for learning semantic representation of natural text. The semantic similarity between words and documents can be derived from this presentation which leads to other practical NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots and machine translation. Under-resourced language data is small in size. Small data implies not only small corpora, but also short documents within the corpus. In this paper we investigate the performance of word embedding techniques on two under-resourced languages. We investigate two topic models, LDA and DMM as well as a word embedding word2vec. We find DMM to perform better than LDA as a topic model embedding. DMM and word2vec perform similar in a semantic evaluation task of aligned corpora.","PeriodicalId":269578,"journal":{"name":"Research Conference of the South African Institute of Computer Scientists and Information Technologists","volume":"136 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Conference of the South African Institute of Computer Scientists and Information Technologists","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3351108.3351133","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1
Abstract
Distributional semantics studies methods for learning semantic representations of natural text. The semantic similarity between words and documents can be derived from these representations, which in turn supports practical NLP applications such as collaborative filtering, aspect-based sentiment analysis, intent classification for chatbots, and machine translation. Under-resourced language data is small in size: small data implies not only small corpora, but also short documents within the corpus. In this paper we investigate the performance of word embedding techniques on two under-resourced languages. We investigate two topic models, LDA and DMM, as well as the word embedding model word2vec. We find that DMM performs better than LDA as a topic model embedding, and that DMM and word2vec perform similarly on a semantic evaluation task over aligned corpora.
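The abstract names word2vec and LDA as the embedding techniques compared; as a rough illustration of how such models can be trained on a small corpus and queried for semantic similarity, the sketch below uses the gensim library. The toy corpus, parameter values, and the word pair used in the similarity check are illustrative assumptions, not the paper's actual data or settings, and DMM is omitted because it is not part of gensim.

```python
# Illustrative sketch only (assumes gensim >= 4.0); the corpus, parameters
# and query words are made up for demonstration, not taken from the paper.
from gensim.models import Word2Vec, LdaModel
from gensim.corpora import Dictionary

# A stand-in "small corpus" of short, already-tokenised documents.
texts = [
    ["language", "model", "text"],
    ["topic", "model", "document"],
    ["word", "embedding", "text"],
    ["document", "word", "topic"],
]

# word2vec: dense word vectors; small vector_size/min_count because the corpus is tiny.
w2v = Word2Vec(sentences=texts, vector_size=50, window=3, min_count=1, workers=1, seed=1)
print(w2v.wv.similarity("text", "document"))  # cosine similarity between two word vectors

# LDA: per-document topic mixtures over a bag-of-words corpus.
dictionary = Dictionary(texts)
bow_corpus = [dictionary.doc2bow(doc) for doc in texts]
lda = LdaModel(corpus=bow_corpus, id2word=dictionary, num_topics=2, passes=10, random_state=1)
print(lda.get_document_topics(bow_corpus[0]))  # topic distribution of the first document
```

Because each toy document here has only a few tokens, LDA's per-document topic mixture has little evidence to estimate from; DMM instead assumes a single topic per document, which is the usual motivation for preferring it on short texts and is consistent with the abstract's finding that DMM outperforms LDA in this setting.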