Title: "If you can't beat them, join them": A Word Transformation based Generalized Skip-gram for Embedding Compound Words
Authors: Debasis Ganguly, Shripad Bhat, Chandan Biswas
DOI: 10.1145/3574318.3574346
Published in: Proceedings of the 14th Annual Meeting of the Forum for Information Retrieval Evaluation
Publication date: 2022-12-09
Citations: 0
Abstract
While a class of data-driven approaches has been shown to be effective for embedding words of languages that are relatively simple in terms of inflection and compounding (e.g. English), an open area of investigation is how to integrate language-specific characteristics into the framework of an embedding model. Standard word embedding approaches, such as word2vec and GloVe, embed each word into a high-dimensional dense vector. However, these approaches may not adequately capture the inherent linguistic phenomenon of word compounding. We propose a stochastic word-transformation-based generalization of the skip-gram algorithm, which seeks to improve the representation of compositional compound words by leveraging information from the contexts of their constituents. Our experiments show that addressing the compounding effect of a language as part of the word embedding objective outperforms existing compounding-specific post-transformation approaches on word semantics prediction and word polarity prediction tasks.
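The core idea, as described in the abstract, is to let the constituents of a compound word share the compound's contexts during skip-gram training. A minimal sketch of that idea is shown below; the constituent lookup table, the transformation probability p, and all function names are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Hypothetical constituent lookup; in practice this would come from a
# language-specific compound splitter (entries here are illustrative).
CONSTITUENTS = {
    "football": ["foot", "ball"],
    "notebook": ["note", "book"],
}

def transform(word, p=0.5, rng=random):
    """With probability p, stochastically replace a compound word with
    one of its constituents; otherwise keep the word unchanged."""
    parts = CONSTITUENTS.get(word)
    if parts and rng.random() < p:
        return rng.choice(parts)
    return word

def skipgram_pairs(tokens, window=2, p=0.5, rng=random):
    """Yield (center, context) training pairs where the center word may
    be stochastically transformed, so that constituents are trained
    against the contexts of the compounds they occur in."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((transform(center, p, rng), tokens[j]))
    return pairs

sentence = ["he", "kicked", "the", "football", "hard"]
pairs = skipgram_pairs(sentence, window=1, p=1.0)
```

With p = 1.0 every occurrence of "football" as a center word is replaced by "foot" or "ball", so the constituents receive gradient updates from the compound's contexts; with p = 0.0 the procedure reduces to the standard skip-gram pair generation.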