Are Word Embedding Methods Stable and Should We Care About It?

Angana Borah, M. Barman, Amit Awekar
{"title":"词嵌入方法稳定吗?我们应该关注它吗?","authors":"Angana Borah, M. Barman, Amit Awekar","doi":"10.1145/3465336.3475098","DOIUrl":null,"url":null,"abstract":"A representation learning method is considered stable if it consistently generates similar representation of the given data across multiple runs. Word Embedding Methods (WEMs) are a class of representation learning methods that generate dense vector representation for each word in the given text data. The central idea of this paper is to explore the stability measurement of WEMs using intrinsic evaluation based on word similarity. We experiment with three popular WEMs: Word2Vec, GloVe, and fastText. For stability measurement, we investigate the effect of five parameters involved in training these models. We perform experiments using four real-world datasets from different domains: Wikipedia, News, Song lyrics, and European parliament proceedings. We also observe the effect of WEM stability on two downstream tasks: Clustering and Fairness evaluation. Our experiments indicate that amongst the three WEMs, fastText is the most stable, followed by GloVe and Word2Vec.","PeriodicalId":325072,"journal":{"name":"Proceedings of the 32nd ACM Conference on Hypertext and Social Media","volume":"59 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Are Word Embedding Methods Stable and Should We Care About It?\",\"authors\":\"Angana Borah, M. Barman, Amit Awekar\",\"doi\":\"10.1145/3465336.3475098\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A representation learning method is considered stable if it consistently generates similar representation of the given data across multiple runs. Word Embedding Methods (WEMs) are a class of representation learning methods that generate dense vector representation for each word in the given text data. The central idea of this paper is to explore the stability measurement of WEMs using intrinsic evaluation based on word similarity. We experiment with three popular WEMs: Word2Vec, GloVe, and fastText. For stability measurement, we investigate the effect of five parameters involved in training these models. We perform experiments using four real-world datasets from different domains: Wikipedia, News, Song lyrics, and European parliament proceedings. We also observe the effect of WEM stability on two downstream tasks: Clustering and Fairness evaluation. 
Our experiments indicate that amongst the three WEMs, fastText is the most stable, followed by GloVe and Word2Vec.\",\"PeriodicalId\":325072,\"journal\":{\"name\":\"Proceedings of the 32nd ACM Conference on Hypertext and Social Media\",\"volume\":\"59 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-04-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 32nd ACM Conference on Hypertext and Social Media\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3465336.3475098\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 32nd ACM Conference on Hypertext and Social Media","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3465336.3475098","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 8

Abstract

A representation learning method is considered stable if it consistently generates similar representations of the given data across multiple runs. Word Embedding Methods (WEMs) are a class of representation learning methods that generate a dense vector representation for each word in the given text data. The central idea of this paper is to explore the stability measurement of WEMs using intrinsic evaluation based on word similarity. We experiment with three popular WEMs: Word2Vec, GloVe, and fastText. For stability measurement, we investigate the effect of five parameters involved in training these models. We perform experiments using four real-world datasets from different domains: Wikipedia, News, Song lyrics, and European parliament proceedings. We also observe the effect of WEM stability on two downstream tasks: Clustering and Fairness evaluation. Our experiments indicate that amongst the three WEMs, fastText is the most stable, followed by GloVe and Word2Vec.
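
The intrinsic, word-similarity-based stability measure the abstract describes can be made concrete with a small experiment: train the same WEM twice with different random seeds and compare each word's top-k nearest neighbours across the two runs. The sketch below is a minimal illustration, not the authors' code or exact protocol; it uses gensim's Word2Vec as one representative WEM, and the toy corpus, hyperparameters, and neighbour-overlap metric are assumptions made for the example.

```python
# Minimal sketch of run-to-run stability measurement for a WEM.
# Assumptions: gensim Word2Vec stands in for the three WEMs studied;
# the corpus, hyperparameters, and top-k overlap metric are illustrative.
from gensim.models import Word2Vec

corpus = [
    ["word", "embeddings", "map", "words", "to", "dense", "vectors"],
    ["stable", "methods", "give", "similar", "vectors", "across", "runs"],
    ["we", "compare", "nearest", "neighbours", "across", "runs"],
]  # toy corpus; the paper uses Wikipedia, News, lyrics, and Europarl

def train(seed):
    # workers=1 makes each run reproducible for its own seed;
    # varying the seed exposes the run-to-run randomness being measured
    return Word2Vec(corpus, vector_size=50, window=2, min_count=1,
                    workers=1, seed=seed, epochs=20)

def neighbour_overlap(m1, m2, word, k=5):
    # stability of one word: fraction of shared top-k nearest neighbours
    n1 = {w for w, _ in m1.wv.most_similar(word, topn=k)}
    n2 = {w for w, _ in m2.wv.most_similar(word, topn=k)}
    return len(n1 & n2) / k

m1, m2 = train(seed=1), train(seed=2)
vocab = [w for w in m1.wv.index_to_key if w in m2.wv]
score = sum(neighbour_overlap(m1, m2, w) for w in vocab) / len(vocab)
print(f"mean top-5 neighbour overlap across runs: {score:.2f}")
```

A score near 1.0 means the two runs agree on each word's neighbourhood (a stable method); a score near 0.0 means the learned similarity structure changes substantially between runs. The same harness could be pointed at GloVe or fastText vectors to reproduce the paper's cross-method comparison.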