Empirical Comparison of Word Similarity Measures Based on Co-Occurrence, Context, and a Vector Space Model

Q3 Social Sciences

Journal of Information Science Theory and Practice. Pub Date: 2020-01-01. DOI: 10.1633/JISTAP.2020.8.2.1

Natsuki Kadowaki, Kazuaki Kishida


Citations: 1

Abstract

Word similarity is often measured to enhance system performance in information retrieval and related areas. This paper reports an experimental comparison of the values of word similarity measures computed for 50 intentionally selected words from a Reuters corpus. Three types of measures were compared: (1) co-occurrence-based similarity measures (for which the co-occurrence frequency is counted as a number of documents or a number of sentences), (2) context-based distributional similarity measures obtained from latent Dirichlet allocation (LDA), nonnegative matrix factorization (NMF), and the Word2Vec algorithm, and (3) similarity measures computed from the tf-idf weights of each word according to a vector space model (VSM). Among the pairs of measures, the Pearson correlation coefficient was highest between the VSM-based similarity measures and the co-occurrence-based similarity measures counted by number of documents. Group-average agglomerative hierarchical clustering was also applied to the similarity matrices computed by the individual measures. An evaluation of the resulting cluster sets against an answer set revealed that the VSM- and LDA-based similarity measures performed best.
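As a rough illustration of two of the measure families named in the abstract, the sketch below computes a document-level co-occurrence similarity and a tf-idf (VSM) cosine similarity for every word pair, then the Pearson correlation between the two sets of values. It is not the authors' implementation: the toy corpus, the word list, the choice of the Jaccard coefficient for co-occurrence, the tf * log(N/df) weighting, and all function names are assumptions made for this example.

```python
import math
from itertools import combinations

# Toy corpus (hypothetical; the paper used a Reuters corpus).
docs = [
    "oil price rise market".split(),
    "oil export market trade".split(),
    "bank rate market".split(),
    "bank loan rate rise".split(),
]
words = ["oil", "market", "bank", "rate"]  # hypothetical target words


def jaccard_cooccurrence(w1, w2):
    """Document-level co-occurrence similarity (Jaccard is an assumption)."""
    d1 = {i for i, d in enumerate(docs) if w1 in d}
    d2 = {i for i, d in enumerate(docs) if w2 in d}
    return len(d1 & d2) / len(d1 | d2)


def tfidf_vector(w):
    """Represent a word as a vector of tf-idf weights, one per document."""
    df = sum(1 for d in docs if w in d)
    idf = math.log(len(docs) / df)  # assumed idf variant
    return [d.count(w) * idf for d in docs]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Compare the two measures over all word pairs.
pairs = list(combinations(words, 2))
cooc_sims = [jaccard_cooccurrence(a, b) for a, b in pairs]
vsm_sims = [cosine(tfidf_vector(a), tfidf_vector(b)) for a, b in pairs]
r = pearson(cooc_sims, vsm_sims)

for (a, b), c, v in zip(pairs, cooc_sims, vsm_sims):
    print(f"{a}-{b}: co-occurrence={c:.3f}, vsm={v:.3f}")
print(f"Pearson r = {r:.3f}")
```

On this toy data the two measures rank the word pairs similarly, which is in the same spirit as the high correlation reported in the abstract, though four short documents prove nothing about the real corpus. The similarity values could also be arranged into a matrix and passed to a group-average hierarchical clustering routine, which is omitted here.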


Source journal

Journal of Information Science Theory and Practice (Social Sciences: Library and Information Sciences)

CiteScore: 1.10

Self-citation rate: 0.00%

Number of articles published: not listed

Review time: 12 weeks

About the journal: The Journal of Information Science Theory and Practice (JISTaP) is an international journal that aims at publishing original studies, review papers and brief communications on information science theory and practice. The journal provides an international forum for practical as well as theoretical research in the interdisciplinary areas of information science, such as information processing and management, knowledge organization, scholarly communication and bibliometrics. To foster scholarly communication among researchers and practitioners of library and information science around the globe, JISTaP offers a no-fee open access publishing venue where a team of dedicated editors, reviewers and staff members volunteer their services to ensure rapid dissemination and communication of scholarly works that make significant contributions. In a modern society, where information production and consumption grow at an astronomical rate, the science of information management, organization, and analysis is invaluable in effective utilization of information. The key objective of the journal is to foster research that can contribute to advancements and innovations in the theory and practice of information and library science so as to promote timely application of the findings from scientific investigations to everyday life. Recognizing the importance of the global perspective with understanding of region-specific issues, JISTaP encourages submissions of manuscripts that discuss global implications of regional findings as well as regional implications of global findings.