Semantic Data Set Construction from Human Clustering and Spatial Arrangement

IF 5.3 2区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics Pub Date : 2021-04-01 DOI:10.1162/coli_a_00396

Olga Majewska, Diana McCarthy, Jasper J. F. van den Bosch, N. Kriegeskorte, Ivan Vulic, A. Korhonen

{"title":"Semantic Data Set Construction from Human Clustering and Spatial Arrangement","authors":"Olga Majewska, Diana McCarthy, Jasper J. F. van den Bosch, N. Kriegeskorte, Ivan Vulic, A. Korhonen","doi":"10.1162/coli_a_00396","DOIUrl":null,"url":null,"abstract":"Abstract Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. There are limitations with these approaches that either do not provide any context for judgments, and thereby ignore ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multi-way similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity. In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set’s vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"47 1","pages":"69-116"},"PeriodicalIF":5.3000,"publicationDate":"2021-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00396","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 5

Abstract

Abstract Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. There are limitations with these approaches that either do not provide any context for judgments, and thereby ignore ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multi-way similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity. In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set’s vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”

查看原文本刊更多论文

基于人聚类和空间排列的语义数据集构建

摘要对词汇语义表征学习模型的研究通常利用某种形式的内在评价来确保所学习的表征反映人类的语义判断。词汇语义相似性估计是一种广泛使用的评估方法，但通常侧重于孤立地对单词进行成对判断，或者仅限于特定的上下文和词汇刺激。这些方法存在局限性，要么不提供任何上下文进行判断，从而忽略歧义，要么提供非常具体的句子上下文，然后无法用于生成更大的词汇资源。此外，不考虑两个以上项目之间的相似性。我们对我们最近提出的大规模数据集构建方法进行了全面的描述和分析，该方法在第一阶段产生了大量动词样本的语义分类，并在第二阶段对由此产生的语义类进行了多向相似性判断。该方法使用认知神经科学领域提出的空间多重排列方法来捕捉视觉刺激的多向相似性判断。我们已经将这种方法应用于处理多义词的语言刺激和比以前的工作大得多的样本。我们专门针对动词，但这种方法同样适用于其他词性。我们对第一阶段的数据进行了聚类分析，并展示了这在构建综合动词资源中是如何有用的。我们还分析了第二阶段捕获的语义信息，并讨论了空间诱导的相似性判断的潜力，以更好地反映人类对单词相似性的概念。我们展示了如何将生成的数据集用于对语义聚类和语义相似性的内在任务的表示学习模型进行细粒度分析和评估。特别是，我们发现更强的静态单词嵌入方法在单词级别的相似性和聚类方面仍然优于最近的预训练方法中出现的词汇表示。此外，由于数据集的广泛覆盖，我们能够通过评估特定语义域（如“热”或“运动”）上的FrameNet和VerbNet改进模型，来比较专门化特定类型外部知识的向量表示的好处

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Computational Linguistics 工程技术-计算机：跨学科应用

CiteScore

15.80

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.