Tero Hakala, Tiina Lindh-Knuutila, Annika Hultén, Minna Lehtonen, R. Salmelin
Subword representations successfully decode brain responses to morphologically complex written words
Neurobiology of Language, published 2024-06-10. DOI: 10.1162/nol_a_00149 (https://doi.org/10.1162/nol_a_00149)
Abstract
This study extends the idea of decoding word-evoked brain activations using a corpus-semantic vector space to multimorphemic words in Finnish, an agglutinative language. The corpus-semantic models are trained on word segments, and decoding is carried out with word vectors composed of these segments. We tested several alternative vector-space models based on different segmentations: no segmentation (whole word), linguistic morphemes, statistical morphemes, random segmentation, and character-level 1-, 2- and 3-grams. Each model was paired with MEG responses recorded to multimorphemic words in a visual word recognition task. For all variants, the decoding accuracy exceeded the standard word-label permutation-based significance thresholds at 350–500 ms after stimulus onset. However, the critical segment-label permutation test revealed that only the morphologically aware segmentations reached significance in the brain decoding task. The results suggest that both whole-word forms and morphemes are represented in the brain, and they show that neural decoding using corpus-semantic word representations derived from compositional subword segments is also applicable to multimorphemic word forms. This is especially relevant for languages with complex morphology, because a large proportion of word forms are rare and it can be difficult to find statistically reliable surface representations for them in any large corpus.
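The segment-based pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how segment vectors are combined or how the permutation tests are computed, so element-wise summation, the label-shuffling scheme, and the p-value formula below are all assumptions, and every function name and toy vector is hypothetical.

```python
import random


def char_ngrams(word, n):
    """Return the overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def compose(segments, segment_vectors):
    """Sum segment vectors element-wise (summation is an assumption here)."""
    dim = len(next(iter(segment_vectors.values())))
    word_vector = [0.0] * dim
    for segment in segments:
        for k, value in enumerate(segment_vectors[segment]):
            word_vector[k] += value
    return word_vector


def permute_segment_labels(segment_vectors, rng):
    """Randomly reassign the existing vectors to the segment labels."""
    labels = list(segment_vectors)
    vectors = [segment_vectors[label] for label in labels]
    rng.shuffle(vectors)
    return dict(zip(labels, vectors))


def permutation_p_value(observed, null_scores):
    """Share of null scores >= the observed score, with a +1 correction."""
    at_least = sum(1 for score in null_scores if score >= observed)
    return (1 + at_least) / (1 + len(null_scores))


# Hypothetical toy vectors; real ones would come from a corpus-trained model.
toy_vectors = {"talo": [1.0, 0.0], "i": [0.0, 1.0], "ssa": [1.0, 1.0]}

# Finnish "taloissa" ("in houses") segmented into morphemes talo + i + ssa.
word_vector = compose(["talo", "i", "ssa"], toy_vectors)

# One segment-label permutation: same labels, vectors reassigned at random.
shuffled = permute_segment_labels(toy_vectors, random.Random(0))
```

In a full analysis, the decoding accuracy obtained with the true segment vectors would be compared against accuracies obtained from many such shuffled assignments, with `permutation_p_value` summarizing where the observed accuracy falls in that null distribution.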