{"title":"Getting the ##life out of living: How Adequate Are Word-Pieces for Modelling Complex Morphology?","authors":"Stav Klein, Reut Tsarfaty","doi":"10.18653/v1/2020.sigmorphon-1.24","DOIUrl":null,"url":null,"abstract":"This work investigates the most basic units that underlie contextualized word embeddings, such as BERT — the so-called word pieces. In Morphologically-Rich Languages (MRLs) which exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused, intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Modern Hebrew, as a proxy to evaluate the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of WPs, we can improve the WPs tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We suggest that linguistically-informed word-pieces schemes, that make the morphological structure explicit, might boost performance for MRLs.","PeriodicalId":186158,"journal":{"name":"Special Interest Group on Computational Morphology and Phonology Workshop","volume":"43 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Special Interest Group on Computational Morphology and Phonology Workshop","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.18653/v1/2020.sigmorphon-1.24","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 36
Abstract
This work investigates the most basic units that underlie contextualized word embeddings such as BERT: the so-called word-pieces (WPs). In Morphologically Rich Languages (MRLs), which exhibit morphological fusion and non-concatenative morphology, the different units of meaning within a word may be fused and intertwined, and cannot be separated linearly. Therefore, when using word-pieces in MRLs, we must consider that: (1) a linear segmentation into sub-word units might not capture the full morphological complexity of words; and (2) representations that leave morphological knowledge on sub-word units inaccessible might negatively affect performance. Here we empirically examine the capacity of word-pieces to capture morphology by investigating the task of multi-tagging in Modern Hebrew, as a proxy for evaluating the underlying segmentation. Our results show that, while models trained to predict multi-tags for complete words outperform models tuned to predict the distinct tags of WPs, we can improve WP tag prediction by purposefully constraining the word-pieces to reflect their internal functions. We suggest that linguistically informed word-piece schemes that make the morphological structure explicit might boost performance for MRLs.
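To make point (1) concrete, below is a minimal, self-contained sketch of the greedy longest-match-first segmentation that BERT-style word-piece tokenizers perform. The vocabulary here is a hypothetical toy set, not one taken from any released model (real word-piece vocabularies are learned from corpora), and the function name is ours for illustration. The sketch shows why the split is strictly linear and left-to-right, and so has no way to represent fused or non-concatenative structure, e.g. a Hebrew form in which a conjunction, a preposition, a noun stem, and a possessive suffix are written as a single word.

```python
# Minimal sketch of greedy longest-match-first word-piece segmentation
# (the scheme used by BERT-style tokenizers), over a hypothetical toy vocabulary.

def wordpiece_tokenize(word, vocab, unk="[UNK]", prefix="##"):
    """Segment `word` into the longest matching vocabulary pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest remaining substring first, shrinking until a vocab hit.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = prefix + candidate  # continuation pieces are marked with ##
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no segmentation of this word is possible with this vocabulary
        pieces.append(piece)
        start = end
    return pieces

# Toy vocabulary, for illustration only.
vocab = {"li", "##ving", "un", "##happy"}
print(wordpiece_tokenize("living", vocab))   # ['li', '##ving']
print(wordpiece_tokenize("unhappy", vocab))  # ['un', '##happy']
```

The segmentation is purely string-driven: each piece is a contiguous span of characters, chosen greedily. Nothing in this procedure can mark a piece as, say, a preposition or a possessive suffix, which is why the paper contrasts such linear splits with linguistically informed word-piece schemes that expose the internal functions of the pieces.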