掩码词性模型:长上下文建模是否有助于无监督pos标注?

North American Chapter of the Association for Computational Linguistics Pub Date : 2022-06-30 DOI:10.48550/arXiv.2206.14969

Xiang Zhou, Shiyue Zhang, Mohit Bansal

{"title":"掩码词性模型:长上下文建模是否有助于无监督pos标注?","authors":"Xiang Zhou, Shiyue Zhang, Mohit Bansal","doi":"10.48550/arXiv.2206.14969","DOIUrl":null,"url":null,"abstract":"Previous Part-Of-Speech (POS) induction models usually assume certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, the subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset as well as the universal treebank containing 10 diverse languages. Though modeling the long-term dependency should ideally help this task, our ablation study shows mixed trends in different languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model’s ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges.","PeriodicalId":382084,"journal":{"name":"North American Chapter of the Association for Computational Linguistics","volume":"50 2 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?\",\"authors\":\"Xiang Zhou, Shiyue Zhang, Mohit Bansal\",\"doi\":\"10.48550/arXiv.2206.14969\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Previous Part-Of-Speech (POS) induction models usually assume certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, the subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset as well as the universal treebank containing 10 diverse languages. Though modeling the long-term dependency should ideally help this task, our ablation study shows mixed trends in different languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model’s ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges.\",\"PeriodicalId\":382084,\"journal\":{\"name\":\"North American Chapter of the Association for Computational Linguistics\",\"volume\":\"50 2 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-06-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"North American Chapter of the Association for Computational Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.48550/arXiv.2206.14969\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"North American Chapter of the Association for Computational Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2206.14969","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

以前的词性归纳模型通常假设一定的独立性假设(如马尔可夫、单向、局部依赖)，而这些假设在真实语言中并不成立。例如，主谓一致既可以是长期的，也可以是双向的。为了方便灵活的依赖建模，我们受到最近成功的掩码语言模型(MLM)的启发，提出了一个掩码词性模型(MPoSM)。MPoSM可以对任意标签依赖性进行建模，并通过掩码词性重构的目的进行词性归纳。我们在英文Penn WSJ数据集以及包含10种不同语言的通用树库上都取得了具有竞争力的结果。虽然建立长期依赖关系的模型在理想情况下应该有助于这项任务，但我们的消融研究显示，不同语言的趋势不一。为了更好地理解这一现象，我们设计了一个新的综合实验，可以专门诊断模型学习标签一致性的能力。令人惊讶的是，我们发现即使是强大的基线也不能在一个非常简单的设置中始终如一地解决这个问题:相邻单词之间的一致性。尽管如此，MPoSM总体上实现了更好的性能。最后，我们进行了详细的误差分析，以阐明其他遗留的挑战。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Masked Part-Of-Speech Model: Does Modeling Long Context Help Unsupervised POS-tagging?

Previous Part-Of-Speech (POS) induction models usually assume certain independence assumptions (e.g., Markov, unidirectional, local dependency) that do not hold in real languages. For example, the subject-verb agreement can be both long-term and bidirectional. To facilitate flexible dependency modeling, we propose a Masked Part-of-Speech Model (MPoSM), inspired by the recent success of Masked Language Models (MLM). MPoSM can model arbitrary tag dependency and perform POS induction through the objective of masked POS reconstruction. We achieve competitive results on both the English Penn WSJ dataset as well as the universal treebank containing 10 diverse languages. Though modeling the long-term dependency should ideally help this task, our ablation study shows mixed trends in different languages. To better understand this phenomenon, we design a novel synthetic experiment that can specifically diagnose the model’s ability to learn tag agreement. Surprisingly, we find that even strong baselines fail to solve this problem consistently in a very simplified setting: the agreement between adjacent words. Nonetheless, MPoSM achieves overall better performance. Lastly, we conduct a detailed error analysis to shed light on other remaining challenges.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

North American Chapter of the Association for Computational Linguistics

自引率

0.00%

发文量