Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task

IF 1.9 | Zone 3, Computer Science | JCR Q3, Computer Science, Artificial Intelligence

Natural Language Engineering Pub Date : 2022-06-09 DOI:10.1017/S1351324922000225

M. Tikhonova, V. Mikhailov, D. Pisarevskaya, Valentin Malykh, Tatiana Shavrina

Journal: Natural Language Engineering. Volume: 29 1. Pages: 554-583. Not open access.

Citations: 4

Abstract

Recent research has reported that standard fine-tuning approaches can be unstable because they are sensitive to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, different prediction confidences, and inconsistent generalization among instances of the same model that are independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference (NLI), a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the multilingual BERT (mBERT) model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure, and that it struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance generalization performance and fine-tuning stability, which we attribute to cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data may instead hurt the performance of model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.
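One way to make the notion of fine-tuning instability concrete is to compare the predictions of several independently fine-tuned runs on the same diagnostic examples. The following is a minimal sketch, not the authors' evaluation code: the function name `per_category_stability`, the label strings, the category names, and the toy predictions are all hypothetical, and the standard deviation of per-category accuracy across runs is only one assumed proxy for instability.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_category_stability(runs, gold, categories):
    """Return {category: (mean accuracy, std of accuracy)} across runs.

    runs       -- list of prediction lists, one per fine-tuning run (seed)
    gold       -- list of gold labels, aligned with each prediction list
    categories -- list of diagnostic category names, one per example
    """
    # Group example indices by diagnostic category.
    by_category = defaultdict(list)
    for idx, cat in enumerate(categories):
        by_category[cat].append(idx)

    summary = {}
    for cat, indices in by_category.items():
        # Accuracy of each independently fine-tuned run on this category.
        accs = [
            sum(preds[i] == gold[i] for i in indices) / len(indices)
            for preds in runs
        ]
        # A larger spread across runs suggests less stable fine-tuning.
        summary[cat] = (mean(accs), pstdev(accs))
    return summary


# Hypothetical toy data: 4 examples, 3 runs. These are NOT results from the paper.
gold = ["entailment", "not_entailment", "entailment", "not_entailment"]
categories = ["negation", "negation", "numeracy", "numeracy"]
runs = [
    ["entailment", "not_entailment", "entailment", "entailment"],
    ["entailment", "entailment", "entailment", "entailment"],
    ["entailment", "not_entailment", "entailment", "entailment"],
]

summary = per_category_stability(runs, gold, categories)
for cat, (avg, spread) in summary.items():
    print(f"{cat}: mean accuracy = {avg:.3f}, std across runs = {spread:.3f}")
```

In this toy setup the runs disagree on one negation example, so that category shows a non-zero spread, while the numeracy category is consistently half right across all runs: stable, but not learned well. The two numbers therefore capture different failure modes.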


Source journal

Natural Language Engineering (Computer Science, Artificial Intelligence)

CiteScore: 5.90

Self-citation rate: 12.00%

Review time: >12 weeks

About the journal: Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics, from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multimodal interfaces, it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.