Universal constituency treebanking and parsing: A pilot study
Jianling Li, Meishan Zhang, Jianrong Wang, Min Zhang, Yue Zhang
Computer Speech and Language, Volume 95, Article 101826 (published 2025-06-06)
DOI: 10.1016/j.csl.2025.101826
Citations: 0
Abstract
Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.
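The first approach described above, phrase label mapping, can be illustrated with a minimal sketch. The mapping table and label inventory below are illustrative assumptions, not the paper's actual scheme: the idea is simply that treebank-specific constituent labels (here, Penn Treebank-style) are collapsed into a shared universal set, applied recursively over each tree.

```python
# Hypothetical sketch of phrase label mapping: relabeling a language-specific
# constituency tree with a shared universal label set. The mapping table and
# the universal label names are illustrative assumptions, not the paper's scheme.

# Example mapping from Penn Treebank-style labels to a universal inventory
LABEL_MAP = {
    "NP": "NP", "NML": "NP",                         # nominal phrases
    "VP": "VP",                                      # verb phrases
    "ADJP": "AP",                                    # adjectival phrases
    "ADVP": "AdvP",                                  # adverbial phrases
    "PP": "PP",                                      # prepositional phrases
    "S": "S", "SBAR": "S", "SINV": "S", "SQ": "S",   # clausal labels collapsed
}

def map_tree(tree):
    """Recursively relabel a tree given as nested tuples:
    (label, child1, child2, ...) where leaves are plain strings."""
    if isinstance(tree, str):        # leaf token: leave unchanged
        return tree
    label, *children = tree
    universal = LABEL_MAP.get(label, "X")  # unknown labels fall back to "X"
    return (universal, *(map_tree(c) for c in children))

src = ("SBAR", ("NP", "it"), ("VP", "works"))
print(map_tree(src))  # → ('S', ('NP', 'it'), ('VP', 'works'))
```

Real treebanks need a mapping curated per language and label set, which is why the paper notes this route covers only a handful of languages; the LLM-based UD-to-UC conversion is their way around that bottleneck.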
Journal overview:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.