Universal constituency treebanking and parsing: A pilot study

IF 3.4 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jianling Li, Meishan Zhang, Jianrong Wang, Min Zhang, Yue Zhang
{"title":"Universal constituency treebanking and parsing: A pilot study","authors":"Jianling Li ,&nbsp;Meishan Zhang ,&nbsp;Jianrong Wang ,&nbsp;Min Zhang ,&nbsp;Yue Zhang","doi":"10.1016/j.csl.2025.101826","DOIUrl":null,"url":null,"abstract":"<div><div>Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.<span><span><sup>1</sup></span></span></div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":"95 ","pages":"Article 101826"},"PeriodicalIF":3.4000,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230825000518","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Universal language processing is crucial for developing models that work across multiple languages. However, universal constituency parsing has lagged due to the lack of annotated universal constituency (UC) treebanks. To address this, we propose two cost-effective approaches. First, we unify existing annotated language-specific treebanks using phrase label mapping to create UC trees, but this is limited to only a handful of languages. Second, we develop a novel method to convert Universal Dependency (UD) treebanks into UC treebanks using large language models (LLMs) with syntactic knowledge, enabling the construction of UC treebanks for over 150 languages. We adopt the graph-based max margin model as our baseline and introduce a language adapter to fine-tune the universal parser. Our experiments show that the language adapter maintains performance for high-resource languages and improves performance for low-resource languages. We evaluate different scales of multilingual pre-trained models, confirming the effectiveness and robustness of our approach. In summary, we conduct the first pilot study on universal constituency parsing, introducing novel methods for creating and utilizing UC treebanks, thereby advancing treebanking and parsing methodologies.
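The first approach mentioned in the abstract, unifying language-specific treebanks through phrase label mapping, can be illustrated with a minimal sketch. The mapping table and the use of nltk.Tree below are illustrative assumptions, not the paper's actual label inventory or conversion procedure.

```python
# Minimal sketch of the phrase-label-mapping idea: relabel constituents from
# treebank-specific inventories (e.g., PTB-style NP/VP, TueBa-D/Z-style
# NX/SIMPX) onto a shared universal phrase label set, so that trees from
# different language-specific treebanks become comparable UC trees.
# The mapping table is a hypothetical example, not taken from the paper.
from nltk import Tree

UNIVERSAL_LABEL_MAP = {
    "NP": "NP", "NX": "NP",      # nominal phrases
    "VP": "VP", "VXFIN": "VP",   # verbal phrases
    "PP": "PP",                  # prepositional phrases
    "S": "S", "SIMPX": "S",      # clauses
}

def to_universal(tree, default="X"):
    """Recursively relabel every internal node with its universal label."""
    if isinstance(tree, str):    # leaf token, keep unchanged
        return tree
    label = UNIVERSAL_LABEL_MAP.get(tree.label(), default)
    return Tree(label, [to_universal(child, default) for child in tree])

if __name__ == "__main__":
    ptb_tree = Tree.fromstring("(S (NP (DT the) (NN parser)) (VP (VBZ works)))")
    print(to_universal(ptb_tree))
    # -> (S (NP (X the) (X parser)) (VP (X works)))
```

Pre-terminal (POS) labels fall back to a generic X in this sketch only because it focuses on phrase labels; how part-of-speech tags are handled in the unified treebanks is a design decision of the paper itself.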
Source Journal
Computer Speech and Language
Category: Engineering & Technology / Computer Science: Artificial Intelligence
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
Aims & scope: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.