Transactions of the Association for Computational Linguistics: Latest Articles

An Efficient Self-Supervised Cross-View Training For Sentence Embedding
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-11-06 DOI: 10.1162/tacl_a_00620
Peerat Limkonchotiwat, Wuttikorn Ponwitayarat, Lalita Lowphansirikul, Can Udomcharoenchaikit, E. Chuangsuwanich, Sarana Nutanong
Abstract: Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases.
Citations: 0
U-CORE: A Unified Deep Cluster-wise Contrastive Framework for Open Relation Extraction
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-11-01 DOI: 10.1162/tacl_a_00604
Jie Zhou, Shenpo Dong, Yunxin Huang, Meihan Wu, Haili Li, Jingnan Wang, Hongkui Tu, Xiaodong Wang
Abstract: Within Open Relation Extraction (ORE) tasks, the Zero-shot ORE method is to generalize undefined relations from predefined relations, while the Unsupervised ORE method is to extract undefined relations without the need for annotations. However, despite the possibility of overlap between predefined and undefined relations in the training data, a unified framework for both Zero-shot and Unsupervised ORE has yet to be established. To address this gap, we propose U-CORE: A Unified Deep Cluster-wise Contrastive Framework for both Zero-shot and Unsupervised ORE, by leveraging techniques from Contrastive Learning (CL) and Clustering. U-CORE overcomes the limitations of CL-based Zero-shot ORE methods by employing Cluster-wise CL that preserves both local smoothness and global semantics. Additionally, we employ a deep-cluster-based updater that optimizes the cluster center, thus enhancing the accuracy and efficiency of the model. To increase the stability of the model, we adopt Adaptive Self-paced Learning that effectively addresses the data-shifting problems. Experimental results on three well-known datasets demonstrate that U-CORE significantly improves upon existing methods by showing an average improvement of 7.35% ARI on Zero-shot ORE tasks and 15.24% ARI on Unsupervised ORE tasks.
Citations: 0
AfriSpeech-200: Pan-African Accented Speech Dataset for Clinical and General Domain ASR
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-09-30 DOI: 10.1162/tacl_a_00627
Tobi Olatunji, Tejumade Afonja, Aditya Yadavalli, C. Emezue, Sahib Singh, Bonaventure F. P. Dossou, Joanne Osuchukwu, Salomey Osei, A. Tonja, Naome A. Etori, Clinton Mbataku
Abstract: Africa has a very poor doctor-to-patient ratio. At very busy clinics, doctors could see 30+ patients per day (a heavy patient burden compared with developed countries), but productivity tools such as clinical automatic speech recognition (ASR) are lacking for these overworked clinicians. By contrast, clinical ASR is mature, even ubiquitous, in developed nations, and clinician-reported performance of commercial clinical ASR systems is generally satisfactory. Furthermore, the recent performance of general domain ASR is approaching human accuracy. However, several gaps exist. Several publications have highlighted racial bias with speech-to-text algorithms, and performance on minority accents lags significantly. To our knowledge, there is no publicly available research or benchmark on accented African clinical ASR, and speech data is non-existent for the majority of African accents. We release AfriSpeech, 200 hours of Pan-African English speech, 67,577 clips from 2,463 unique speakers across 120 indigenous accents from 13 countries for clinical and general domain ASR, a benchmark test set, with publicly available pre-trained models with SOTA performance on the AfriSpeech benchmark.
Citations: 0
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-09-01 DOI: 10.1162/tacl_a_00595
Xinyu Crystina Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, Jimmy Lin
Abstract: MIRACL is a multilingual dataset for ad hoc retrieval across 18 languages that collectively encompass over three billion native speakers around the world. This resource is designed to support monolingual retrieval tasks, where the queries and the corpora are in the same language. In total, we have gathered over 726k high-quality relevance judgments for 78k queries over Wikipedia in these languages, where all annotations have been performed by native speakers hired by our team. MIRACL covers languages that are both typologically close and distant, from 10 language families and 13 sub-families, associated with varying amounts of publicly available resources. Extensive automatic heuristic verification and manual assessments were performed during the annotation process to control data quality. In total, MIRACL represents an investment of around five person-years of human annotator effort. Our goal is to spur research on improving retrieval across a continuum of languages, thus enhancing information access capabilities for diverse populations around the world, particularly those that have traditionally been underserved. MIRACL is available at http://miracl.ai/.
Citations: 1
Shared Lexical Items as Triggers of Code Switching
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-08-29 DOI: 10.1162/tacl_a_00613
S. Wintner, Safaa Shehadi, Yuli Zeira, Doreen Osmelak, Yuval Nov
Abstract: Why do bilingual speakers code-switch (mix their two languages)? Among the several theories that attempt to explain this natural and ubiquitous phenomenon, the triggering hypothesis relates code-switching to the presence of lexical triggers, specifically cognates and proper names, adjacent to the switch point. We provide a fuller, more nuanced and refined exploration of the triggering hypothesis, based on five large datasets in three language pairs, reflecting both spoken and written bilingual interactions. Our results show that words that are assumed to reside in a mental lexicon shared by both languages indeed trigger code-switching, that the tendency to switch depends on the distance of the trigger from the switch point and on whether the trigger precedes or succeeds the switch, but not on the etymology of the trigger words. We thus provide strong, robust, evidence-based confirmation to several hypotheses on the relationships between lexical triggers and code-switching.
Citations: 0
Can Authorship Representation Learning Capture Stylistic Features?
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-08-22 DOI: 10.1162/tacl_a_00610
Andrew Wang, Cristina Aggazzotti, R. Kotula, Rafael A. Rivera Soto, M. Bishop, Nicholas Andrews
Abstract: Automatically disentangling an author’s style from the content of their writing is a longstanding and possibly insurmountable problem in computational linguistics. At the same time, the availability of large text corpora furnished with author labels has recently enabled learning authorship representations in a purely data-driven manner for authorship attribution, a task that ostensibly depends to a greater extent on encoding writing style than encoding content. However, success on this surrogate task does not ensure that such representations capture writing style since authorship could also be correlated with other latent variables, such as topic. In an effort to better understand the nature of the information these representations convey, and specifically to validate the hypothesis that they chiefly encode writing style, we systematically probe these representations through a series of targeted experiments. The results of these experiments suggest that representations learned for the surrogate authorship prediction task are indeed sensitive to writing style. As a consequence, authorship representations may be expected to be robust to certain kinds of data shift, such as topic drift over time. Additionally, our findings may open the door to downstream applications that require stylistic representations, such as style transfer.
Citations: 0
PaniniQA: Enhancing Patient Education Through Interactive Question Answering
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-08-07 DOI: 10.1162/tacl_a_00616
Pengshan Cai, Zonghai Yao, Fei Liu, Dakuo Wang, Meghan Reilly, Huixue Zhou, Lingxi Li, Yifan Cao, Alok Kapoor, Adarsha S. Bajracharya, D. Berlowitz, Hongfeng Yu
Abstract: A patient portal allows discharged patients to access their personalized discharge instructions in electronic health records (EHRs). However, many patients have difficulty understanding or memorizing their discharge instructions (Zhao et al., 2017). In this paper, we present PaniniQA, a patient-centric interactive question answering system designed to help patients understand their discharge instructions. PaniniQA first identifies important clinical content from patients’ discharge instructions and then formulates patient-specific educational questions. PaniniQA is also equipped with answer verification functionality to provide timely feedback to correct patients’ misunderstandings. Our comprehensive automatic and human evaluation results demonstrate that PaniniQA is capable of improving patients’ mastery of their medical instructions through effective interactions.
Citations: 0
Learning to Paraphrase Sentences to Different Complexity Levels
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-08-04 DOI: 10.1162/tacl_a_00606
Alison Chi, Li-Kuang Chen, Yi-Chen Chang, Shu-Hui Lee, Jason J. S. Chang
Abstract: While sentence simplification is an active research topic in NLP, its adjacent tasks of sentence complexification and same-level paraphrasing are not. To train models on all three tasks, we present two new unsupervised datasets. We compare these datasets, one labeled by a weak classifier and the other by a rule-based approach, with a single supervised dataset. Using these three datasets for training, we perform extensive experiments on both multitasking and prompting strategies. Compared to other systems trained on unsupervised parallel data, models trained on our weak classifier labeled dataset achieve state-of-the-art performance on the ASSET simplification benchmark. Our models also outperform previous work on sentence-level targeting. Finally, we establish how a handful of Large Language Models perform on these tasks under a zero-shot setting.
Citations: 0
Collective Human Opinions in Semantic Textual Similarity
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-08-01 DOI: 10.1162/tacl_a_00584
Yuxia Wang, Shimin Tao, Ning Xie, Hao Yang, Timothy Baldwin, K. Verspoor
Abstract: Despite the subjective nature of semantic textual similarity (STS) and pervasive disagreements in STS annotation, existing benchmarks have used averaged human ratings as gold standard. Averaging masks the true distribution of human opinions on examples of low agreement, and prevents models from capturing the semantic vagueness that the individual ratings represent. In this work, we introduce USTS, the first Uncertainty-aware STS dataset with ∼15,000 Chinese sentence pairs and 150,000 labels, to study collective human opinions in STS. Analysis reveals that neither a scalar nor a single Gaussian fits a set of observed judgments adequately. We further show that current STS models cannot capture the variance caused by human disagreement on individual instances, but rather reflect the predictive confidence over the aggregate dataset.
Citations: 2
Time-and-Space-Efficient Weighted Deduction
IF 10.9, Zone 1, Computer Science
Transactions of the Association for Computational Linguistics Pub Date: 2023-08-01 DOI: 10.1162/tacl_a_00588
Jason Eisner
Abstract: Many NLP algorithms have been described in terms of deduction systems. Unweighted deduction allows a generic forward-chaining execution strategy. For weighted deduction, however, efficient execution should propagate the weight of each item only after it has converged. This means visiting the items in topologically sorted order (as in dynamic programming). Toposorting is fast on a materialized graph; unfortunately, materializing the graph would take extra space. Is there a generic weighted deduction strategy which, for every acyclic deduction system and every input, uses only a constant factor more time and space than generic unweighted deduction? After reviewing past strategies, we answer this question in the affirmative by combining ideas of Goodman (1999) and Kahn (1962). We also give an extension to cyclic deduction systems, based on Tarjan (1972).
Citations: 1
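The abstract above says that weighted deduction should propagate an item's weight only after it has converged, which means visiting items in topologically sorted order. As a rough illustration only, the sketch below applies the textbook form of Kahn's algorithm to a fully materialized graph with sum-product weights. It is not the paper's strategy: the abstract states that materializing the graph costs extra space, which is exactly what the paper sets out to avoid. The function name `total_weights`, the `(source, target, weight)` edge format, and the restriction to single-antecedent rules are all assumptions made for this example.

```python
from collections import defaultdict, deque


def total_weights(axioms, edges):
    """Sum-product weights over an acyclic, materialized deduction graph.

    axioms: dict mapping an item to its initial weight (hypothetical format).
    edges:  iterable of (source, target, rule_weight) triples, i.e. rules
            with a single antecedent (a simplifying assumption).
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    items = set(axioms)
    for src, dst, w in edges:
        successors[src].append((dst, w))
        indegree[dst] += 1
        items.update((src, dst))

    weight = {item: axioms.get(item, 0.0) for item in items}
    # Items with no incoming edges already have their final weight.
    ready = deque(item for item in items if indegree[item] == 0)
    visited = 0
    while ready:
        item = ready.popleft()  # weight[item] has converged at this point
        visited += 1
        for dst, w in successors[item]:
            weight[dst] += weight[item] * w
            indegree[dst] -= 1
            if indegree[dst] == 0:  # all antecedents of dst have been processed
                ready.append(dst)

    if visited != len(items):
        raise ValueError("deduction graph has a cycle")
    return weight
```

For instance, with axiom `a` at weight 1.0 and edges a→b (0.5), a→c (0.5), b→d (1.0), c→d (1.0), item `d` is only dequeued after both `b` and `c`, so its weight of 1.0 is complete before anything is propagated from it.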