Knowledge-aware audio-grounded generative slot filling for limited annotated data

IF 3.1 · Zone 3, Computer Science · JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language · Pub Date: 2024-08-05 · DOI: 10.1016/j.csl.2024.101707

Guangzhi Sun, Chao Zhang, Ivan Vulić, Paweł Budzianowski, Philip C. Woodland

Volume 89, Article 101707 · Journal Article · https://www.sciencedirect.com/science/article/pii/S0885230824000901

Citations: 0

Abstract

Manually annotating fine-grained slot-value labels for task-oriented dialogue (ToD) systems is an expensive and time-consuming endeavour. This motivates research into slot-filling methods that operate with limited amounts of labelled data. Moreover, the majority of current work on ToD is based solely on text as the input modality, neglecting the additional challenges of imperfect automatic speech recognition (ASR) when working with spoken language. In this work, we propose a Knowledge-Aware Audio-Grounded generative slot filling framework, termed KA2G, that focuses on few-shot and zero-shot slot filling for ToD with speech input. KA2G achieves robust and data-efficient slot filling for speech-based ToD by (1) framing it as a text generation task, (2) grounding text generation additionally in the audio modality, and (3) conditioning on available external knowledge (e.g. a predefined list of possible slot values). We show that combining both modalities within the KA2G framework improves the robustness against ASR errors. Further, the knowledge-aware slot-value generator in KA2G, implemented via a pointer generator mechanism, particularly benefits few-shot and zero-shot learning. Experiments, conducted on the standard speech-based single-turn SLURP dataset and a multi-turn dataset extracted from a commercial ToD system, display strong and consistent gains over prior work, especially in few-shot and zero-shot setups.
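The abstract names a pointer generator mechanism as the way external knowledge (such as a predefined list of possible slot values) reaches the slot-value generator, but it does not give the details. As a rough illustration only, the sketch below shows the generic pointer-generator idea: a generation probability mixes an ordinary vocabulary distribution with a pointer distribution over tokens taken from the knowledge list. The function name `pointer_generator_mix`, its parameters, and the toy scores are all hypothetical and are not taken from the paper; this is not the KA2G implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_generator_mix(vocab, vocab_scores, knowledge_tokens,
                          pointer_scores, p_gen):
    """Mix a vocabulary distribution with a pointer distribution.

    Hypothetical illustration: `vocab_scores` stands in for decoder logits,
    `pointer_scores` for attention scores over tokens from a slot-value
    list, and `p_gen` for the probability of generating from the vocabulary
    rather than copying from the knowledge list.
    """
    if not 0.0 <= p_gen <= 1.0:
        raise ValueError("p_gen must lie in [0, 1]")
    if len(vocab) != len(vocab_scores):
        raise ValueError("vocab and vocab_scores differ in length")
    if len(knowledge_tokens) != len(pointer_scores):
        raise ValueError("knowledge_tokens and pointer_scores differ in length")

    mixed = {}
    # Generator part: p_gen * P_vocab(token)
    for token, prob in zip(vocab, softmax(vocab_scores)):
        mixed[token] = mixed.get(token, 0.0) + p_gen * prob
    # Pointer part: (1 - p_gen) * P_pointer(token); tokens outside the
    # vocabulary can still receive probability mass through this path.
    for token, prob in zip(knowledge_tokens, softmax(pointer_scores)):
        mixed[token] = mixed.get(token, 0.0) + (1.0 - p_gen) * prob
    return mixed


if __name__ == "__main__":
    # Toy example: "bossa_nova" is not in the vocabulary but is in the
    # knowledge list, so it is reachable only through the pointer path.
    distribution = pointer_generator_mix(
        vocab=["the", "play", "jazz"],
        vocab_scores=[0.0, 0.0, 0.0],
        knowledge_tokens=["jazz", "bossa_nova"],
        pointer_scores=[0.0, 0.0],
        p_gen=0.5,
    )
    for token, prob in sorted(distribution.items()):
        print(f"{token}: {prob:.3f}")
```

The point of the design is that a slot value never seen in training text can still be produced if it appears in the knowledge list, which is one plausible reason such a mechanism would help in few-shot and zero-shot setups.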


Source journal

Computer Speech and Language (Engineering and Technology, Computer Science: Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

About the journal: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.