CATS: A Pragmatic Chinese Answer-to-Sequence Dataset with Large Scale and High Quality

Annual Meeting of the Association for Computational Linguistics. Pub Date: 2023-06-20. DOI: 10.48550/arXiv.2306.11477

Liang Li, Ruiying Geng, Chengyang Fang, Bing Li, Can Ma, Rongyu Cao, Binhua Li, Fei Huang, Yongbin Li


Abstract

Popular data-to-text datasets have three problems. First, large-scale datasets either contain noise or lack real application scenarios. Second, datasets close to real applications are relatively small. Third, current datasets are biased toward English, leaving other languages underexplored. To alleviate these limitations, we present CATS, a pragmatic, large-scale, high-quality Chinese answer-to-sequence dataset. The dataset targets the generation of textual descriptions for the answers returned by a practical TableQA system. Further, to bridge the structural gap between the input SQL and table and to establish better semantic alignments, we propose a Unified Graph Transformation approach that establishes a joint encoding space for the two hybrid knowledge resources and converts the task into a graph-to-text problem. Experimental results demonstrate the effectiveness of the proposed method. Further analysis of CATS attests to both the high quality and the challenges of the dataset.
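
The idea of placing the SQL query and the table in one joint graph can be illustrated with a minimal sketch. This is not the authors' Unified Graph Transformation, whose details the abstract does not give. It only shows one plausible way to merge the two inputs into a single graph that a graph-to-text model could consume. The `Graph` class, the `build_joint_graph` function, the pre-parsed `sql` dictionary layout, and the `[SQL]` and `[TABLE]` root labels are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """A tiny directed graph: labelled nodes plus (source, target) index pairs."""
    nodes: list = field(default_factory=list)
    edges: set = field(default_factory=set)

    def add_node(self, label):
        self.nodes.append(label)
        return len(self.nodes) - 1

    def add_edge(self, src, dst):
        self.edges.add((src, dst))


def build_joint_graph(sql, header, rows):
    """Merge a pre-parsed SQL query and a result table into one graph.

    Assumption: `sql` is a dict with "select" (column names) and
    "where" (triples of column, operator, value). This layout is
    illustrative and not taken from the paper.
    """
    graph = Graph()
    sql_root = graph.add_node("[SQL]")
    table_root = graph.add_node("[TABLE]")

    # Table side: one node per column header, one node per cell under its header.
    header_ids = {}
    for name in header:
        header_ids[name] = graph.add_node(name)
        graph.add_edge(table_root, header_ids[name])
    for row in rows:
        for name, cell in zip(header, row):
            cell_id = graph.add_node(str(cell))
            graph.add_edge(header_ids[name], cell_id)

    # SQL side: selected columns and WHERE conditions hang off the SQL root.
    # A SQL node is linked to the table header with the same column name,
    # which is what ties the two structures into a single graph.
    for name in sql["select"]:
        col_id = graph.add_node(name)
        graph.add_edge(sql_root, col_id)
        if name in header_ids:
            graph.add_edge(col_id, header_ids[name])
    for name, op, value in sql["where"]:
        cond_id = graph.add_node(f"{name} {op} {value}")
        graph.add_edge(sql_root, cond_id)
        if name in header_ids:
            graph.add_edge(cond_id, header_ids[name])
    return graph


# Toy input: "SELECT city ... WHERE population > 1000000" and its answer table.
sql = {"select": ["city"], "where": [("population", ">", 1000000)]}
header = ["city"]
rows = [["Beijing"], ["Shanghai"]]
graph = build_joint_graph(sql, header, rows)
```

In this toy case the only cross-structure edge runs from the SQL `city` node to the table `city` header. The `population` condition stays attached to the SQL root alone, because that column does not appear in the answer table.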
