Product Answer Generation from Heterogeneous Sources: A New Benchmark and Best Practices

Xiaoyu Shen, Gianni Barlacchi, Marco Del Tredici, Weiwei Cheng, B. Byrne, A. Gispert
Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5)
DOI: 10.18653/v1/2022.ecnlp-1.13 (https://doi.org/10.18653/v1/2022.ecnlp-1.13)
Citations: 10

Abstract

Product Answer Generation from Heterogeneous Sources: A New Benchmark and Best Practices
It is of great value to answer product questions based on heterogeneous information sources available on web product pages, e.g., semi-structured attributes, text descriptions, user-provided contents, etc. However, these sources have different structures and writing styles, which poses challenges for (1) evidence ranking, (2) source selection, and (3) answer generation. In this paper, we build a benchmark with annotations for both evidence selection and answer generation covering 6 information sources. Based on this benchmark, we conduct a comprehensive study and present a set of best practices. We show that all sources are important and contribute to answering questions. Handling all sources within one single model can produce comparable confidence scores across sources, and combining multiple sources for training always helps, even for sources with totally different structures. We further propose a novel data augmentation method to iteratively create training samples for answer generation, which achieves close-to-human performance with only a few thousand annotations. Finally, we perform an in-depth error analysis of model predictions and highlight the challenges for future research.
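
The abstract's point that one model can give comparable confidence scores across sources can be illustrated with a small sketch. This is not the authors' implementation: the `Evidence` type, the source labels, and the token-overlap scorer are all hypothetical stand-ins for a learned ranker. The sketch only shows the shape of the idea: when every candidate is scored by the same function, evidence ranking and source selection reduce to a single joint sort.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str  # hypothetical source label, e.g. "attribute", "description", "review"
    text: str


def overlap_score(question: str, evidence: Evidence) -> float:
    """Toy stand-in for a learned ranker: fraction of question tokens found in the evidence."""
    q_tokens = set(question.lower().split())
    e_tokens = set(evidence.text.lower().split())
    return len(q_tokens & e_tokens) / len(q_tokens) if q_tokens else 0.0


def rank_evidence(question: str, candidates: list[Evidence]) -> list[tuple[float, Evidence]]:
    """Score candidates from every source with the same function, then sort them jointly."""
    scored = [(overlap_score(question, ev), ev) for ev in candidates]
    # Sort on the score only; scores are comparable because one scorer produced them all.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Illustrative candidates drawn from three differently structured sources.
candidates = [
    Evidence("attribute", "battery life : 10 hours"),
    Evidence("description", "a lightweight laptop with a bright screen"),
    Evidence("review", "the screen is great but the fan is loud"),
]
ranked = rank_evidence("how many hours of battery life", candidates)
best_score, best = ranked[0]
print(best.source, best_score)
```

Under this assumption, the source of the top-ranked candidate is the selected source, so no separate source classifier is needed. A per-source scorer would instead require calibrating scores before they could be compared.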