Product Answer Generation from Heterogeneous Sources: A New Benchmark and Best Practices

Proceedings of The Fifth Workshop on e-Commerce and NLP (ECNLP 5) · Pub Date: 1900-01-01 · DOI: 10.18653/v1/2022.ecnlp-1.13

Xiaoyu Shen, Gianni Barlacchi, Marco Del Tredici, Weiwei Cheng, B. Byrne, A. Gispert


Citations: 10

Abstract


Product Answer Generation from Heterogeneous Sources: A New Benchmark and Best Practices

It is of great value to answer product questions based on the heterogeneous information sources available on web product pages, e.g., semi-structured attributes, text descriptions, and user-provided content. However, these sources have different structures and writing styles, which poses challenges for (1) evidence ranking, (2) source selection, and (3) answer generation. In this paper, we build a benchmark with annotations for both evidence selection and answer generation covering 6 information sources. Based on this benchmark, we conduct a comprehensive study and present a set of best practices. We show that all sources are important and contribute to answering questions. Handling all sources within a single model can produce comparable confidence scores across sources, and combining multiple sources for training always helps, even for sources with totally different structures. We further propose a novel data augmentation method to iteratively create training samples for answer generation, which achieves close-to-human performance with only a few thousand annotations. Finally, we perform an in-depth error analysis of model predictions and highlight the challenges for future research.
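The point about scoring all sources with one model can be illustrated with a toy sketch. This is not the authors' method: the `Evidence` class, the source labels, and the Jaccard token-overlap scorer are all hypothetical stand-ins for a learned ranker. It only shows why a single shared scoring function makes scores comparable, so that evidence ranking and source selection reduce to one sort over the pooled candidates.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str  # hypothetical source label, e.g. "attribute", "description", "review"
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    return {tok.strip(".,:;?!()").lower() for tok in text.split()} - {""}


def score(question: str, evidence: Evidence) -> float:
    """Toy stand-in for a learned ranker: Jaccard overlap of token sets.

    The same function scores every source, so the resulting numbers
    live on one scale and can be compared across sources directly.
    """
    q, e = tokenize(question), tokenize(evidence.text)
    if not q or not e:
        return 0.0
    return len(q & e) / len(q | e)


def rank_evidence(question: str, pool: list[Evidence]) -> list[tuple[float, Evidence]]:
    """Score the pooled evidence from all sources and sort best-first."""
    scored = [(score(question, ev), ev) for ev in pool]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    pool = [
        Evidence("description", "This lightweight laptop has a bright display."),
        Evidence("attribute", "battery life: 10 hours"),
        Evidence("review", "The battery lasts all day"),
    ]
    for s, ev in rank_evidence("What is the battery life?", pool):
        print(f"{s:.3f}  [{ev.source}]  {ev.text}")
```

With per-source scorers, the top score from one source would not be directly comparable to the top score from another; sharing one scorer avoids a separate calibration step before source selection.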

