对话式人工智能的基准、道德一致性和评估框架:推进ChatGPT的负责任发展

BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2023-09-01 DOI:10.1016/j.tbench.2023.100136

Partha Pratim Ray

{"title":"对话式人工智能的基准、道德一致性和评估框架:推进ChatGPT的负责任发展","authors":"Partha Pratim Ray","doi":"10.1016/j.tbench.2023.100136","DOIUrl":null,"url":null,"abstract":"<div><p>Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interactions. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically for ChatGPT. We meticulously analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM and MMLU illuminating their strengths and limitations. This paper also scrutinizes the existing standards set by OpenAI, IEEE’s Ethically Aligned Design, the Montreal Declaration, and Partnership on AI’s Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods like BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility analysis, SWOT analysis and adaptability analysis of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"3 3","pages":"Article 100136"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT\",\"authors\":\"Partha Pratim Ray\",\"doi\":\"10.1016/j.tbench.2023.100136\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interactions. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically for ChatGPT. We meticulously analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM and MMLU illuminating their strengths and limitations. This paper also scrutinizes the existing standards set by OpenAI, IEEE’s Ethically Aligned Design, the Montreal Declaration, and Partnership on AI’s Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods like BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility analysis, SWOT analysis and adaptability analysis of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.</p></div>\",\"PeriodicalId\":100155,\"journal\":{\"name\":\"BenchCouncil Transactions on Benchmarks, Standards and Evaluations\",\"volume\":\"3 3\",\"pages\":\"Article 100136\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BenchCouncil Transactions on Benchmarks, Standards and Evaluations\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2772485923000534\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772485923000534","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

近年来，像ChatGPT这样的对话式人工智能系统取得了显著进步，彻底改变了人机交互。然而，评估这些系统的性能和道德影响仍然是一项挑战。本文深入探讨了创建严格的基准、适应性标准和专门为ChatGPT量身定制的智能评估方法。我们仔细分析了几个突出的基准，包括GLUE、SuperGLUE、SQuAD、CoQA、Persona Chat、DSTC、BIG Bench、HELM和MMLU，阐明了它们的优势和局限性。本文还仔细审查了OpenAI、IEEE的道德一致设计、蒙特利尔宣言和人工智能信条伙伴关系制定的现有标准，调查了它们与ChatGPT的相关性。此外，我们提出了适应性标准，包括伦理考虑、环境适应性和社区参与。在评估方面，我们探索了传统的方法，如BLEU、ROUGE、METEOR、精确回忆、F1分数、困惑和用户反馈，同时还提出了一种利用强化学习力量的新评估方法。我们提出的评估框架是多层面的，包括特定任务、现实世界的应用程序和多回合对话基准。我们对所提出的框架进行了可行性分析、SWOT分析和适应性分析。该框架强调了用户反馈的重要性，将其与主观评估和互动评估会议一起作为评估的核心组成部分。通过整合这些元素，本文有助于开发一个全面的评估框架，促进对话人工智能领域负责任和有影响力的发展。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT

Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interactions. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically for ChatGPT. We meticulously analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM and MMLU illuminating their strengths and limitations. This paper also scrutinizes the existing standards set by OpenAI, IEEE’s Ethically Aligned Design, the Montreal Declaration, and Partnership on AI’s Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods like BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility analysis, SWOT analysis and adaptability analysis of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

BenchCouncil Transactions on Benchmarks, Standards and Evaluations

CiteScore

4.80

自引率

0.00%

发文量