{"title":"对话式人工智能的基准、道德一致性和评估框架:推进ChatGPT的负责任发展","authors":"Partha Pratim Ray","doi":"10.1016/j.tbench.2023.100136","DOIUrl":null,"url":null,"abstract":"<div><p>Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interactions. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically for ChatGPT. We meticulously analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM and MMLU illuminating their strengths and limitations. This paper also scrutinizes the existing standards set by OpenAI, IEEE’s Ethically Aligned Design, the Montreal Declaration, and Partnership on AI’s Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods like BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility analysis, SWOT analysis and adaptability analysis of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.</p></div>","PeriodicalId":100155,"journal":{"name":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","volume":"3 3","pages":"Article 100136"},"PeriodicalIF":0.0000,"publicationDate":"2023-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT\",\"authors\":\"Partha Pratim Ray\",\"doi\":\"10.1016/j.tbench.2023.100136\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interactions. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically for ChatGPT. We meticulously analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM and MMLU illuminating their strengths and limitations. This paper also scrutinizes the existing standards set by OpenAI, IEEE’s Ethically Aligned Design, the Montreal Declaration, and Partnership on AI’s Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods like BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility analysis, SWOT analysis and adaptability analysis of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.</p></div>\",\"PeriodicalId\":100155,\"journal\":{\"name\":\"BenchCouncil Transactions on Benchmarks, Standards and Evaluations\",\"volume\":\"3 3\",\"pages\":\"Article 100136\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BenchCouncil Transactions on Benchmarks, Standards and Evaluations\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2772485923000534\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BenchCouncil Transactions on Benchmarks, Standards and Evaluations","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772485923000534","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Benchmarking, ethical alignment, and evaluation framework for conversational AI: Advancing responsible development of ChatGPT
Conversational AI systems like ChatGPT have seen remarkable advancements in recent years, revolutionizing human–computer interactions. However, evaluating the performance and ethical implications of these systems remains a challenge. This paper delves into the creation of rigorous benchmarks, adaptable standards, and an intelligent evaluation methodology tailored specifically for ChatGPT. We meticulously analyze several prominent benchmarks, including GLUE, SuperGLUE, SQuAD, CoQA, Persona-Chat, DSTC, BIG-Bench, HELM and MMLU illuminating their strengths and limitations. This paper also scrutinizes the existing standards set by OpenAI, IEEE’s Ethically Aligned Design, the Montreal Declaration, and Partnership on AI’s Tenets, investigating their relevance to ChatGPT. Further, we propose adaptive standards that encapsulate ethical considerations, context adaptability, and community involvement. In terms of evaluation, we explore traditional methods like BLEU, ROUGE, METEOR, precision–recall, F1 score, perplexity, and user feedback, while also proposing a novel evaluation approach that harnesses the power of reinforcement learning. Our proposed evaluation framework is multidimensional, incorporating task-specific, real-world application, and multi-turn dialogue benchmarks. We perform feasibility analysis, SWOT analysis and adaptability analysis of the proposed framework. The framework highlights the significance of user feedback, integrating it as a core component of evaluation alongside subjective assessments and interactive evaluation sessions. By amalgamating these elements, this paper contributes to the development of a comprehensive evaluation framework that fosters responsible and impactful advancement in the field of conversational AI.