The Consistency between Popular Generative Artificial Intelligence (AI) Robots in Evaluating the User Experience of Mobile Device Operating Systems

Victor K Y Chan
{"title":"The Consistency between Popular Generative Artificial Intelligence (AI) Robots in Evaluating the User Experience of Mobile Device Operating Systems","authors":"Victor K Y Chan","doi":"10.54941/ahfe1004193","DOIUrl":null,"url":null,"abstract":"This article attempts to study the consistency, among other auxiliary comparisons, between popular generative artificial intelligence (AI) robots in the evaluation of various perceived user experience dimensions of mobile device operating system versions or, more specifically, iOS and Android versions. A handful of robots were experimented with, ending up with Dragonfly and GPT-4 being the only two eligible for in-depth investigation where the duo was individually requested to accord rating scores to the six major dimensions, namely (1) efficiency, (2) effectiveness, (3) learnability, (4) satisfaction, (5) accessibility, and (6) security, of the operating system versions. It is noteworthy that these dimensions are from the perceived user experience’s point of view instead of any “physical” technology’s standpoint. For each of the two robots, the minimum, the maximum, the range, and the standard deviation of the rating scores for each of the six dimensions were computed across all the versions. The rating score difference for each of the six dimensions between the two robots was calculated for each version. The mean of the absolute value, the minimum, the maximum, the range, and the standard deviation of the differences for each dimension between the two robots were calculated across all versions. A paired sample t-test was then applied to each dimension for the rating score differences between the two robots over all the versions. Finally, a correlation coefficient of the rating scores was computed for each dimension between the two robots across all the versions. 
These computational outcomes were to confirm whether the two robots awarded discrimination in evaluating each dimension across the versions, whether any of the two robots systematically underrated or overrated any dimension vis-à-vis the other robot, and whether there was consistency between the two robots in evaluating each dimension across the versions. It was found that discrimination was apparent in the evaluation of all dimensions, GPT-4 systematically underrated the dimensions satisfaction (p = 0.002 < 0.05) and security (p = 0.008 < 0.05) compared with Dragonfly, and the evaluation by the two robots was almost impeccably consistent for the six dimensions with the correlation coefficients ranging from 0.679 to 0.892 (p from 0.000 to 0.003 < 0.05). Consistency implies at least the partial trustworthiness of the evaluation of these mobile device operating system versions by either of these two popular generative AI robots based on the analogous concept of convergent validity.","PeriodicalId":470195,"journal":{"name":"AHFE international","volume":"121 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AHFE international","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.54941/ahfe1004193","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This article attempts to study the consistency, among other auxiliary comparisons, between popular generative artificial intelligence (AI) robots in evaluating various perceived user experience dimensions of mobile device operating system versions, specifically iOS and Android versions. A handful of robots were tried out, of which Dragonfly and GPT-4 were the only two eligible for in-depth investigation; each was asked to assign rating scores to the six major dimensions of the operating system versions, namely (1) efficiency, (2) effectiveness, (3) learnability, (4) satisfaction, (5) accessibility, and (6) security. It is noteworthy that these dimensions are framed from the perceived user experience's point of view rather than from any "physical" technology standpoint. For each of the two robots, the minimum, maximum, range, and standard deviation of the rating scores for each of the six dimensions were computed across all the versions. The rating score difference between the two robots was then calculated for each dimension and each version, and the mean absolute value, minimum, maximum, range, and standard deviation of these differences were computed across all versions. A paired-sample t-test was applied to each dimension's rating score differences between the two robots over all the versions. Finally, a correlation coefficient of the rating scores between the two robots was computed for each dimension across all the versions. These computational outcomes were intended to confirm whether the two robots exhibited discrimination in evaluating each dimension across the versions, whether either robot systematically underrated or overrated any dimension vis-à-vis the other, and whether the two robots were consistent in evaluating each dimension across the versions.
It was found that discrimination was apparent in the evaluation of all dimensions; that GPT-4 systematically underrated the dimensions satisfaction (p = 0.002 < 0.05) and security (p = 0.008 < 0.05) compared with Dragonfly; and that the evaluations by the two robots were highly consistent across the six dimensions, with correlation coefficients ranging from 0.679 to 0.892 (p ranging from < 0.001 to 0.003, all < 0.05). This consistency implies at least partial trustworthiness of the evaluation of these mobile device operating system versions by either of these two popular generative AI robots, based on the analogous concept of convergent validity.
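The analysis pipeline described in the abstract — per-dimension score differences, a paired-sample t-test for systematic under- or over-rating, and a correlation coefficient for consistency — can be sketched as follows. The rating scores, version count, and variable names below are made-up illustrative assumptions, not the study's data; the statistical calls use SciPy's `ttest_rel` and `pearsonr`.

```python
# Sketch of the abstract's analysis for ONE dimension (e.g. "satisfaction"),
# using hypothetical 1-10 rating scores across ten imagined OS versions.
from scipy import stats

# Hypothetical per-version ratings from each AI robot (illustrative only).
dragonfly = [8, 7, 9, 8, 7, 8, 9, 8, 7, 9]
gpt4      = [7, 6, 8, 7, 6, 7, 8, 8, 6, 8]

# Per-version score differences and their mean absolute value.
diffs = [a - b for a, b in zip(dragonfly, gpt4)]
mean_abs_diff = sum(abs(d) for d in diffs) / len(diffs)

# Paired-sample t-test: does one robot systematically rate lower than the other?
t_stat, p_paired = stats.ttest_rel(dragonfly, gpt4)

# Pearson correlation: do the two robots rank the versions consistently?
r, p_corr = stats.pearsonr(dragonfly, gpt4)

print(f"mean |diff| = {mean_abs_diff:.2f}")
print(f"paired t = {t_stat:.2f}, p = {p_paired:.4f}")
print(f"r = {r:.3f}, p = {p_corr:.4f}")
```

A significant paired t-test paired with a high, significant correlation would reproduce the abstract's pattern: a systematic offset between the robots, yet consistent relative ordering of the versions.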