BenchCouncil Transactions on Benchmarks, Standards and Evaluations: Latest Articles

A short summary of evaluatology: The science and engineering of evaluation
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2024-06-01 DOI: 10.1016/j.tbench.2024.100175
Jianfeng Zhan
Abstract: Evaluation is a crucial aspect of human existence and plays a vital role in every field. However, it is often approached in an empirical and ad-hoc manner, lacking consensus on universal concepts, terminologies, theories, and methodologies. This lack of agreement has significant consequences. This article formally introduces the discipline of evaluatology, which encompasses the science and engineering of evaluation. The science of evaluation addresses the fundamental question: "Does any evaluation outcome possess a true value?" The engineering of evaluation tackles the challenge of minimizing costs while satisfying the evaluation requirements of stakeholders. To address these challenges, we propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines, if not all.

This is a short summary of Evaluatology (Zhan et al., 2024). The objective of this revised version is to reduce the burden on readers caused by the length of the original text. Compared to the original version (Zhan et al., 2024), this revised edition clarifies concepts such as evaluation systems and conditions and streamlines the concept system by eliminating the evaluation model concept. It rectifies errors, rephrases fundamental evaluation issues, and incorporates a case study on CPU evaluation (Wang et al., 2024). For a more comprehensive understanding, please refer to the original article (Zhan et al., 2024). If you wish to cite this work, kindly cite the original article:

Jianfeng Zhan, Lei Wang, Wanling Gao, Hongxiao Li, Chenxi Wang, Yunyou Huang, Yatao Li, Zhengxin Yang, Guoxin Kang, Chunjie Luo, Hainan Ye, Shaopeng Dai, Zhifei Zhang (2024). Evaluatology: The science and engineering of evaluation. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 4(1), 100162.

Citations: 0
BinCodex: A comprehensive and multi-level dataset for evaluating binary code similarity detection techniques
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2024-06-01 DOI: 10.1016/j.tbench.2024.100163
Peihua Zhang, Chenggang Wu, Zhe Wang
Abstract: Binary code similarity detection (BCSD) techniques quantitatively measure the differences between two given binaries and produce matching results at a predefined granularity (e.g., function). They have been widely used in many scenarios, including software vulnerability search, security patch analysis, malware detection, and code clone detection. With the help of deep learning, BCSD techniques have achieved high accuracy in evaluations. On the one hand, however, these accuracy figures have become mutually indistinguishable due to the lack of a standard dataset, and thus fail to reveal the techniques' true abilities. On the other hand, since binary code can be easily changed, it is essential to gain a holistic understanding of the underlying transformations, including default optimization options, non-default optimization options, and commonly used code obfuscations, and to assess their impact on the accuracy and adaptability of BCSD techniques. This paper presents our observations regarding the diversity of BCSD datasets and proposes a comprehensive dataset for BCSD. We present detailed evaluation results of various BCSD works, applying different classifications for different types of BCSD tasks, including pure function pairing and vulnerable code detection. Our results show that most BCSD works cope well with default compiler options but are unsatisfactory when facing non-default compiler options and code obfuscation. We take a layered perspective on the BCSD task and point to opportunities for future optimization of the techniques we consider.

Citations: 0
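The pure function pairing task this abstract evaluates can be illustrated with a minimal, generic sketch. Modern deep-learning BCSD tools typically embed each binary function as a vector and match by similarity; the greedy top-1 cosine matching below is our illustrative assumption, not BinCodex's or any specific tool's method:

```python
import math

def cosine(u, v):
    """Cosine similarity between two function-embedding vectors."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def pair_functions(queries, candidates):
    """Pure function pairing: match each query embedding to its most
    similar candidate embedding (greedy top-1)."""
    return {qname: max(candidates, key=lambda c: cosine(qvec, candidates[c]))
            for qname, qvec in queries.items()}
```

A BCSD benchmark then scores how often the top-1 match is the true counterpart compiled under a different optimization level or obfuscation.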
TensorTable: Extending PyTorch for mixed relational and linear algebra pipelines
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2024-03-01 DOI: 10.1016/j.tbench.2024.100161
Xu Wen
Abstract: Mixed relational algebra (RA) and linear algebra (LA) pipelines have become increasingly common in recent years. However, contemporary widely used frameworks struggle to support both RA and LA operators effectively, failing to ensure optimal end-to-end performance due to the cost of LA operators and data conversion. This underscores the demand for a system capable of seamlessly integrating RA and LA while delivering robust end-to-end performance. This paper proposes TensorTable, a tensor system that extends PyTorch to enable mixed RA and LA pipelines. TensorTable serves as the unified data representation, storing data in a tensor format to prioritize the performance of LA operators and reduce data conversion costs. Relational tables from RA, as well as vectors, matrices, and tensors from LA, can be seamlessly converted into TensorTables. Additionally, we provide TensorTable-based implementations of RA operators and build a system that supports mixed LA and RA pipelines. We implement TensorTable on top of PyTorch, achieving comparable performance for both RA and LA operators, particularly on small datasets. TensorTable achieves a 1.15x-5.63x speedup on mixed pipelines compared with the state-of-the-art frameworks AIDA and RMA.

Citations: 0
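The core idea of a unified representation can be sketched in a few lines: store a relational table column-wise as numeric arrays, so a relational selection and a linear-algebra inner product operate on the same data with no conversion step. This is a toy illustration under our own naming, not TensorTable's actual PyTorch-based API:

```python
class UnifiedTable:
    """Toy unified RA/LA representation (illustrative, not TensorTable's API):
    columns are numeric arrays, shared by relational and algebraic operators."""

    def __init__(self, columns):
        self.columns = columns  # dict: column name -> list of floats

    def select(self, col, pred):
        """RA selection: keep rows where pred(value) is true."""
        mask = [pred(v) for v in self.columns[col]]
        return UnifiedTable({name: [v for v, keep in zip(c, mask) if keep]
                             for name, c in self.columns.items()})

    def dot(self, a, b):
        """LA inner product of two columns, with no table-to-tensor copy."""
        return sum(x * y for x, y in zip(self.columns[a], self.columns[b]))
```

For example, `t.select("qty", lambda q: q > 0).dot("price", "qty")` runs a filter followed by an inner product over one shared layout, which is the conversion cost the paper targets.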
Evaluatology: The science and engineering of evaluation
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2024-03-01 DOI: 10.1016/j.tbench.2024.100162
Jianfeng Zhan, Lei Wang, Wanling Gao, Hongxiao Li, Chenxi Wang, Yunyou Huang, Yatao Li, Zhengxin Yang, Guoxin Kang, Chunjie Luo, Hainan Ye, Shaopeng Dai, Zhifei Zhang
Abstract: Evaluation is a crucial aspect of human existence and plays a vital role in every field. However, it is often approached in an empirical and ad-hoc manner, lacking consensus on universal concepts, terminologies, theories, and methodologies. This lack of agreement has significant consequences. This article formally introduces the discipline of evaluatology, which encompasses the science and engineering of evaluation. We propose a universal framework for evaluation, encompassing concepts, terminologies, theories, and methodologies that can be applied across various disciplines, if not all.

Our research reveals that the essence of evaluation lies in conducting experiments that intentionally apply a well-defined evaluation condition to the individuals or systems under scrutiny, which we refer to as the subjects. This process allows for the creation of an evaluation system or model. By measuring and/or testing this evaluation system or model, we can infer the impact of different subjects. Derived from the essence of evaluation, we propose five axioms focusing on key aspects of evaluation outcomes as the foundational evaluation theory. These axioms serve as the bedrock upon which we build universal evaluation theories and methodologies. When evaluating a single subject, it is crucial to create evaluation conditions with different levels of equivalency. By applying these conditions to diverse subjects, we can establish reference evaluation models. These models allow us to alter a single independent variable at a time while keeping all other variables as controls. When evaluating complex scenarios, the key lies in establishing a series of evaluation models that maintain transitivity. Building upon the science of evaluation, we propose a formal definition of a benchmark as a simplified and sampled evaluation condition that guarantees different levels of equivalency. This concept serves as the cornerstone of a universal benchmark-based engineering approach to evaluation across various disciplines, which we refer to as benchmarkology.

Citations: 0
An approach to workload generation for modern data centers: A view from Alibaba trace
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2024-03-01 DOI: 10.1016/j.tbench.2024.100164
Yi Liang, Nianyi Ruan, Lan Yi, Xing Su
Abstract: Modern data centers provide the foundational infrastructure of cloud computing. Workload generation, which involves simulating or constructing tasks and transactions to replicate the actual resource usage patterns of real-world systems or applications, plays an essential role in efficient resource management in these centers. Data center traces, rich in information about workload execution and resource utilization, are thus ideal data for workload generation. Traditional traces provide detailed temporal resource usage data that enables fine-grained workload generation, but modern data centers tend to favor tracing statistical metrics to reduce overhead. Accurately reconstructing temporal resource consumption without detailed, time-stamped trace information therefore becomes a major challenge for trace-based workload generation. To address this challenge, we propose STWGEN, a novel method that leverages statistical trace data for workload generation. STWGEN is specifically designed to generate batch task workloads based on the Alibaba trace. It contains two key components: a suite of flexible, C-program-based workload building blocks, and a heuristic strategy for assembling the building blocks into workloads. Both components are carefully designed to produce synthetic batch tasks that closely replicate the resource usage patterns observed in a representative data center. Experimental results demonstrate that STWGEN outperforms state-of-the-art workload generation methods, emulating workload-level and machine-level resource usage with much higher accuracy.

Citations: 0
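The "assemble building blocks to match trace statistics" idea can be sketched abstractly. The greedy criterion below is our own illustrative assumption (the paper's heuristic is more elaborate): repeatedly pick the building block whose addition keeps the running mean of a resource metric closest to the statistic recorded in the trace:

```python
def assemble(blocks, target_mean, n):
    """Greedy assembly sketch (illustrative, not STWGEN's exact heuristic):
    choose n building blocks, each identified by its CPU-usage level, so that
    the running mean usage tracks a target statistic from the trace."""
    chosen = []
    for _ in range(n):
        best = min(blocks,
                   key=lambda b: abs((sum(chosen) + b) / (len(chosen) + 1)
                                     - target_mean))
        chosen.append(best)
    return chosen
```

With blocks at 0.2, 0.5, and 0.9 CPU utilization and a trace mean of 0.5, the sketch converges on the 0.5 block; a real generator would match several statistics (mean, variance, quantiles) jointly.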
A pluggable single-image super-resolution algorithm based on second-order gradient loss
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2023.100148
Shuran Lin, Chunjie Zhang, Yanwu Yang
Abstract: Convolutional neural networks for single-image super-resolution have been widely used with great success. However, most of these methods use L1 loss to guide network optimization, resulting in blurry restored images whose sharp edges are smoothed away. This is because L1 loss limits the optimization goal of the network to the statistical average of all solutions within the solution space of the task. To go beyond the L1 loss, this paper designs an image super-resolution algorithm based on a second-order gradient loss. We impose additional constraints at the high-order gradient level of the image so that the network can focus on recovering fine details such as texture during learning, which alleviates, to some extent, the over-smoothing of texture in restored images. During training, we extract the second-order gradient maps of the generated image and the target image and minimize the distance between them; this guides the network to attend to high-frequency detail and to generate high-resolution images with clearer edges and texture. Moreover, the proposed loss function is easily embeddable and can be integrated with existing image super-resolution networks. Experimental results show that the second-order gradient loss significantly improves both Learned Perceptual Image Patch Similarity (LPIPS) and Frechet Inception Distance (FID) scores over other image super-resolution deep learning models.

Citations: 0
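The training-time recipe in the abstract (extract second-order gradient maps of both images, then minimize the distance between them) can be sketched with one common discretization, a Laplacian-style finite difference; the paper's exact gradient operator and distance may differ:

```python
def second_order_gradient(img):
    """Discrete second-order gradient map of a 2D image (list of rows),
    using the finite difference 4*x - left - right - up - down
    (a Laplacian-style operator; one possible discretization)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            out[i][j] = (4 * img[i][j] - img[i - 1][j] - img[i + 1][j]
                         - img[i][j - 1] - img[i][j + 1])
    return out

def sog_loss(pred, target):
    """Second-order gradient loss: mean absolute distance between the
    second-order gradient maps of the generated and target images."""
    gp, gt = second_order_gradient(pred), second_order_gradient(target)
    n = len(gp) * len(gp[0])
    return sum(abs(a - b)
               for rp, rt in zip(gp, gt)
               for a, b in zip(rp, rt)) / n
```

In practice this term is added to the base reconstruction loss (e.g., L1) with a weighting factor, which is what makes it pluggable into existing super-resolution networks.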
CloudAISim: A toolkit for modelling and simulation of modern applications in AI-driven cloud computing environments
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100150
Abhimanyu Bhowmik, Madhushree Sannigrahi, Deepraj Chowdhury, Ajoy Dey, Sukhpal Singh Gill
Abstract: There is a significant knowledge gap between Artificial Intelligence (AI) and the multitude of industries in today's world, primarily attributable to the limited availability of resources and technical expertise. A major obstacle is that AI needs to be flexible enough to work in many different applications, utilising a wide variety of datasets through cloud computing. We therefore developed a benchmark toolkit called CloudAISim to harness the power of AI and cloud computing to satisfy the requirements of modern applications. The goal of this study is to devise a strategy for building a bridge so that AI can assist those who are not very knowledgeable about technological advancements. In addition, we modelled a healthcare application as a case study to verify the scientific reliability of the CloudAISim toolkit and simulated it in a cloud computing environment using Google Cloud Functions to increase its real-time efficiency. A non-expert-friendly interface built as an interactive web app has also been developed; any user without technical knowledge can operate the entire model, which has a 98% accuracy rate. The proposed use case is designed to put AI to work in the healthcare industry, but CloudAISim would be useful and adaptable for other applications in the future.

Citations: 0
Characterizing and understanding deep neural network batching systems on GPUs
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100151
Feng Yu, Hao Zhang, Ao Chen, Xueying Wang, Xiaoxia Liang, Sheng Wang, Guangli Li, Huimin Cui, Xiaobing Feng
Abstract: As neural network inference demands keep increasing in intelligent applications, optimizing the performance of model serving becomes a challenging problem. Dynamic batching is an important feature of contemporary deep learning serving systems: it combines multiple inference requests and executes them together to improve system throughput. However, the behavior of each part of a deep neural network batching system, and its performance impact on different model structures, remains unexplored. In this paper, we characterize the batching system using three representative deep neural networks on GPUs, performing a systematic analysis of the performance effects of the request batching module, the model slicing module, and the stage reorchestrating module. Based on the experimental results, several insights and recommendations are offered to facilitate system design and optimization for deep learning serving.

Citations: 0
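Why request batching raises throughput can be shown with a toy cost model (our simplification, not the paper's measurement methodology): each GPU launch pays a fixed setup cost plus a small per-request cost, so grouping requests amortizes the setup cost across the batch:

```python
def serve(requests, max_batch, setup_cost=1.0, per_item_cost=0.1):
    """Toy dynamic-batching cost model (illustrative assumption): split the
    pending requests into batches of at most max_batch, where each batch
    costs a fixed launch overhead plus a per-request increment."""
    batches = [requests[i:i + max_batch]
               for i in range(0, len(requests), max_batch)]
    total_cost = sum(setup_cost + per_item_cost * len(b) for b in batches)
    return batches, total_cost
```

Serving eight requests one at a time costs 8 launches; batching four at a time needs only 2, so total cost drops even though per-request work is unchanged. The real trade-off the paper studies adds queueing latency and per-model batching behavior on top of this.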
Benchmarking ChatGPT for prototyping theories: Experimental studies using the technology acceptance model
BenchCouncil Transactions on Benchmarks, Standards and Evaluations Pub Date : 2023-12-01 DOI: 10.1016/j.tbench.2024.100153
Tiong-Thye Goh, Xin Dai, Yanwu Yang
Abstract: This paper explores the paradigm of leveraging ChatGPT as a benchmark tool for theory prototyping in conceptual research. Specifically, we conducted two experimental studies using the classical technology acceptance model (TAM) to demonstrate and evaluate ChatGPT's capability to comprehend theoretical concepts, discriminate between constructs, and generate meaningful responses. Results of the two studies indicate that ChatGPT can generate responses aligned with TAM theory and constructs. Key metrics of the measurement model, including factor loadings, internal consistency reliability, and convergent reliability, surpass the minimum thresholds, confirming the validity of the TAM constructs, and the supported hypotheses provide evidence for their nomological validity. However, both studies show a high Heterotrait-Monotrait ratio of correlations (HTMT) among TAM constructs, raising a concern about discriminant validity. Furthermore, high duplicated-response rates were identified, and potential biases regarding gender, usage experience, perceived usefulness, and behavioural intention were revealed in ChatGPT-generated samples. Additional effort is therefore needed in LLM research to address duplicated responses, the strength of discriminant validity, the impact of prompt design, and the generalizability of findings across contexts.

Citations: 0