Statistical Validity of Neural-Net Benchmarks

Alain Hadges;Srikar Bellur
{"title":"Statistical Validity of Neural-Net Benchmarks","authors":"Alain Hadges;Srikar Bellur","doi":"10.1109/OJCS.2024.3523183","DOIUrl":null,"url":null,"abstract":"Claims of better, faster or more efficient neural-net designs often hinge on low single digit percentage improvements (or less) in accuracy or speed compared to others. Current benchmark differences used for comparison have been based on a number of different metrics such as recall, the best of five-runs, the median of five runs, Top-1, Top-5, BLEU, ROC, RMS, etc. These metrics implicitly assert comparable distributions of metrics. Conspicuous by their absence are measures of statistical validity of these benchmark comparisons. This study examined neural-net benchmark metric distributions and determined there are researcher degrees of freedom that may affect comparison validity. An essay is developed and proposed for benchmarking and comparing reasonably expected neural-net performance metrics that minimizes researcher degrees of freedom. The essay includes an estimate of the effects and the interactions of hyper-parameter settings on the benchmark metrics of a neural-net as a measure of its optimization complexity.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"6 ","pages":"211-222"},"PeriodicalIF":0.0000,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10816528","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10816528/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Claims of better, faster or more efficient neural-net designs often hinge on low single digit percentage improvements (or less) in accuracy or speed compared to others. Current benchmark differences used for comparison have been based on a number of different metrics such as recall, the best of five-runs, the median of five runs, Top-1, Top-5, BLEU, ROC, RMS, etc. These metrics implicitly assert comparable distributions of metrics. Conspicuous by their absence are measures of statistical validity of these benchmark comparisons. This study examined neural-net benchmark metric distributions and determined there are researcher degrees of freedom that may affect comparison validity. An essay is developed and proposed for benchmarking and comparing reasonably expected neural-net performance metrics that minimizes researcher degrees of freedom. The essay includes an estimate of the effects and the interactions of hyper-parameter settings on the benchmark metrics of a neural-net as a measure of its optimization complexity.
神经网络基准的统计有效性
声称更好、更快或更高效的神经网络设计往往取决于与其他设计相比,在准确性或速度方面的低个位数百分比改进(或更少)。目前用于比较的基准差异是基于许多不同的指标,如召回率、五次运行的最佳结果、五次运行的中位数、Top-1、Top-5、BLEU、ROC、RMS等。这些指标隐含地断言了指标的可比较分布。值得注意的是,缺乏这些基准比较的统计有效性度量。本研究检验了神经网络基准度量分布,并确定存在可能影响比较效度的研究者自由度。开发并提出了一篇论文,用于基准测试和比较合理预期的神经网络性能指标,以最大限度地减少研究人员的自由度。本文包括对神经网络基准度量的超参数设置的影响和相互作用的估计,作为其优化复杂性的度量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
12.60
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信