A comparative performance and efficiency analysis of Apple’s M architectures: A GEMM case study

IF 6.2 · Zone 2, Computer Science · Q1 COMPUTER SCIENCE, THEORY & METHODS
Sandra Catalán, Rafael Rodríguez-Sánchez, Carlos García Sánchez, Luis Piñuel Moreno
Future Generation Computer Systems-The International Journal of Escience, vol. 180, Article 108393. Publication date: 2026-07-01 (Epub 2026-01-24). Journal Article.
DOI: 10.1016/j.future.2026.108393
https://www.sciencedirect.com/science/article/pii/S0167739X26000270

Abstract

This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple's SoC architectures, which integrate multiple computing engines, raising the question of which hardware components are best suited to general-purpose and domain-specific computations such as the GEneral Matrix Multiply (GEMM). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).
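For readers unfamiliar with the benchmark, the operation can be sketched as below. This is a minimal reference kernel, not the code used in the study: it assumes the conventional BLAS form C ← αAB + βC (the abstract does not state which form or library was used), and the function name `gemm` and the list-of-lists matrix type are hypothetical choices made for illustration. It also assumes non-empty matrices.

```python
from typing import List

Matrix = List[List[float]]


def gemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (a @ b) + beta * c using the textbook triple loop.

    a is m x k, b is k x n, c is m x n. Assumes the BLAS-style GEMM
    definition; this is an illustrative sketch, not the paper's kernel.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("c must be m x n")

    # Start from the scaled accumulator beta * c.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]

    # i-p-j loop order walks rows of b contiguously.
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out
```

Each output element takes k multiplications and k additions, which is why GEMM throughput is usually reported against a count of 2·m·n·k floating-point operations.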
The assessments use GEMM as a benchmark to characterize the performance of the CPU and GPU, alongside tests on the AMX, which specializes in large-scale mathematical operations, and on the ANE, which is designed specifically for deep learning. Energy consumption data was also collected to analyze the energy efficiency of these components. Results show notable improvements in computational capacity and energy efficiency over successive generations. On the one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, the AMX achieves up to 68% of the GPU's FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing the other accelerators with over 700 GFLOPS/W under batched workloads.
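The throughput and efficiency figures quoted above can be related with some simple arithmetic. The sketch below assumes the conventional 2·m·n·k operation count for GEMM, which the abstract does not confirm as the paper's counting method, and all function names are hypothetical. The only numbers taken from the abstract are the 68% and 42% ratios; since both are stated as "up to" figures, treating them as coming from the same run is an assumption.

```python
def gemm_flops(m: int, n: int, k: int) -> int:
    """Conventional operation count: one multiply and one add per inner-product term."""
    return 2 * m * n * k


def gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Throughput in GFLOPS for one m x k by k x n GEMM that took `seconds`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, n, k) / seconds / 1e9


def gflops_per_watt(rate_gflops: float, watts: float) -> float:
    """Energy efficiency: throughput divided by average power draw."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return rate_gflops / watts


def relative_efficiency(perf_ratio: float, power_ratio: float) -> float:
    """Performance-per-watt of device X relative to device Y, given
    X's performance and power as fractions of Y's."""
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return perf_ratio / power_ratio


if __name__ == "__main__":
    # Ratios from the abstract (M4 Pro, AMX vs. GPU, FP32), assumed to
    # describe the same workload for the purpose of this illustration.
    print(f"{relative_efficiency(0.68, 0.42):.2f}x")
```

Under that assumption, 68% of the performance at 42% of the power corresponds to roughly 1.6 times the GPU's FP32 performance per watt.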
This analysis clarifies how Apple's custom ARM designs optimize both performance and energy use, particularly with respect to multi-core processing and specialized acceleration units. A significant contribution of this study is its comprehensive comparative analysis of Apple's accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.
Source journal metrics: CiteScore 19.90 · Self-citation rate 2.70% · Articles published 376 · Review time 10.6 months
Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.