A comparative performance and efficiency analysis of Apple’s M architectures: A GEMM case study

IF 6.2 | Zone 2, Computer Science | JCR Q1, Computer Science, Theory & Methods

Future Generation Computer Systems-The International Journal of Escience. Pub Date: 2026-07-01. Epub Date: 2026-01-24. DOI: 10.1016/j.future.2026.108393

Sandra Catalán, Rafael Rodríguez-Sánchez, Carlos García Sánchez, Luis Piñuel Moreno

Volume 180, Article 108393. https://www.sciencedirect.com/science/article/pii/S0167739X26000270

Citations: 0

Abstract

This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple's SoC architectures, which integrate multiple computing engines and thus raise the question of which hardware components are best suited to executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (GEMM). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).
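For readers unfamiliar with the benchmark, GEMM computes C ← αAB + βC. The following is a minimal pure-Python sketch of that operation together with the conventional 2·m·n·k operation count used to turn a run time into GFLOPS. It is an illustration only, not the paper's benchmark code; the function names `gemm`, `gemm_flops`, and `gflops` are hypothetical.

```python
def gemm(alpha, a, b, beta, c):
    """Reference GEMM on list-of-list matrices: returns alpha * (a @ b) + beta * c.

    a is m x k, b is k x n, c is m x n. Illustrative only; real benchmarks
    call an optimized BLAS or accelerator library instead.
    """
    m, k, n = len(a), len(b), len(b[0])
    # Start from the scaled C term.
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for p in range(k):
            a_ip = alpha * a[i][p]
            for j in range(n):
                out[i][j] += a_ip * b[p][j]
    return out


def gemm_flops(m, n, k):
    """Conventional GEMM operation count: one multiply and one add per term."""
    return 2 * m * n * k


def gflops(m, n, k, seconds):
    """Achieved rate in GFLOPS for an m x n x k GEMM that took `seconds`."""
    return gemm_flops(m, n, k) / seconds / 1e9


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    zero = [[0.0, 0.0], [0.0, 0.0]]
    print(gemm(1.0, a, b, 0.0, zero))
    print(gflops(1000, 1000, 1000, 2.0))
```

Under this convention, a 1000 × 1000 × 1000 GEMM that completes in 2 seconds corresponds to 1 GFLOPS.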

The assessments use GEMM as a benchmark to characterize the performance of the CPU and GPU, alongside tests on the AMX, which specializes in large-scale mathematical operations, and on the ANE, which is designed specifically for deep learning. Energy consumption data were also collected to analyze the energy efficiency of these components. Results show notable improvements in computational capacity and energy efficiency over successive generations. On the one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU's FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing the other accelerators with over 700 GFLOPS/W under batched workloads.
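The M4 Pro figures above imply a relative energy-efficiency advantage that can be worked out directly. The sketch below assumes the 68% performance and 42% power figures refer to the same operating point, which the abstract does not state explicitly; the helper names are hypothetical.

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency as throughput per unit of power."""
    return gflops / watts


def relative_efficiency(perf_fraction, power_fraction):
    """Efficiency of device A relative to device B.

    perf_fraction:  A's performance as a fraction of B's.
    power_fraction: A's power draw as a fraction of B's.
    """
    return perf_fraction / power_fraction


if __name__ == "__main__":
    # Assumption: both fractions from the abstract describe the same run.
    ratio = relative_efficiency(0.68, 0.42)
    print(f"AMX vs GPU FP32 efficiency on M4 Pro: about {ratio:.2f}x")
```

Under that assumption, the AMX would deliver roughly 1.6 times the GPU's FP32 throughput per watt in the best case reported.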

This analysis clarifies how Apple's custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. A further significant contribution of this study is its comprehensive comparative analysis of Apple's accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.

Source journal

Future Generation Computer Systems-The International Journal of Escience (Engineering & Technology - Computer Science: Theory & Methods)

CiteScore

19.90

Self-citation rate

2.70%

Articles published

376

Review time

10.6 months

Journal description: Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications. Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration. Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.