Sandra Catalán , Rafael Rodríguez-Sánchez , Carlos García Sánchez , Luis Piñuel Moreno
{"title":"A comparative performance and efficiency analysis of Apple’s M architectures: A GEMM case study","authors":"Sandra Catalán , Rafael Rodríguez-Sánchez , Carlos García Sánchez , Luis Piñuel Moreno","doi":"10.1016/j.future.2026.108393","DOIUrl":null,"url":null,"abstract":"<div><div>This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple´s SoC architectures, which integrate multiple computing engines raising the scientific question of which hardware components are best suited for executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (<span>GEMM</span>). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).</div><div>The assessments use the <span>GEMM</span> as benchmark to characterize the performance of the CPU and GPU, alongside tests on AMX, which is specialized in handling large-scale mathematical operations, and tests on the ANE, which is specifically designed for Deep Learning purposes. Additionally, energy consumption data has been collected to analyze the energy efficiency of the aforementioned resources. Results highlight notable improvements in computational capacity and energy efficiency over successive generations. On one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU’s FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing other accelerators with over 700 GFLOPs/Watt under batched workloads.</div><div>This analysis offers a clear understanding of how Apple´s custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. In addition, a significant contribution of this study is the comprehensive comparative analysis of Apple’s accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108393"},"PeriodicalIF":6.2000,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Future Generation Computer Systems-The International Journal of Escience","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167739X26000270","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2026/1/24 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 0
Abstract
This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple´s SoC architectures, which integrate multiple computing engines raising the scientific question of which hardware components are best suited for executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (GEMM). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).
The assessments use the GEMM as benchmark to characterize the performance of the CPU and GPU, alongside tests on AMX, which is specialized in handling large-scale mathematical operations, and tests on the ANE, which is specifically designed for Deep Learning purposes. Additionally, energy consumption data has been collected to analyze the energy efficiency of the aforementioned resources. Results highlight notable improvements in computational capacity and energy efficiency over successive generations. On one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU’s FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing other accelerators with over 700 GFLOPs/Watt under batched workloads.
This analysis offers a clear understanding of how Apple´s custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. In addition, a significant contribution of this study is the comprehensive comparative analysis of Apple’s accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.
期刊介绍:
Computing infrastructures and systems are constantly evolving, resulting in increasingly complex and collaborative scientific applications. To cope with these advancements, there is a growing need for collaborative tools that can effectively map, control, and execute these applications.
Furthermore, with the explosion of Big Data, there is a requirement for innovative methods and infrastructures to collect, analyze, and derive meaningful insights from the vast amount of data generated. This necessitates the integration of computational and storage capabilities, databases, sensors, and human collaboration.
Future Generation Computer Systems aims to pioneer advancements in distributed systems, collaborative environments, high-performance computing, and Big Data analytics. It strives to stay at the forefront of developments in grids, clouds, and the Internet of Things (IoT) to effectively address the challenges posed by these wide-area, fully distributed sensing and computing systems.