Décio Luiz Gazzoni Filho, Guilherme Brandão, Julio López Hernandez
{"title":"使用矩阵乘法加速器进行快速多项式乘法,并在 Apple M1/M3 SoC 上应用于 NTRU","authors":"Décio Luiz Gazzoni Filho, Guilherme Brandão, Julio López Hernandez","doi":"10.62056/a3txommol","DOIUrl":null,"url":null,"abstract":"Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by \n \n 1.54\n \n –\n \n 3.07\n ×\n \n , \n \n 1.08\n \n –\n \n 1.33\n ×\n \n , \n \n 1.11\n \n –\n \n 1.50\n ×\n \n and \n \n 1.20\n \n –\n \n 1.98\n ×\n \n , respectively, over the previous state-of-the-art.","PeriodicalId":508905,"journal":{"name":"IACR Cryptol. ePrint Arch.","volume":"75 8","pages":"2"},"PeriodicalIF":0.0000,"publicationDate":"2024-04-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs\",\"authors\":\"Décio Luiz Gazzoni Filho, Guilherme Brandão, Julio López Hernandez\",\"doi\":\"10.62056/a3txommol\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by \\n \\n 1.54\\n \\n –\\n \\n 3.07\\n ×\\n \\n , \\n \\n 1.08\\n \\n –\\n \\n 1.33\\n ×\\n \\n , \\n \\n 1.11\\n \\n –\\n \\n 1.50\\n ×\\n \\n and \\n \\n 1.20\\n \\n –\\n \\n 1.98\\n ×\\n \\n , respectively, over the previous state-of-the-art.\",\"PeriodicalId\":508905,\"journal\":{\"name\":\"IACR Cryptol. ePrint Arch.\",\"volume\":\"75 8\",\"pages\":\"2\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-04-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IACR Cryptol. ePrint Arch.\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.62056/a3txommol\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IACR Cryptol. ePrint Arch.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.62056/a3txommol","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Fast polynomial multiplication using matrix multiplication accelerators with applications to NTRU on Apple M1/M3 SoCs
Efficient polynomial multiplication routines are critical to the performance of lattice-based post-quantum cryptography (PQC). As PQC standards only recently started to emerge, CPUs still lack specialized instructions to accelerate such routines. Meanwhile, deep learning has grown immeasurably in importance. Its workloads call for teraflops-level of processing power for linear algebra operations, mainly matrix multiplication. Computer architects have responded by introducing ISA extensions, coprocessors and special-purpose cores to accelerate such operations. In particular, Apple ships an undocumented matrix-multiplication coprocessor, AMX, in hundreds of millions of mobile phones, tablets and personal computers. Our work repurposes AMX to implement polynomial multiplication and applies it to the NTRU cryptosystem, setting new speed records on the Apple M1 and M3 systems-on-chip (SoCs): polynomial multiplication, key generation, encapsulation and decapsulation are sped up by
1.54
–
3.07
×
,
1.08
–
1.33
×
,
1.11
–
1.50
×
and
1.20
–
1.98
×
, respectively, over the previous state-of-the-art.