Survey and Evaluation of Converging Architecture in LLMs Based on Footsteps of Operations

Seongho Kim, Jihyun Moon, Juntaek Oh, Insu Choi, Joon-Sung Yang
{"title":"Survey and Evaluation of Converging Architecture in LLMs Based on Footsteps of Operations","authors":"Seongho Kim;Jihyun Moon;Juntaek Oh;Insu Choi;Joon-Sung Yang","doi":"10.1109/OJCS.2025.3587005","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs), which have emerged from advances in natural language processing (NLP), enable chatbots, virtual assistants, and numerous domain-specific applications. These models, often comprising billions of parameters, leverage the Transformer architecture and Attention mechanisms to process context effectively and address long-term dependencies more efficiently than earlier approaches, such as recurrent neural networks (RNNs). Notably, since the introduction of Llama, the architectural development of LLMs has significantly converged, predominantly settling on a Transformer-based decoder-only architecture. The evolution of LLMs has been driven by advances in high-bandwidth memory, specialized accelerators, and optimized architectures, enabling models to scale to billions of parameters. However, it also introduces new challenges: meeting compute and memory efficiency requirements across diverse deployment targets, ranging from data center servers to resource-constrained edge devices. To address these challenges, we survey the evolution of LLMs at two complementary levels: architectural trends and their underlying operational mechanisms. Furthermore, we quantify how hyperparameter settings influence inference latency by profiling kernel-level execution on a modern GPU architecture. Our findings reveal that identical models can exhibit varying performance based on hyperparameter configurations and deployment contexts, emphasizing the need for scalable and efficient solutions. The insights distilled from this analysis guide the optimization of performance and efficiency within these converged LLM architectures, thereby extending their applicability across a broader range of environments.","PeriodicalId":13205,"journal":{"name":"IEEE Open Journal of the Computer Society","volume":"6 ","pages":"1214-1226"},"PeriodicalIF":0.0000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11072851","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Open Journal of the Computer Society","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/11072851/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Large language models (LLMs), which have emerged from advances in natural language processing (NLP), enable chatbots, virtual assistants, and numerous domain-specific applications. These models, often comprising billions of parameters, leverage the Transformer architecture and attention mechanisms to process context effectively and address long-term dependencies more efficiently than earlier approaches, such as recurrent neural networks (RNNs). Notably, since the introduction of Llama, the architectural development of LLMs has significantly converged, predominantly settling on a Transformer-based decoder-only architecture. The evolution of LLMs has been driven by advances in high-bandwidth memory, specialized accelerators, and optimized architectures, enabling models to scale to billions of parameters. However, this scaling also introduces new challenges: meeting compute and memory efficiency requirements across diverse deployment targets, ranging from data center servers to resource-constrained edge devices. To address these challenges, we survey the evolution of LLMs at two complementary levels: architectural trends and their underlying operational mechanisms. Furthermore, we quantify how hyperparameter settings influence inference latency by profiling kernel-level execution on a modern GPU architecture. Our findings reveal that identical models can exhibit varying performance based on hyperparameter configurations and deployment contexts, emphasizing the need for scalable and efficient solutions. The insights distilled from this analysis guide the optimization of performance and efficiency within these converged LLM architectures, thereby extending their applicability across a broader range of environments.
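The abstract describes quantifying how hyperparameter settings influence inference latency by profiling kernel-level execution on a GPU. The snippet below is a minimal sketch of that kind of measurement, not the authors' actual harness: it uses PyTorch's torch.profiler and scaled_dot_product_attention as stand-ins for whatever kernels and tooling the paper profiles, and the hyperparameter values (batch size, head count, head dimension, sequence lengths) are illustrative assumptions rather than settings taken from the paper.

```python
# Minimal sketch (assumed setup, not the authors' code): profile the CUDA
# kernels behind one causal attention call and see how latency shifts as a
# single hyperparameter (context length) changes. Requires PyTorch >= 2.0
# with a CUDA-capable GPU.
import torch
import torch.nn.functional as F
from torch.profiler import profile, ProfilerActivity


def attention_latency(batch, heads, seq_len, head_dim, device="cuda"):
    """Profile one causal scaled-dot-product attention call and return the kernel table."""
    q = torch.randn(batch, heads, seq_len, head_dim, device=device, dtype=torch.float16)
    k = torch.randn_like(q)
    v = torch.randn_like(q)

    # Warm-up so one-time CUDA initialization does not pollute the measurement.
    F.scaled_dot_product_attention(q, k, v, is_causal=True)
    torch.cuda.synchronize()

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        F.scaled_dot_product_attention(q, k, v, is_causal=True)
        torch.cuda.synchronize()

    # Aggregate per-kernel statistics, sorted by total GPU time.
    return prof.key_averages().table(sort_by="cuda_time_total", row_limit=5)


if __name__ == "__main__":
    # Sweep the context length while holding the other knobs fixed
    # (hypothetical values chosen only for illustration).
    for seq_len in (512, 2048, 8192):
        print(f"--- seq_len={seq_len} ---")
        print(attention_latency(batch=1, heads=32, seq_len=seq_len, head_dim=128))
```

Sweeping one knob such as the context length while holding the others fixed is one way to reproduce, at small scale, the paper's observation that the same operation can dominate latency very differently across hyperparameter configurations and deployment contexts.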