The Information of Large Language Model Geometry

Zhiquan Tan, Chenghai Li, Weiran Huang
{"title":"The Information of Large Language Model Geometry","authors":"Zhiquan Tan, Chenghai Li, Weiran Huang","doi":"arxiv-2402.03471","DOIUrl":null,"url":null,"abstract":"This paper investigates the information encoded in the embeddings of large\nlanguage models (LLMs). We conduct simulations to analyze the representation\nentropy and discover a power law relationship with model sizes. Building upon\nthis observation, we propose a theory based on (conditional) entropy to\nelucidate the scaling law phenomenon. Furthermore, we delve into the\nauto-regressive structure of LLMs and examine the relationship between the last\ntoken and previous context tokens using information theory and regression\ntechniques. Specifically, we establish a theoretical connection between the\ninformation gain of new tokens and ridge regression. Additionally, we explore\nthe effectiveness of Lasso regression in selecting meaningful tokens, which\nsometimes outperforms the closely related attention weights. Finally, we\nconduct controlled experiments, and find that information is distributed across\ntokens, rather than being concentrated in specific \"meaningful\" tokens alone.","PeriodicalId":501433,"journal":{"name":"arXiv - CS - Information Theory","volume":"127 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Information Theory","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2402.03471","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This paper investigates the information encoded in the embeddings of large language models (LLMs). We conduct simulations to analyze the representation entropy and discover a power law relationship with model sizes. Building upon this observation, we propose a theory based on (conditional) entropy to elucidate the scaling law phenomenon. Furthermore, we delve into the auto-regressive structure of LLMs and examine the relationship between the last token and previous context tokens using information theory and regression techniques. Specifically, we establish a theoretical connection between the information gain of new tokens and ridge regression. Additionally, we explore the effectiveness of Lasso regression in selecting meaningful tokens, which sometimes outperforms the closely related attention weights. Finally, we conduct controlled experiments, and find that information is distributed across tokens, rather than being concentrated in specific "meaningful" tokens alone.
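To make the first claim of the abstract concrete, the sketch below computes one common notion of representation entropy for a set of token embeddings: the Shannon entropy of the eigenvalue spectrum of the normalized Gram matrix. This is an illustrative estimator, not necessarily the one used in the paper, and the usage showing a comparison across model sizes is hypothetical.

```python
import numpy as np

def representation_entropy(H: np.ndarray) -> float:
    """Matrix-based entropy of token embeddings (illustrative sketch).

    H: array of shape (num_tokens, hidden_dim), one row per token embedding.
    Returns the Shannon entropy of the normalized eigenvalue spectrum of the
    Gram matrix, a standard proxy for how "spread out" the representation is.
    """
    H = H - H.mean(axis=0, keepdims=True)       # center the embeddings
    gram = H @ H.T                              # (num_tokens, num_tokens), PSD
    gram /= np.trace(gram) + 1e-12              # normalize to unit trace
    eigvals = np.linalg.eigvalsh(gram)
    eigvals = np.clip(eigvals, 1e-12, None)     # guard against tiny negative values
    return float(-(eigvals * np.log(eigvals)).sum())

# Hypothetical usage: compare entropy across model sizes to look for a power law.
# embeddings_by_size = {"125M": H_small, "1.3B": H_medium, "6.7B": H_large}
# for size, H in embeddings_by_size.items():
#     print(size, representation_entropy(H))
```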
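The abstract also describes using Lasso regression to select which context tokens carry information about the last token. A minimal, hedged reading of that idea: regress the last token's hidden state on the hidden states of the preceding tokens with an L1 penalty, and treat tokens with large coefficients as "selected". The alpha value and the comparison against attention weights below are illustrative choices, not the paper's reported setup.

```python
import numpy as np
from sklearn.linear_model import Lasso

def select_tokens_with_lasso(H_context: np.ndarray,
                             h_last: np.ndarray,
                             alpha: float = 0.01) -> np.ndarray:
    """Score context tokens via L1-regularized regression (illustrative sketch).

    H_context: (num_context_tokens, hidden_dim) hidden states of previous tokens.
    h_last:    (hidden_dim,) hidden state of the last token.
    Returns absolute Lasso coefficients, one per context token; larger values
    indicate tokens that contribute more to reconstructing the last token.
    """
    X = H_context.T                  # (hidden_dim, num_context_tokens)
    y = h_last                       # (hidden_dim,)
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000)
    model.fit(X, y)
    return np.abs(model.coef_)       # sparsity marks "selected" tokens

# Hypothetical comparison with attention: given a vector `attn` of attention
# weights from the last token to the context, one could rank tokens by the
# Lasso scores and by `attn`, then compare the two rankings (e.g., with a
# rank correlation).
```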