{"title":"均值上下文嵌入的规范决定其方差","authors":"Hiroaki Yamagiwa, Hidetoshi Shimodaira","doi":"arxiv-2409.11253","DOIUrl":null,"url":null,"abstract":"Contextualized embeddings vary by context, even for the same token, and form\na distribution in the embedding space. To analyze this distribution, we focus\non the norm of the mean embedding and the variance of the embeddings. In this\nstudy, we first demonstrate that these values follow the well-known formula for\nvariance in statistics and provide an efficient sequential computation method.\nThen, by observing embeddings from intermediate layers of several Transformer\nmodels, we found a strong trade-off relationship between the norm and the\nvariance: as the mean embedding becomes closer to the origin, the variance\nincreases. This trade-off is likely influenced by the layer normalization\nmechanism used in Transformer models. Furthermore, when the sets of token\nembeddings are treated as clusters, we show that the variance of the entire\nembedding set can theoretically be decomposed into the within-cluster variance\nand the between-cluster variance. We found experimentally that as the layers of\nTransformer models deepen, the embeddings move farther from the origin, the\nbetween-cluster variance relatively decreases, and the within-cluster variance\nrelatively increases. These results are consistent with existing studies on the\nanisotropy of the embedding spaces across layers.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Norm of Mean Contextualized Embeddings Determines their Variance\",\"authors\":\"Hiroaki Yamagiwa, Hidetoshi Shimodaira\",\"doi\":\"arxiv-2409.11253\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Contextualized embeddings vary by context, even for the same token, and form\\na distribution in the embedding space. To analyze this distribution, we focus\\non the norm of the mean embedding and the variance of the embeddings. In this\\nstudy, we first demonstrate that these values follow the well-known formula for\\nvariance in statistics and provide an efficient sequential computation method.\\nThen, by observing embeddings from intermediate layers of several Transformer\\nmodels, we found a strong trade-off relationship between the norm and the\\nvariance: as the mean embedding becomes closer to the origin, the variance\\nincreases. This trade-off is likely influenced by the layer normalization\\nmechanism used in Transformer models. Furthermore, when the sets of token\\nembeddings are treated as clusters, we show that the variance of the entire\\nembedding set can theoretically be decomposed into the within-cluster variance\\nand the between-cluster variance. We found experimentally that as the layers of\\nTransformer models deepen, the embeddings move farther from the origin, the\\nbetween-cluster variance relatively decreases, and the within-cluster variance\\nrelatively increases. 
These results are consistent with existing studies on the\\nanisotropy of the embedding spaces across layers.\",\"PeriodicalId\":501030,\"journal\":{\"name\":\"arXiv - CS - Computation and Language\",\"volume\":\"18 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Computation and Language\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11253\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11253","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Norm of Mean Contextualized Embeddings Determines their Variance
Contextualized embeddings vary by context, even for the same token, and form a distribution in the embedding space. To analyze this distribution, we focus on the norm of the mean embedding and the variance of the embeddings. In this study, we first demonstrate that these two quantities are related by the well-known variance formula from statistics and provide an efficient sequential computation method.
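The "well-known formula" can be read as the usual second-moment decomposition: the average squared norm of the embeddings equals the squared norm of their mean plus their variance. Below is a minimal sketch, in Python/NumPy, of how these quantities could be accumulated sequentially from running sums; the function and variable names are ours and illustrative, not the authors' implementation.

    import numpy as np

    def sequential_norm_and_variance(embeddings):
        """Accumulate the mean embedding and the variance in one pass.

        Only a running sum of the vectors and a running sum of squared
        norms are kept, so embeddings can be processed one at a time.
        Returns (squared norm of the mean, variance), where the variance
        is the average squared distance of an embedding from the mean.
        """
        n = 0
        sum_vec = None        # running sum of embeddings
        sum_sq_norm = 0.0     # running sum of ||x_i||^2
        for x in embeddings:
            x = np.asarray(x, dtype=np.float64)
            sum_vec = x.copy() if sum_vec is None else sum_vec + x
            sum_sq_norm += float(x @ x)
            n += 1
        mean = sum_vec / n
        sq_norm_of_mean = float(mean @ mean)       # ||mean||^2
        variance = sum_sq_norm / n - sq_norm_of_mean
        return sq_norm_of_mean, variance

    # Toy check of the identity (1/n) sum ||x_i||^2 = ||mean||^2 + variance.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 768))               # stand-in for contextualized embeddings
    sq_norm, var = sequential_norm_and_variance(X)
    assert np.isclose(sq_norm + var, np.mean(np.sum(X**2, axis=1)))

A Welford-style update would be the numerically safer choice in practice, but the running sums above are enough to illustrate the sequential computation.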
Then, by observing embeddings from the intermediate layers of several Transformer models, we find a strong trade-off between the norm and the variance: as the mean embedding moves closer to the origin, the variance increases. This trade-off is likely influenced by the layer normalization mechanism used in Transformer models.
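One heuristic way to see why layer normalization would induce this trade-off (our reading, not a derivation from the paper): if normalization keeps every contextualized embedding at roughly the same norm c, the identity behind the variance formula fixes the sum of the two quantities,

    \frac{1}{n}\sum_{i=1}^{n}\lVert x_i\rVert^{2}
      = \lVert \bar{x}\rVert^{2}
      + \frac{1}{n}\sum_{i=1}^{n}\lVert x_i-\bar{x}\rVert^{2}
      \approx c^{2},

so the squared norm of the mean and the variance share an approximately constant budget c^2: a mean embedding closer to the origin necessarily comes with a larger variance, and vice versa.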
Furthermore, when the sets of embeddings belonging to each token are treated as clusters, we show that the variance of the entire embedding set can theoretically be decomposed into the within-cluster variance and the between-cluster variance.
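This is the standard between/within split, as in the law of total variance. A small numerical sketch, assuming each cluster is the set of contextualized embeddings of one token; the toy data and names are ours:

    import numpy as np

    def variance_decomposition(clusters):
        """Split the total variance of all embeddings into within- and
        between-cluster parts, weighting each cluster by its size.

        `clusters` is a list of (n_k, d) arrays, one per token.
        """
        all_x = np.concatenate(clusters, axis=0)
        n = len(all_x)
        grand_mean = all_x.mean(axis=0)

        total = np.sum((all_x - grand_mean) ** 2) / n

        within = 0.0    # spread of embeddings around their own cluster mean
        between = 0.0   # spread of cluster means around the grand mean
        for c in clusters:
            mu_k = c.mean(axis=0)
            within += np.sum((c - mu_k) ** 2)
            between += len(c) * np.sum((mu_k - grand_mean) ** 2)
        return total, within / n, between / n

    # Toy check: the decomposition is exact (total = within + between).
    rng = np.random.default_rng(0)
    clusters = [rng.normal(loc=rng.normal(size=16), scale=1.0,
                           size=(int(rng.integers(5, 50)), 16))
                for _ in range(10)]
    total, within, between = variance_decomposition(clusters)
    assert np.isclose(total, within + between)

With this size-weighting the identity holds exactly, which is what allows the relative shares of the two terms to be compared across layers.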
We find experimentally that as the layers of Transformer models deepen, the embeddings move farther from the origin, the between-cluster variance decreases in relative terms, and the within-cluster variance increases in relative terms. These results are consistent with existing studies on the anisotropy of embedding spaces across layers.