{"title":"Towards a theory of how the structure of language is acquired by deep neural networks","authors":"Francesco Cagnetta, Matthieu Wyart","doi":"arxiv-2406.00048","DOIUrl":null,"url":null,"abstract":"How much data is required to learn the structure of a language via next-token\nprediction? We study this question for synthetic datasets generated via a\nProbabilistic Context-Free Grammar (PCFG) -- a hierarchical generative model\nthat captures the tree-like structure of natural languages. We determine\ntoken-token correlations analytically in our model and show that they can be\nused to build a representation of the grammar's hidden variables, the longer\nthe range the deeper the variable. In addition, a finite training set limits\nthe resolution of correlations to an effective range, whose size grows with\nthat of the training set. As a result, a Language Model trained with\nincreasingly many examples can build a deeper representation of the grammar's\nstructure, thus reaching good performance despite the high dimensionality of\nthe problem. We conjecture that the relationship between training set size and\neffective range of correlations holds beyond our synthetic datasets. In\nparticular, our conjecture predicts how the scaling law for the test loss\nbehaviour with training set size depends on the length of the context window,\nwhich we confirm empirically for a collection of lines from Shakespeare's\nplays.","PeriodicalId":501066,"journal":{"name":"arXiv - PHYS - Disordered Systems and Neural Networks","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - PHYS - Disordered Systems and Neural Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2406.00048","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
How much data is required to learn the structure of a language via next-token prediction? We study this question for synthetic datasets generated via a Probabilistic Context-Free Grammar (PCFG), a hierarchical generative model that captures the tree-like structure of natural languages. We determine token-token correlations analytically in our model and show that they can be used to build a representation of the grammar's hidden variables: the longer the range of a correlation, the deeper the corresponding variable. In addition, a finite training set limits the resolution of correlations to an effective range, whose size grows with that of the training set. As a result, a language model trained on increasingly many examples builds an increasingly deep representation of the grammar's structure, and thus reaches good performance despite the high dimensionality of the problem. We conjecture that this relationship between training set size and effective range of correlations holds beyond our synthetic datasets. In particular, the conjecture predicts how the scaling of the test loss with training set size depends on the length of the context window, which we confirm empirically on a collection of lines from Shakespeare's plays.
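
To make the setup concrete, the snippet below is a minimal sketch of the kind of experiment the abstract describes: sampling token sequences from a toy hierarchical PCFG and estimating how token-token correlations decay with the distance between tokens. The grammar construction, vocabulary size, branching factor, number of production rules, and depth are illustrative assumptions, not the parameters or the exact model used in the paper.

```python
# Minimal sketch (illustrative assumptions, not the paper's exact model):
# sample sequences from a randomly generated hierarchical PCFG and estimate
# token-token correlations as a function of the distance between tokens.
import random
import numpy as np

rng = random.Random(0)

# Illustrative parameters: V symbols per level, S children per nonterminal,
# M production rules per nonterminal, L levels in the generation tree.
V, S, M, L = 8, 2, 2, 4

# For each level ell = 1..L, every nonterminal a gets M random productions,
# each a tuple of S symbols of the level below; level-0 symbols are the tokens.
productions = {
    ell: {a: [tuple(rng.randrange(V) for _ in range(S)) for _ in range(M)]
          for a in range(V)}
    for ell in range(1, L + 1)
}

def sample_sequence():
    """Sample one token sequence of length S**L by expanding the root top-down."""
    symbols = [rng.randrange(V)]              # root symbol, at level L
    for ell in range(L, 0, -1):               # expand one level at a time
        symbols = [child for a in symbols
                   for child in rng.choice(productions[ell][a])]
    return symbols                            # tokens, at level 0

def correlation(samples, t):
    """Mean absolute covariance between token identities at distance t."""
    x = np.array(samples)                     # shape (num_samples, S**L)
    n, length = x.shape
    left = x[:, :length - t]                  # token at position i
    right = x[:, t:]                          # token at position i + t
    joint = np.zeros((V, V))
    for a in range(V):
        for b in range(V):
            joint[a, b] = np.mean((left == a) & (right == b))
    pa = np.mean(left[:, :, None] == np.arange(V), axis=(0, 1))
    pb = np.mean(right[:, :, None] == np.arange(V), axis=(0, 1))
    return np.abs(joint - np.outer(pa, pb)).mean()

samples = [sample_sequence() for _ in range(5000)]
for t in range(1, S ** L):
    print(f"distance {t:2d}: correlation {correlation(samples, t):.5f}")
```

In a toy grammar of this kind, the correlation at distance t is tied to the lowest level of the tree at which the two tokens share an ancestor, and estimating it from a finite sample illustrates how the number of examples limits the range over which correlations can be resolved, which is the mechanism the abstract appeals to.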