A pipeline for stochastic and controlled generation of realistic language input for simulating infant language acquisition.

IF 3.9 · CAS Region 2 (Psychology) · JCR Q1 (Psychology, Experimental)
Okko Räsänen, Daniil Kocharov
{"title":"A pipeline for stochastic and controlled generation of realistic language input for simulating infant language acquisition.","authors":"Okko Räsänen, Daniil Kocharov","doi":"10.3758/s13428-025-02772-6","DOIUrl":null,"url":null,"abstract":"<p><p>Computational models of early language development involve implementing theories of learning as functional learning algorithms, exposing these models to realistic language input, and comparing learning outcomes to those in infants. While recent research has made major strides in developing more powerful learning models and evaluation protocols grounded in infant data, models are still predominantly trained with non-naturalistic input data, such as crowd-sourced read speech or text transcripts. This is due to the lack of suitable child-directed speech (CDS) corpora in terms of scale and quality. In parallel, the question of how properties and individual variability in language input affect learning outcomes is an active area of empirical research, underlining the need for realistic yet controllable data for modeling such phenomena. This paper presents a solution to the training data problem through stochastic generation of naturalistic CDS data using statistical models, thereby enabling controlled computational simulations with naturalistic input. We provide a proof-of-concept demonstration of the approach by showing how naturalistic CDS transcripts can be generated with a language model conditioned on recipient information (here, infant age), and how text-to-speech systems can be used to convert the transcripts to high-quality speech with a controllable speaking style. We also conduct modeling experiments with generated speech corpora by varying different aspects of the data, showing how this maps into different learning outcomes, thereby demonstrating the feasibility of the approach for controlled language learning simulations. Finally, we discuss the limitations of using synthetic data in general, and of the present proof-of-concept pipeline in particular.</p>","PeriodicalId":8717,"journal":{"name":"Behavior Research Methods","volume":"57 10","pages":"275"},"PeriodicalIF":3.9000,"publicationDate":"2025-09-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12411597/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Behavior Research Methods","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.3758/s13428-025-02772-6","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, EXPERIMENTAL","Score":null,"Total":0}
引用次数: 0

Abstract

Computational models of early language development involve implementing theories of learning as functional learning algorithms, exposing these models to realistic language input, and comparing learning outcomes to those in infants. While recent research has made major strides in developing more powerful learning models and evaluation protocols grounded in infant data, models are still predominantly trained with non-naturalistic input data, such as crowd-sourced read speech or text transcripts. This is due to the lack of suitable child-directed speech (CDS) corpora in terms of scale and quality. In parallel, the question of how properties and individual variability in language input affect learning outcomes is an active area of empirical research, underlining the need for realistic yet controllable data for modeling such phenomena. This paper presents a solution to the training data problem through stochastic generation of naturalistic CDS data using statistical models, thereby enabling controlled computational simulations with naturalistic input. We provide a proof-of-concept demonstration of the approach by showing how naturalistic CDS transcripts can be generated with a language model conditioned on recipient information (here, infant age), and how text-to-speech systems can be used to convert the transcripts to high-quality speech with a controllable speaking style. We also conduct modeling experiments with generated speech corpora by varying different aspects of the data, showing how this maps into different learning outcomes, thereby demonstrating the feasibility of the approach for controlled language learning simulations. Finally, we discuss the limitations of using synthetic data in general, and of the present proof-of-concept pipeline in particular.
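The abstract describes a two-stage pipeline: a language model conditioned on recipient information (infant age) generates child-directed speech (CDS) transcripts, and a text-to-speech system with a controllable speaking style renders them as audio. The sketch below only illustrates that structure under stated assumptions; the placeholder model (gpt2), the prompt-based age conditioning, and the synthesize() stub are illustrative and are not the authors' implementation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract:
# (1) an age-conditioned language model samples CDS-style transcripts,
# (2) a text-to-speech step renders them with a chosen speaking style.
# Model, prompt, and synthesize() below are assumptions for illustration.
from transformers import pipeline


def generate_cds_transcripts(infant_age_months: int, n_utterances: int = 5):
    """Sample CDS-style utterances from a causal LM, conditioning via the prompt."""
    generator = pipeline("text-generation", model="gpt2")  # placeholder LM
    prompt = f"A caregiver talking to a {infant_age_months}-month-old infant says:"
    outputs = generator(
        prompt,
        max_new_tokens=30,
        num_return_sequences=n_utterances,
        do_sample=True,      # stochastic generation
        temperature=0.9,
    )
    # Strip the conditioning prompt, keep only the generated utterance.
    return [o["generated_text"][len(prompt):].strip() for o in outputs]


def synthesize(text: str, speaking_style: str, out_path: str) -> None:
    """Hypothetical stand-in for a style-controllable TTS system."""
    # A real pipeline would call a TTS model here; this stub only logs the request.
    print(f"[TTS:{speaking_style}] {text} -> {out_path}")


if __name__ == "__main__":
    for i, utterance in enumerate(generate_cds_transcripts(infant_age_months=9)):
        synthesize(utterance, speaking_style="infant-directed", out_path=f"utt_{i}.wav")
```

In a controlled simulation, the infant age, sampling temperature, and speaking style would be the experimental knobs varied across generated corpora.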


Source journal: Behavior Research Methods
CiteScore: 10.30
Self-citation rate: 9.30%
Articles published: 266
Journal description: Behavior Research Methods publishes articles concerned with the methods, techniques, and instrumentation of research in experimental psychology. The journal focuses particularly on the use of computer technology in psychological research. An annual special issue is devoted to this field.