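Creating a Large-Scale Audio-Aligned Parsed Corpus of Bilingual Russian Child and Child-Directed Speech (BiRCh): Challenges, Solutions, and Implications for Research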

Q2 Arts and Humanities
Alex Lưu, Pasha Koval, Sophia A. Malamud, Irina Y. Dubinina
DOI: 10.1590/2176-4573e55831 (https://doi.org/10.1590/2176-4573e55831)
Publication date: 2022-10-01
Publication type: Journal Article
Citations: 0

Abstract

The BiRCh Project (The Corpus of Bilingual Russian Child Speech) involves collecting a longitudinal audio corpus of Russian spoken by children and their families in Russia, Ukraine, Germany, the U.S., and Canada. We are building a large-scale corpus based on a subset of this data, the “Parsed and Audio-aligned Corpus of Bilingual Russian Child and Child-directed Speech (BiRCh)”, with two basic components: (1) 1-million-word transcripts which are time-aligned with the audio speech signal and fully text-searchable, and (2) a 500K-word morphologically annotated and parsed portion of the transcripts, also audio-aligned. We are using this corpus to investigate various phenomena in the linguistic input and the developmental trajectory of heritage bilinguals, e.g., case, gender, passives, impersonals, politeness markers, disfluencies, and discourse markers. This article focuses on the challenges and solutions of the BiRCh development and the implications for research on the richly annotated data provided by the corpus.
Source journal: Bakhtiniana (Arts and Humanities: Literature and Literary Theory)
CiteScore: 0.20
Self-citation rate: 0.00%
Articles published: 69
Review time: 12 weeks
Journal description: Bakhtiniana. Revista de Estudos do Discurso [Bakhtiniana. Journal of Discourse Studies], in electronic format, was created in 2008 by Programa de Estudos Pós-Graduados em Linguística Aplicada e Estudos da Linguagem [the Applied Linguistics and Language Studies Graduate Program] of Pontifícia Universidade Católica de São Paulo/LAEL-PUCSP and by the members of the Linguagem, identidade e memória [Language, Identity and Memory] Research Group/CNPq (National Council for Scientific and Technological Development). The journal's mission is to promote and publicize research on discourse, mainly on dialogic studies. From 2019 on, it will publish an issue every three months. Each issue is composed of papers and book reviews written by professors and PhD researchers from international and national universities. This is the only journal that covers Bakhtinian studies per se and that dialogues with other areas of knowledge in Brazil and abroad.