Corpus-based Approaches to Spoken L2 Production
V. Brezina, Dana Gablasova, Tony McEnery
International Journal of Learner Corpus Research, 24 September 2019. DOI: 10.1075/ijlcr.00008.int

From the perspective of the compilers, a corpus is a journey. This particular journey – the process of designing and compiling the Trinity Lancaster Corpus (TLC), the largest spoken learner corpus of (interactive) English to date – took over five years. It involved more than 3,500 hours of transcription time,¹ with many more hours spent on quality checking and post-processing of the data. This simple statistic shows why learner corpora of spoken language are still relatively rare, despite the unique insight they provide into spontaneous language production (McEnery, Brezina, Gablasova & Banerjee 2019). While advances in computational technology allow better data processing and more efficient analysis, the starting point of a spoken (learner) corpus is still the recording of speech and its manual transcription. This method is considerably more reliable in capturing the details of spoken language than any existing voice recognition system. This is true for spoken L1 (McEnery 2018) as well as spoken L2 data (Gilquin 2015). The difference between the performance of an experienced transcriber and a state-of-the-art automated system is immediately obvious from the comparison shown in Table 1. For meaningful linguistic analysis, only the sample transcript shown on the left (from the TLC) is suitable, as it represents an accurate account of the spoken production. Building a spoken learner corpus is thus a resource-intensive project. The compilation of the TLC was made possible by research collaboration between Lancaster University and Trinity College London, a major international testing board. The project was supported by the Economic and Social Research Council (ESRC) and Trinity College London.²