Analyzing the BBC Voices data: Contemporary English dialect areas and their characteristic lexical variants

Lit. Linguistic Comput. | Pub Date: 2014-04-01 | DOI: 10.1093/llc/fqt009

Martijn Wieling, Clive Upton, Ann Thompson


Cited by: 15

Abstract

This study investigates data from the BBC Voices project, which contains a large amount of vernacular data collected by the BBC between 2004 and 2005. The project was designed primarily to collect information on vernacular speech around the United Kingdom for broadcasting purposes. As part of the project, a web-based questionnaire was created, to which tens of thousands of people supplied their way of denoting thirty-eight concepts known to exhibit marked lexical variation. Along with their variants, those responding to the online prompts provided information on their age, gender, and (significantly for this study) their location, recorded by means of their postcode. In this study we focus on the relative frequency of the top-ten variants for all concepts in every postcode area. By using hierarchical spectral partitioning of bipartite graphs, we are able to identify four contemporary geographical dialect areas together with their characteristic lexical variants. Even though these variants can be said to characterize their respective geographical area, they also occur in other areas, and not all people in a given region use the characteristic variant. This supports the view that dialect regions are not clearly defined by strict borders, but are fuzzy at best.

Introduction

In 2004 and 2005, the British Broadcasting Corporation conducted a large-scale survey in order to obtain a contemporary view of English dialectal variation. People visiting a specially constructed website were invited to offer their variants for thirty-eight concepts that were known to exhibit marked lexical variation. Along with their lexical use, informants were asked to provide details of their age, gender, and geographical (post-coded) location. Upwards of 29,000 people participated in this project (“BBC Voices”) to a greater or lesser degree, resulting in a substantial electronic dataset.

As dialectologists we are interested in investigating geographical structure which might be present in our data. Given the enormous size of the Voices lexical dataset (containing more than 700,000 responses in total), we use quantitative methods from dialectometry to provide an aggregate view of the contemporary English dialectal landscape. Dialectometry originated in the 1970s (Séguy, 1973) to provide a more objective method of identifying dialect differences than “cherry-picking” the features which support the analysis one wishes to settle on (Nerbonne, 2009). Unfortunately, dialectometry has not been received very favorably by some traditional dialectologists, as aggregate analyses obscure the importance of individual linguistic features, on which they are required to focus for their often philologically-directed purposes.

Consequently, there have been a number of attempts to develop quantitative methods which enable the identification of characteristic linguistic variables. For example, Shackleton (2007) uses cluster analysis and principal component analysis (PCA) to identify linguistic variables which show a specific geographic distribution, while Grieve et al. (2011) use spatial autocorrelation to detect significant geographical patterns in forty individual lexical alternation variables. Prokić et al. (2012) examine each item in a dataset, seeking those that differ minimally within a candidate area and maximally with respect to sites outside the area.

In this study, however, we use hierarchical bipartite spectral graph partitioning (BiSGP), which allows a simultaneous identification of geographical areas together with their characteristic linguistic features. This approach has been successfully used to obtain the linguistic basis (in terms of sound correspondences) with respect to a certain reference pronunciation for Dutch (Wieling et al., 2010), English (Wieling et al., submitted) and Tuscan (Montemagni et al., forthcoming) dialect datasets. In contrast to analyzing pronunciation data, however, we investigate the use of specific lexical variants (per concept variable) from the Voices data.

Dataset

The BBC Voices data contains a total of 38 concepts, which are shown in Table 1 below.

1. Hot
2. Cold
3. Tired
4. Unwell
5. Pleased
6. Annoyed
7. Play a game
8. Play truant
9. Throw
10. Hit hard
11. Sleep
12. Drunk
13. Pregnant
14. Left-handed
15. Lacking money
16. Rich
17. Insane
18. Attractive
19. Unattractive
20. Moody
21. Baby
22. Mother
23. Grandmother
24. Grandfather
25. Friend
26. Male partner
27. Female partner
28. Young person in cheap trendy clothes and jewelry
29. Clothes
30. Trousers
31. Child’s soft shoes worn for PE
32. Main room of house (with TV)
33. Long, soft seat in the main room
34. Toilet
35. Narrow walkway alongside buildings
36. To rain lightly
37. To rain heavily
38. Running water smaller than a river

Table 1: List of all 38 concepts in the BBC Voices dataset

The complete dataset contains (on average) 19,326 responses per concept. We only include responses from the online questionnaire, as the responses on the (identical) paper questionnaire have not been digitized. As a consequence of paper copies not being included, the average age of the respondents is relatively low (about 33) and more than sixty percent of them were aged below thirty. A total of 57.3 percent of the participants were female.

The responses were lemmatized in order to abstract away from variation in spelling. For example, ‘skive’, ‘scaive’, ‘scive’ (for the concept PLAY TRUANT) were grouped together. To simplify the data somewhat, we only select the top-ten variants for every concept (on average containing 84 percent of all responses). We group the responses by postcode area (there are a total of 121 UK postcode areas) and for every (lemmatized) variant we calculate the percentage of people in the postcode area using this variant. Our input data thus consists of a table with 121 rows (the postcode areas) and 380 columns (38 concepts having 10 variants each) containing these percentages.
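
The construction of this input table can be illustrated with a minimal Python sketch. It is not the authors' code. The function name `build_frequency_table`, the `(postcode_area, concept, variant)` tuple layout, and the toy responses are assumptions made for illustration. The sketch also assumes that the percentage is taken over all responses for that concept in the area, which the text does not spell out.

```python
from collections import Counter, defaultdict


def build_frequency_table(responses, top_n=10):
    """Build an area-by-variant percentage table.

    responses: iterable of (postcode_area, concept, lemmatized_variant)
    tuples (a hypothetical layout, not the real BBC Voices file format).
    """
    responses = list(responses)

    # Count how often each variant is given for each concept overall.
    per_concept = defaultdict(Counter)
    for _, concept, variant in responses:
        per_concept[concept][variant] += 1

    # Keep only the top-N variants per concept; these are the columns.
    columns = [
        (concept, variant)
        for concept in sorted(per_concept)
        for variant, _ in per_concept[concept].most_common(top_n)
    ]

    # Count responses per (area, concept) and per (area, concept, variant).
    totals = Counter()
    hits = Counter()
    for area, concept, variant in responses:
        totals[(area, concept)] += 1
        hits[(area, concept, variant)] += 1

    # One row per postcode area, holding percentages (assumed denominator:
    # all responses for that concept in that area).
    areas = sorted({area for area, _, _ in responses})
    table = {
        area: [
            100.0 * hits[(area, c, v)] / totals[(area, c)]
            if totals[(area, c)] else 0.0
            for c, v in columns
        ]
        for area in areas
    }
    return areas, columns, table


# Toy responses invented for illustration only.
responses = [
    ("LS", "PLAY TRUANT", "skive"),
    ("LS", "PLAY TRUANT", "skive"),
    ("LS", "PLAY TRUANT", "bunk off"),
    ("LS", "PLAY TRUANT", "wag"),
    ("BS", "PLAY TRUANT", "bunk off"),
    ("BS", "PLAY TRUANT", "skive"),
]
areas, columns, table = build_frequency_table(responses, top_n=2)
for area in areas:
    print(area, table[area])
```

With the full data and `top_n=10`, the same procedure would yield the 121 rows and 380 columns described above, provided every concept has at least ten distinct variants.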
