From tweets to trends: analyzing sociolinguistic variation and change using the Twitter Corpus of English in Hong Kong (TCOEHK)

Impact factor: 1.6 (JCR Q1, Linguistics)
Wilkinson Daniel Wong Gonzales
Journal: Asian Englishes
Published: 2023-10-18
DOI: 10.1080/13488678.2023.2251771
Citations: 0

Abstract

This article presents the Twitter Corpus of English in Hong Kong (TCOEHK): a 123-million-word corpus derived from sampling tweets across the 18 districts and three geographical (macro-)regions of Hong Kong from 2010 to 2022. It introduces the corpus and demonstrates its utility by examining four linguistic variables found in English in Hong Kong (EngHK) and the dominant variety Hong Kong English (HKE): tense marking, ‘-ize/-ise’ suffix use, adverb syntactic position, and copula (non-)use. It explores their relationship with intralinguistic, stylistic (e.g. formality), and extralinguistic factors (e.g. region, year, affect). The findings show that the distribution of variants in all four variables (e.g. rates of -ize use) is similar to the patterns identified in prior HKE work. In addition to confirming previous research, the results also reveal how intralinguistic, stylistic, and extralinguistic factors can each influence the distribution of variants differently depending on the variable studied, highlighting the complex and ever-changing nature of EngHK. The availability of social metadata and the large size of the TCOEHK make it viable for examining (socio)linguistic variation and changes in contemporary (Twitter-style) EngHK, as well as potential regional and social sub-varieties/styles within EngHK. It promises to advance research on variation and change in EngHK.

Keywords

English in Hong Kong; language variation and change; regional variation; Bayesian and deep learning methods; language and social media

Acknowledgements

This article has benefitted from the support of The Chinese University of Hong Kong Faculty of Arts Direct Grant (Exploring Variation and Change in Chinese-related Multilingual Practices in East Asia, Project # 4051228).

Disclosure statement

No potential conflict of interest was reported by the author.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/13488678.2023.2251771

Notes

1. Since early 2023, Twitter has altered its API access, making it impossible for the public to scrape geo-location data without paying a large sum of money. This has made the TCOEHK additionally valuable.

2. See http://doi.org/10.17605/OSF.IO/RBFCH.

3. Both style predictors were estimated using Grafmiller, Szmrecsanyi, and Hinrichs' (2018) method.

4. I included ‘North vs. non-North’ as a variable as the North district is close to the Mainland border (see Figure 4), and speakers of English in this area may have patterns that differ from speakers living away from the border. This is plausible given the research on sociopolitical borders (including the Hong Kong–Shenzhen China border), which has uncovered the existence of complex and dynamic identities and the central role linguistic behavior plays in constructing and negotiating between identities (Danielewicz-Betz & Graddol, 2014; Holguín Mendoza, 2018; Watt, Llamas, Docherty, Hall, & Nycz, 2022).

5. Sentiment was extracted using an R package that estimates the polarity of a string. A positive value indicates that the utterance is positive while a negative value indicates that the utterance is negative (Rinker, 2022).

6. I also hope to demonstrate the application of the TCOEHK for conducting comprehensive and intricate sociolinguistic analyses. To achieve this, I will employ Wang et al.'s (2019) M3 demographic inference tool, utilizing the final variable as a specific case study. This program takes a Twitter identification number as input, which is available in the TCOEHK, and examines various aspects such as the profile image, username, screen name, and biography of the Twitter user. It then generates probabilities that indicate the age, sex, and entity type (i.e. organization or non-organization) of the user, with relatively high precision and recall (Macro-F1: gender = 0.918, age = 0.522, entity type = 0.898) (Wang et al., 2019). These probabilities, which pertain to imputed or stylistic age and sex, can subsequently be utilized to explore the relationship between the copula variable and two influential predictors of sociolinguistic variation: age and sex. This exploration can be conducted even in the absence of actual age and sex metadata within the corpus, providing valuable insights into the connection between the copula variable and these demographic factors.

7. The RegEx terms used were ‘did_AUX\s(?:n’t_PART\s|not_PART\s)?(?:say|make|know|think|see|want|take|go|need|find|come|use|give|help|look|like|work|keep|feel|become|believe|tell|go|try|love|base|understand|seem|start|provide|live)_VERB\s’ and ‘did_AUX\s(?:n’t_PART\s|not_PART\s)?(?:said|made|knew|thought|saw|wanted|took|went|needed|found|came|used|gave|helped|looked|liked|worked|kept|felt|became|believed|told|went|tried|loved|based|understood|seemed|started|provided|lived)_VERB\s’.

8. The 21 'eyes words' were selected based on their frequency in the GloWbE corpus. They are presented in note 9.

9. The RegEx terms used in my analyses are “recognize|realize|organize|utilize|minimize|maximize|apologize|emphasize|criticize|optimize|customize|specialize|summarize|visualize|capitalize|mobilize|prioritize|stabilize|characterize|authorize|memorize” and “recognise|realise|organise|utilise|minimise|maximise|apologise|emphasise|criticise|optimise|customise|specialise|summarise|visualise|capitalise|mobilise|prioritise|stabilise|characterise|authorise|memorise”. My TCOEHK analyses randomly sampled 50% of all tokens that met the RegEx criteria.

10. The RegEx used is “(?:also|already|only)_ADV\s[\w]+_VERB” and “[\w]+_VERB\s(?:[\w]+_(?:PROPN\s|NOUN\s|PRON\s))?(?:also|already|only)_ADV(?:$|\s[.!?,]_PUNCT)”.

11. The RegEx formula used is ‘(?:You|They|We|He|She|It)_PRON\s(?:is|are)_AUX\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)’ and ‘(?:You|They|We|He|She|It)_PRON\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)’. The sampling rate is 50%.

Additional information

Funding

This work was supported by The Chinese University of Hong Kong Faculty of Arts [4051228].
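The two patterns quoted in note 11 can be tried out on a few invented examples. The sketch below is illustrative only and is not the author's pipeline: it assumes that each tweet is stored as space-separated `word_TAG` tokens (a format inferred from the patterns themselves), the example tweets are made up, and the helper names `find_copula_tokens` and `sample_half` are hypothetical. The 50% sampling is shown as a simple random draw of half the hits, which is one possible reading of "the sampling rate is 50%".

```python
import random
import re

# Patterns copied from note 11: an overt copula (is/are) before an adjective,
# and the same frame with no copula at all.
OVERT_COPULA = re.compile(
    r"(?:You|They|We|He|She|It)_PRON\s(?:is|are)_AUX\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)"
)
ZERO_COPULA = re.compile(
    r"(?:You|They|We|He|She|It)_PRON\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)"
)

# Invented examples in the assumed word_TAG format (not taken from the TCOEHK).
EXAMPLE_TWEETS = [
    "They_PRON are_AUX very_ADV happy_ADJ today_NOUN",
    "She_PRON so_ADV cute_ADJ la_INTJ",
    "We_PRON went_VERB home_ADV",
    "It_PRON is_AUX good_ADJ ._PUNCT",
]


def find_copula_tokens(tagged_tweets):
    """Return (variant, matched_text) pairs for every pattern hit."""
    hits = []
    for tweet in tagged_tweets:
        for match in OVERT_COPULA.finditer(tweet):
            hits.append(("overt", match.group(0).strip()))
        for match in ZERO_COPULA.finditer(tweet):
            hits.append(("zero", match.group(0).strip()))
    return hits


def sample_half(hits, seed=0):
    """Draw a random half of the hits (assumed reading of the 50% rate)."""
    rng = random.Random(seed)
    return rng.sample(hits, len(hits) // 2)
```

Both patterns require whitespace after the adjective, so an adjective that is the last token in a tweet is not matched; this follows from the patterns as printed and is not a choice made in the sketch.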
Source journal
Asian Englishes
Category: Linguistics
CiteScore: 3.30
Self-citation rate: 18.80%
Articles published: 34
About the journal: Asian Englishes seeks to publish the best papers dealing with various issues involved in the diffusion of English and its diversification in Asia and the Pacific. It aims to promote better understanding of the nature of English and the role which it plays in the linguistic repertoire of those who live and work in Asia, both intra- and internationally, and in spoken and written form. The journal particularly highlights such themes as:
1. Varieties of English in Asia, including their divergence and convergence (phonetics, phonology, prosody, vocabulary, syntax, semantics, pragmatics, discourse, rhetoric)
2. ELT and English proficiency testing vis-a-vis English variation and international use of English
3. English as a language of international and intercultural communication in Asia
4. English-language journalism, literature, and other media
5. Social roles and functions of English in Asian countries
6. Multicultural English and mutual intelligibility
7. Language policy and language planning
8. Impact of English on other Asian languages
9. English-knowing bi- and multilingualism
10. English-medium education
11. Relevance of new paradigms, such as English as a Lingua Franca, to Asian contexts
12. The depth of penetration, use in various domains, and future direction of English in (the development of) Asian societies