From tweets to trends: analyzing sociolinguistic variation and change using the Twitter Corpus of English in Hong Kong (TCOEHK)
Author: Wilkinson Daniel Wong Gonzales
Journal: Asian Englishes (JCR Q1, Linguistics; impact factor 1.6)
Published: 2023-10-18
DOI: https://doi.org/10.1080/13488678.2023.2251771
Citations: 0
Abstract
This article presents the Twitter Corpus of English in Hong Kong (TCOEHK): a 123-million-word corpus derived from sampling tweets across the 18 districts and three geographical (macro-)regions of Hong Kong from 2010 to 2022. It introduces the corpus and demonstrates its utility by examining four linguistic variables found in English in Hong Kong (EngHK) and the dominant variety Hong Kong English (HKE): tense marking, ‘-ize/-ise’ suffix use, adverb syntactic position, and copula (non-)use. It explores their relationship with intralinguistic, stylistic (e.g. formality), and extralinguistic factors (e.g. region, year, affect). The findings show that the distribution of variants in all four variables (e.g. rates of -ize use) is similar to the patterns identified in prior HKE work. In addition to confirming previous research, the results also reveal how intralinguistic, stylistic, and extralinguistic factors can each influence the distribution of variants differently depending on the variable studied, highlighting the complex and ever-changing nature of EngHK. The availability of social metadata and the large size of the TCOEHK make it viable for examining (socio)linguistic variation and changes in contemporary (Twitter-style) EngHK, as well as potential regional and social sub-varieties/styles within EngHK. It promises to advance research on variation and change in EngHK.

KEYWORDS: English in Hong Kong; language variation and change; regional variation; Bayesian and deep learning methods; language and social media

Acknowledgements
This article has benefitted from the support of The Chinese University of Hong Kong Faculty of Arts Direct Grant (Exploring Variation and Change in Chinese-related Multilingual Practices in East Asia, Project # 4051228).

Disclosure statement
No potential conflict of interest was reported by the author.

Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/13488678.2023.2251771

Notes
1. Since early 2023, Twitter has altered its API access, making it impossible for the public to scrape geo-location data without paying a large sum of money. This has made the TCOEHK additionally valuable.
2. See http://doi.org/10.17605/OSF.IO/RBFCH.
3. Both style predictors were estimated using Grafmiller, Szmrecsanyi, and Hinrichs' (2018) method.
4. I included ‘North vs. non-North’ as a variable as the North district is close to the Mainland border (see Figure 4), and speakers of English in this area may have patterns that differ from speakers living away from the border. This is plausible given the research on sociopolitical borders (including the Hong Kong–Shenzhen China border), which has uncovered the existence of complex and dynamic identities and the central role linguistic behavior plays in constructing and negotiating between identities (Danielewicz-Betz & Graddol, 2014; Holguín Mendoza, 2018; Watt, Llamas, Docherty, Hall, & Nycz, 2022).
5. Sentiment was extracted using an R package that estimates the polarity of a string. A positive value indicates that the utterance is positive while a negative value indicates that the utterance is negative (Rinker, 2022).
6. I also hope to demonstrate the application of the TCOEHK for conducting comprehensive and intricate sociolinguistic analyses. To achieve this, I will employ Wang et al.'s (2019) M3 demographic inference tool, utilizing the final variable as a specific case study. This program takes a Twitter identification number as input, which is available in the TCOEHK, and examines various aspects such as the profile image, username, screen name, and biography of the Twitter user. It then generates probabilities that indicate the age, sex, and entity type (i.e. organization or non-organization) of the user, with relatively high precision and recall (Macro-F1: gender = 0.918, age = 0.522, entity type = 0.898) (Wang et al., 2019). These probabilities, which pertain to imputed or stylistic age and sex, can subsequently be utilized to explore the relationship between the copula variable and two influential predictors of sociolinguistic variation: age and sex. This exploration can be conducted even in the absence of actual age and sex metadata within the corpus, providing valuable insights into the connection between the copula variable and these demographic factors.
7. The RegEx terms used were ‘did_AUX\s(?:n’t_PART\s|not_PART\s)?(?:say|make|know|think|see|want|take|go|need|find|come|use|give|help|look|like|work|keep|feel|become|believe|tell|go|try|love|base|understand|seem|start|provide|live)_VERB\s’ and ‘did_AUX\s(?:n’t_PART\s|not_PART\s)?(?:said|made|knew|thought|saw|wanted|took|went|needed|found|came|used|gave|helped|looked|liked|worked|kept|felt|became|believed|told|went|tried|loved|based|understood|seemed|started|provided|lived)_VERB\s’.
8. The 21 'eyes words' were selected based on their frequency in the GloWbE corpus. They are presented in note 9.
9. The RegEx terms used in my analyses are “recognize|realize|organize|utilize|minimize|maximize|apologize|emphasize|criticize|optimize|customize|specialize|summarize|visualize|capitalize|mobilize|prioritize|stabilize|characterize|authorize|memorize” and “recognise|realise|organise|utilise|minimise|maximise|apologise|emphasise|criticise|optimise|customise|specialise|summarise|visualise|capitalise|mobilise|prioritise|stabilise|characterise|authorise|memorise”. My TCOEHK analyses randomly sampled 50% of all tokens that met the RegEx criteria.
10. The RegEx used is “(?:also|already|only)_ADV\s[\w]+_VERB” and “[\w]+_VERB\s(?:[\w]+_(?:PROPN\s|NOUN\s|PRON\s))?(?:also|already|only)_ADV(?:$|\s[.!?,]_PUNCT)”.
11. The RegEx formula used is ‘(?:You|They|We|He|She|It)_PRON\s(?:is|are)_AUX\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)’ and ‘(?:You|They|We|He|She|It)_PRON\s(?:[\w]+_ADV\s)?(?:[\w]+_ADJ\s)’. The sampling rate is 50%.

Additional information
Funding
This work was supported by The Chinese University of Hong Kong Faculty of Arts [4051228].
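
As an illustration of how the tense-marking RegEx terms in note 7 might be applied, the following Python sketch counts bare versus past-marked verbs after "did". It is not the author's pipeline. It assumes that each tweet is stored as space-separated `token_TAG` pairs (which is what the patterns imply), that the negator may carry a straight or a curly apostrophe, and that a match may end the string; the function name `tense_marking_counts` and the sample tweets are hypothetical.

```python
import re

# Verb lists copied from note 7 (the duplicated "go"/"went" entries are dropped,
# which does not change what the alternation matches).
BARE = ("say|make|know|think|see|want|take|go|need|find|come|use|give|help|"
        "look|like|work|keep|feel|become|believe|tell|try|love|base|"
        "understand|seem|start|provide|live")
PAST = ("said|made|knew|thought|saw|wanted|took|went|needed|found|came|used|"
        "gave|helped|looked|liked|worked|kept|felt|became|believed|told|"
        "tried|loved|based|understood|seemed|started|provided|lived")

# Assumptions (not in the source): accept both apostrophe styles in "n't",
# and allow the verb to be the last token instead of requiring trailing whitespace.
TEMPLATE = r"did_AUX\s(?:n[’']t_PART\s|not_PART\s)?(?:{verbs})_VERB(?:\s|$)"

BARE_RE = re.compile(TEMPLATE.format(verbs=BARE))
PAST_RE = re.compile(TEMPLATE.format(verbs=PAST))


def tense_marking_counts(tagged_tweets):
    """Count 'did' + bare verb and 'did' + past-marked verb occurrences."""
    counts = {"bare": 0, "past": 0}
    for text in tagged_tweets:
        counts["bare"] += len(BARE_RE.findall(text))
        counts["past"] += len(PAST_RE.findall(text))
    return counts


# Hypothetical POS-tagged tweets, for illustration only.
sample = [
    "I_PRON did_AUX n't_PART know_VERB that_PRON",
    "She_PRON did_AUX went_VERB home_ADV",
    "We_PRON did_AUX not_PART see_VERB",
]
print(tense_marking_counts(sample))
```

The patterns stay case-sensitive, as in note 7, so a sentence-initial "Did" is not counted; whether that matches the original analysis is not stated in the source.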
About the journal:
Asian Englishes seeks to publish the best papers dealing with various issues involved in the diffusion of English and its diversification in Asia and the Pacific. It aims to promote better understanding of the nature of English and the role which it plays in the linguistic repertoire of those who live and work in Asia, both intra- and internationally, and in spoken and written form. The journal particularly highlights such themes as:
1. Varieties of English in Asia, including their divergence and convergence (phonetics, phonology, prosody, vocabulary, syntax, semantics, pragmatics, discourse, rhetoric)
2. ELT and English proficiency testing vis-a-vis English variation and international use of English
3. English as a language of international and intercultural communication in Asia
4. English-language journalism, literature, and other media
5. Social roles and functions of English in Asian countries
6. Multicultural English and mutual intelligibility
7. Language policy and language planning
8. Impact of English on other Asian languages
9. English-knowing bi- and multilingualism
10. English-medium education
11. Relevance of new paradigms, such as English as a Lingua Franca, to Asian contexts
12. The depth of penetration, use in various domains, and future direction of English in (the development of) Asian societies