Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage.

IF 2.2 Q3 HEALTH CARE SCIENCES & SERVICES

International Journal of Population Data Science Pub Date : 2025-03-11 eCollection Date: 2023-01-01 DOI:10.23889/ijpds.v8i5.2935

Joseph Lam, Mario Cortina-Borja, Robert Aldridge, Ruth Blackburn, Katie Harron

{"title":"Data Note: Alternative Name Encodings - Using Jyutping or Pinyin as tonal representations of Chinese names for data linkage.","authors":"Joseph Lam, Mario Cortina-Borja, Robert Aldridge, Ruth Blackburn, Katie Harron","doi":"10.23889/ijpds.v8i5.2935","DOIUrl":null,"url":null,"abstract":"<p><p>Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors-stemming from inconsistent name representations-can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation only reached 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.</p>","PeriodicalId":36483,"journal":{"name":"International Journal of Population Data Science","volume":"8 5","pages":"2935"},"PeriodicalIF":2.2000,"publicationDate":"2025-03-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11897931/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Population Data Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23889/ijpds.v8i5.2935","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2023/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}

引用次数: 0

Abstract

Accurate data linkage across large administrative databases is crucial for addressing complex research and policy questions, yet linkage errors-stemming from inconsistent name representations-can introduce biases, predominantly for names not given in English. This data note examines the impact of romanisation on linkage accuracy, focusing on Chinese names and comparing standardised systems (Jyutping and Pinyin) with the non-standardised Hong Kong Government Cantonese Romanisation (HKG-romanisation). We identify three primary issues: language-specific variations in romanisation, the loss of tonal information inherent to tonal languages, and discrepancies in name order conventions. Using a dataset of 771 Hong Kong student names, our analysis reveals that standardised romanisation systems enhance the uniqueness and consistency of name representations, thereby improving linkage precision and recall compared to HKG-romanisation. Specifically, Jyutping and Pinyin achieved over 95% recall in blocking strategies, whereas HKG-romanisation only reached 68.8%. Incorporating tonal information further improved recall. These findings underscore the necessity of adopting standardised, tone-sensitive romanisation systems and flexible database designs to reduce linkage errors and promote data equity for under-represented groups. We advocate for the implementation of phonetic encodings in databases, alongside language-specific pre-processing protocols, to ensure more inclusive and accurate data linkage processes.

Abstract Image

查看原文本刊更多论文

数据说明：备选名称编码-使用拼音或拼音作为数据链接的中文名称的音调表示。

跨大型管理数据库的准确数据链接对于解决复杂的研究和政策问题至关重要，然而链接错误——源于不一致的名称表示——可能会引入偏见，主要是对于非英文名称。本数据记录考察了罗马化对链接准确性的影响，重点是中文名称，并比较了标准化系统（拼音和拼音）和非标准化的香港政府粤语罗马化（HKG-romanisation）。我们确定了三个主要问题：罗马化的语言特定变化，声调语言固有的音调信息的丢失，以及名称顺序约定的差异。使用771个香港学生姓名数据集，我们的分析显示，标准化罗马化系统提高了姓名表示的唯一性和一致性，从而提高了连接精度和召回率。在屏蔽策略中，拼字和拼音的召回率达到95%以上，而香港字母罗马化的召回率仅为68.8%。结合音调信息进一步提高了记忆力。这些发现强调了采用标准化、音调敏感的罗马化系统和灵活的数据库设计的必要性，以减少链接错误，促进代表性不足群体的数据公平。我们提倡在数据库中实施语音编码，同时采用特定语言的预处理协议，以确保更包容和准确的数据链接过程。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊