{"title":"Capturing Fine-Grained Regional Differences in Language Use through Voting Precinct Embeddings","authors":"Alex Rosenfeld, L. Hinrichs","doi":"10.1162/coli_a_00487","DOIUrl":null,"url":null,"abstract":"\n Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to find fine grained distinctions, such as intracity differences in language use. We demonstrate that it is possible to embed smaller areas which can provide higher resolution analyses of language variation. We embed voting precincts which are tiny, evenly sized political divisions for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse with many areas having scant social media data.We propose a novel embedding approach that alternates training with smoothing which mitigates these sparsity issues. We focus on linguistic variation across Texas as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preference of lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement this second evaluation with a methodology for using embeddings as a kind of genetic code where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":" ","pages":""},"PeriodicalIF":3.7000,"publicationDate":"2023-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00487","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Abstract
Linguistic variation across a region of interest can be captured by partitioning the region into areas and using social media data to train embeddings that represent language use in those areas. Recent work has focused on larger areas, such as cities or counties, to ensure that enough social media data is available in each area, but larger areas have a limited ability to capture fine-grained distinctions, such as intra-city differences in language use. We demonstrate that it is possible to embed smaller areas, which can provide higher-resolution analyses of language variation. We embed voting precincts, which are tiny, evenly sized political divisions created for the administration of elections. The issue with modeling language use in small areas is that the data becomes incredibly sparse, with many areas having scant social media data. We propose a novel embedding approach that alternates training with smoothing, which mitigates these sparsity issues. We focus on linguistic variation across Texas, as it is relatively understudied. We developed two novel quantitative evaluations that measure how well the embeddings can be used to capture linguistic variation. The first evaluation measures how well a model can map a dialect given terms specific to that dialect. The second evaluation measures how well a model can map preferences for lexical variants. These evaluations show how embedding models could be used directly by sociolinguists and measure how much sociolinguistic information is contained within the embeddings. We complement the second evaluation with a methodology for using embeddings as a kind of genetic code, where we identify “genes” that correspond to a sociological variable and connect those “genes” to a linguistic phenomenon, thereby connecting sociological phenomena to linguistic ones. Finally, we explore approaches for inferring isoglosses using embeddings.
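The central sparsity-mitigation idea described above, alternating an embedding-training step with a smoothing step that shares signal across neighboring precincts, can be illustrated with a minimal sketch. The Python code below is not the authors' implementation: the `train_step` update, the blending weight `alpha`, and the toy precinct adjacency structure are all illustrative assumptions, intended only to show how alternating training and smoothing lets sparse precincts inherit information from better-documented neighbors.

```python
# Minimal sketch (assumed, not the paper's algorithm) of alternating
# embedding training with spatial smoothing over voting precincts.
import numpy as np

def smooth(embeddings, neighbors, alpha=0.5):
    """Blend each precinct's embedding with the mean of its neighbors."""
    smoothed = embeddings.copy()
    for p, nbrs in neighbors.items():
        if nbrs:
            neighbor_mean = embeddings[nbrs].mean(axis=0)
            smoothed[p] = (1 - alpha) * embeddings[p] + alpha * neighbor_mean
    return smoothed

def train_step(embeddings, data, lr=0.1):
    """Placeholder training step: nudge embeddings of precincts that have
    data toward a target vector standing in for their observed language use."""
    for p, target in data.items():
        embeddings[p] += lr * (target - embeddings[p])
    return embeddings

# Toy setup: four precincts in a line; only precincts 0 and 3 have data.
dim = 8
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, dim))
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
data = {0: np.ones(dim), 3: -np.ones(dim)}

for _ in range(10):                  # alternate training and smoothing
    emb = train_step(emb, data)      # fit precincts that have data
    emb = smooth(emb, neighbors)     # propagate signal to data-sparse precincts
```

In this toy run, precincts 1 and 2 never see any data directly, yet after a few iterations their embeddings interpolate between their neighbors, which is the kind of behavior the alternating scheme is meant to provide for precincts with little social media text.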
About the Journal
Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.