人口背景对词汇使用的反映

IF 5.3 2区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computational Linguistics Pub Date : 2023-01-17 DOI:10.1162/coli_a_00475

Aparna Garimella, Carmen Banea, Rada Mihalcea

{"title":"人口背景对词汇使用的反映","authors":"Aparna Garimella, Carmen Banea, Rada Mihalcea","doi":"10.1162/coli_a_00475","DOIUrl":null,"url":null,"abstract":"The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.","PeriodicalId":55229,"journal":{"name":"Computational Linguistics","volume":"49 1","pages":"373-394"},"PeriodicalIF":5.3000,"publicationDate":"2023-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Reflection of Demographic Background on Word Usage\",\"authors\":\"Aparna Garimella, Carmen Banea, Rada Mihalcea\",\"doi\":\"10.1162/coli_a_00475\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.\",\"PeriodicalId\":55229,\"journal\":{\"name\":\"Computational Linguistics\",\"volume\":\"49 1\",\"pages\":\"373-394\"},\"PeriodicalIF\":5.3000,\"publicationDate\":\"2023-01-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computational Linguistics\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1162/coli_a_00475\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computational Linguistics","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1162/coli_a_00475","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

摘要

电子格式的个人写作为语言学、心理学和计算语言学领域的研究人员提供了一个前所未有的机会，可以大规模研究语言使用与作家人口背景之间的关系，使我们能够更好地了解不同人口结构的人。在这篇文章中，我们通过开发跨人口统计的单词模型来分析语言和人口统计之间的关系，以识别有使用偏见的单词，或者不同人口统计的说话者以显著不同的方式使用的单词。围绕三个人口统计学类别，即地点、性别和行业，我们确定了每个类别中具有显著用法差异的单词，并研究了对单词用法进行编码的各种方法，使我们能够确定导致差异的语言方面。我们使用基于主题的特征的单词模型在所有人口统计类别的准确率上都比基线提高了至少20%，即使是在分类为15个类别的场景中也是如此，这说明了基于主题的功能在识别单词使用差异方面的有用性。此外，我们注意到，对于地点和行业，从直接上下文中提取的主题是单词用法的最佳预测因素，暗示了单词含义及其语法功能对这些人口统计的重要性，而对于性别，从较长上下文中提取主题是单词使用的更好预测因素。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Reflection of Demographic Background on Word Usage

The availability of personal writings in electronic format provides researchers in the fields of linguistics, psychology, and computational linguistics with an unprecedented chance to study, on a large scale, the relationship between language use and the demographic background of writers, allowing us to better understand people across different demographics. In this article, we analyze the relation between language and demographics by developing cross-demographic word models to identify words with usage bias, or words that are used in significantly different ways by speakers of different demographics. Focusing on three demographic categories, namely, location, gender, and industry, we identify words with significant usage differences in each category and investigate various approaches of encoding a word’s usage, allowing us to identify language aspects that contribute to the differences. Our word models using topic-based features achieve at least 20% improvement in accuracy over the baseline for all demographic categories, even for scenarios with classification into 15 categories, illustrating the usefulness of topic-based features in identifying word usage differences. Further, we note that for location and industry, topics extracted from immediate context are the best predictors of word usages, hinting at the importance of word meaning and its grammatical function for these demographics, while for gender, topics obtained from longer contexts are better predictors for word usage.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Computational Linguistics 工程技术-计算机：跨学科应用

CiteScore

15.80

自引率

0.00%

发文量

审稿时长

>12 weeks

期刊介绍： Computational Linguistics, the longest-running publication dedicated solely to the computational and mathematical aspects of language and the design of natural language processing systems, provides university and industry linguists, computational linguists, AI and machine learning researchers, cognitive scientists, speech specialists, and philosophers with the latest insights into the computational aspects of language research.