{"title":"迈向可比评级:探索德国医师评价的偏倚","authors":"Joschka Kersting, Falk Maoro, Michaela Geierhos","doi":"10.1016/j.datak.2023.102235","DOIUrl":null,"url":null,"abstract":"<div><p>In this study, we evaluate the impact of gender-biased data from German-language physician reviews on the fairness of fine-tuned language models. For two different downstream tasks, we use data reported to be gender biased and aggregate it with annotations. First, we propose a new approach to aspect-based sentiment analysis that allows identifying, extracting, and classifying implicit and explicit aspect phrases and their polarity within a single model. The second task we present is grade prediction, where we predict the overall grade of a review on the basis of the review text. For both tasks, we train numerous transformer models and evaluate their performance. The aggregation of sensitive attributes, such as a physician’s gender and migration background, with individual text reviews allows us to measure the performance of the models with respect to these sensitive groups. These group-wise performance measures act as extrinsic bias measures for our downstream tasks. In addition, we translate several gender-specific templates of the intrinsic bias metrics into the German language and evaluate our fine-tuned models. Based on this set of tasks, fine-tuned models, and intrinsic and extrinsic bias measures, we perform correlation analyses between intrinsic and extrinsic bias measures. In terms of sensitive groups and effect sizes, our bias measure results show different directions. Furthermore, correlations between measures of intrinsic and extrinsic bias can be observed in different directions. This leads us to conclude that gender-biased data does not inherently lead to biased models. Other variables, such as template dependency for intrinsic measures and label distribution in the data, must be taken into account as they strongly influence the metric results. Therefore, we suggest that metrics and templates should be chosen according to the given task and the biases to be assessed.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"148 ","pages":"Article 102235"},"PeriodicalIF":2.7000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0169023X23000952/pdfft?md5=035f0e2eec55531089e125433a25b2bc&pid=1-s2.0-S0169023X23000952-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Towards comparable ratings: Exploring bias in German physician reviews\",\"authors\":\"Joschka Kersting, Falk Maoro, Michaela Geierhos\",\"doi\":\"10.1016/j.datak.2023.102235\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>In this study, we evaluate the impact of gender-biased data from German-language physician reviews on the fairness of fine-tuned language models. For two different downstream tasks, we use data reported to be gender biased and aggregate it with annotations. First, we propose a new approach to aspect-based sentiment analysis that allows identifying, extracting, and classifying implicit and explicit aspect phrases and their polarity within a single model. The second task we present is grade prediction, where we predict the overall grade of a review on the basis of the review text. For both tasks, we train numerous transformer models and evaluate their performance. 
The aggregation of sensitive attributes, such as a physician’s gender and migration background, with individual text reviews allows us to measure the performance of the models with respect to these sensitive groups. These group-wise performance measures act as extrinsic bias measures for our downstream tasks. In addition, we translate several gender-specific templates of the intrinsic bias metrics into the German language and evaluate our fine-tuned models. Based on this set of tasks, fine-tuned models, and intrinsic and extrinsic bias measures, we perform correlation analyses between intrinsic and extrinsic bias measures. In terms of sensitive groups and effect sizes, our bias measure results show different directions. Furthermore, correlations between measures of intrinsic and extrinsic bias can be observed in different directions. This leads us to conclude that gender-biased data does not inherently lead to biased models. Other variables, such as template dependency for intrinsic measures and label distribution in the data, must be taken into account as they strongly influence the metric results. Therefore, we suggest that metrics and templates should be chosen according to the given task and the biases to be assessed.</p></div>\",\"PeriodicalId\":55184,\"journal\":{\"name\":\"Data & Knowledge Engineering\",\"volume\":\"148 \",\"pages\":\"Article 102235\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2023-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S0169023X23000952/pdfft?md5=035f0e2eec55531089e125433a25b2bc&pid=1-s2.0-S0169023X23000952-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Data & Knowledge Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0169023X23000952\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X23000952","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Towards comparable ratings: Exploring bias in German physician reviews
In this study, we evaluate the impact of gender-biased data from German-language physician reviews on the fairness of fine-tuned language models. For two different downstream tasks, we use data reported to be gender biased and aggregate it with annotations. First, we propose a new approach to aspect-based sentiment analysis that identifies, extracts, and classifies implicit and explicit aspect phrases and their polarity within a single model. The second task is grade prediction, where we predict the overall grade of a review from the review text. For both tasks, we train numerous transformer models and evaluate their performance. Aggregating sensitive attributes, such as a physician's gender and migration background, with individual text reviews allows us to measure model performance with respect to these sensitive groups. These group-wise performance measures serve as extrinsic bias measures for our downstream tasks. In addition, we translate several gender-specific templates of the intrinsic bias metrics into German and evaluate our fine-tuned models on them. Based on this set of tasks, fine-tuned models, and intrinsic and extrinsic bias measures, we perform correlation analyses between the intrinsic and extrinsic bias measures. Across sensitive groups and effect sizes, our bias measures point in different directions, and the correlations between intrinsic and extrinsic bias measures likewise vary in direction. This leads us to conclude that gender-biased data does not inherently lead to biased models. Other variables, such as the template dependency of intrinsic measures and the label distribution in the data, must be taken into account, as they strongly influence the metric results. We therefore suggest choosing metrics and templates according to the given task and the biases to be assessed.
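
To make the extrinsic bias measurement concrete, here is a minimal sketch of the grade-prediction setup: a transformer classifier predicts a review's overall grade, and the gap in macro-F1 between physician-gender groups serves as the extrinsic bias measure. The model name, the six-grade label scheme (German school grades 1-6), and the toy reviews are illustrative assumptions, not the paper's exact configuration; in practice the classifier would first be fine-tuned on graded reviews.

```python
# Sketch: grade prediction with a group-wise extrinsic bias measure.
# bert-base-german-cased, the 6-grade label scheme, and the toy data are
# assumptions for illustration; the paper fine-tunes numerous transformers.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import f1_score

MODEL_NAME = "bert-base-german-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=6)
model.eval()  # a freshly initialized head; fine-tuning on graded reviews comes first in practice

# Toy reviews, each aggregated with a sensitive attribute (physician gender)
# and a gold overall grade (0-indexed German school grades 1..6).
reviews = [
    "Sehr kompetente Ärztin, sie nimmt sich viel Zeit.",  # gender: f, grade 1
    "Lange Wartezeit und unfreundliches Personal.",       # gender: f, grade 5
    "Guter Arzt, die Behandlung hat mir geholfen.",       # gender: m, grade 2
    "Die Praxis war chaotisch organisiert.",              # gender: m, grade 4
]
genders = ["f", "f", "m", "m"]
true_grades = [0, 4, 1, 3]

with torch.no_grad():
    enc = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
    pred_grades = model(**enc).logits.argmax(dim=-1).tolist()

def group_f1(group: str) -> float:
    """Macro-F1 restricted to the reviews of one sensitive group."""
    idx = [i for i, g in enumerate(genders) if g == group]
    return f1_score([true_grades[i] for i in idx],
                    [pred_grades[i] for i in idx],
                    average="macro")

# The performance gap between sensitive groups is the extrinsic bias measure.
extrinsic_bias = group_f1("f") - group_f1("m")
print(f"Extrinsic bias (macro-F1 gap, f - m): {extrinsic_bias:+.3f}")
```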
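
The correlation analysis between intrinsic and extrinsic bias measures can then be sketched as follows. With one intrinsic score (e.g., a template-based gender-bias effect size) and one extrinsic score (e.g., the macro-F1 gap above) per fine-tuned model, a simple Pearson correlation relates the two families of measures; all numbers here are placeholders, not results from the paper.

```python
# Sketch: correlating intrinsic and extrinsic bias measures across models.
# The scores are illustrative placeholders, one (intrinsic, extrinsic) pair
# per fine-tuned model.
from scipy.stats import pearsonr

intrinsic_scores = [0.12, 0.05, 0.31, 0.18]   # e.g. template-based effect sizes
extrinsic_scores = [0.04, -0.02, 0.09, 0.06]  # e.g. group-wise macro-F1 gaps

r, p_value = pearsonr(intrinsic_scores, extrinsic_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```

Depending on the direction and strength of r across metrics, templates, and tasks, such an analysis can support or undercut the assumption that intrinsic bias predicts downstream (extrinsic) bias, which is what motivates the paper's task-specific choice of metrics and templates.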
Journal introduction:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.