Evaluating Biases in Large Language Models Over Time: A Framework With a GPT Case Study on Political Bias
Meltem Aksoy, Erik Weber, Jérôme Rutinowski, Niklas Jost, Markus Pauly
Applied Stochastic Models in Business and Industry, 42(2)
DOI: 10.1002/asmb.70078 (https://onlinelibrary.wiley.com/doi/10.1002/asmb.70078)
Publication date: 2026-03-03 (Journal Article)
Citations: 0
Abstract
Large Language Models (LLMs) have repeatedly been shown to reflect systematic biases. At the same time, commercial LLMs are updated at a rapid rate, often without notice to end-users, so that a bias profile captured today may already be outdated tomorrow. However, the literature still leans heavily on one-shot evaluations of single model versions, leaving a gap in our understanding of how biases evolve over time and how they should be monitored. We address this gap by introducing a framework for longitudinal evaluation of biases in LLMs, focusing on political bias as a case study. The framework is model-agnostic, reproducible, and user-friendly. It consists of (i) locking model versions via dated identifiers to guarantee temporal comparability; (ii) multi-prompt questionnaires on position statements to analyze potential biases; and (iii) a longitudinal statistical evaluation framework that quantifies and infers absolute bias and drifts between models. Moreover, we suggest conducting (iv) cross-questionnaire correlation analyses to reveal orthogonal biases, as well as (v) sensitivity analyses on the model's role-assignment mechanisms to analyze robustness to concrete instructions. All code, prompts, and outputs are openly available to facilitate replication and extension to other bias analyses. To illustrate the framework, we investigate the political biases and personality traits of ChatGPT, specifically comparing GPT-3.5, GPT-4, GPT-4o, and GPT-5.2. In addition, the ability of the models to emulate political viewpoints (e.g., liberal or conservative positions) is analyzed. Across 4000 generated answers, we observe clear political shifts between versions: while newer models appear less left-leaning, they still mimic progressive personality profiles and exhibit biases. These findings demonstrate the persistence and transformation of biases across updates, underlining the need for longitudinal monitoring.
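Step (iii), quantifying drift between model versions, could be sketched along the following lines. This is a minimal illustrative permutation test on hypothetical questionnaire agreement scores, not the paper's actual statistical procedure; the score encoding and sample data are assumptions made for illustration.

```python
import random
import statistics

def drift(scores_old, scores_new, n_perm=2000, seed=0):
    """Estimate drift between two model versions as the difference in mean
    questionnaire agreement scores, with a two-sided permutation p-value.

    Illustrative sketch only -- the paper's exact statistical evaluation
    may use different test statistics and inference procedures.
    """
    observed = statistics.mean(scores_new) - statistics.mean(scores_old)
    pooled = list(scores_old) + list(scores_new)
    rng = random.Random(seed)
    k = len(scores_old)
    extreme = 0
    for _ in range(n_perm):
        # Randomly reassign answers to the two versions and recompute the gap.
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[k:]) - statistics.mean(pooled[:k])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_perm

# Hypothetical per-statement scores (-2 strongly disagree .. +2 strongly agree)
# for the same questionnaire answered by an older and a newer model version.
old = [-1, -2, -1, 0, -1, -2, -1, -1]
new = [0, -1, 0, 1, 0, -1, 0, 0]
obs, p = drift(old, new)
```

A positive `obs` here would indicate the newer version agrees more (i.e., shifted rightward on statements coded this way); the p-value indicates whether the shift exceeds what random reassignment of answers would produce.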
About the journal:
ASMBI - Applied Stochastic Models in Business and Industry (formerly Applied Stochastic Models and Data Analysis) was first published in 1985, publishing contributions at the interface between stochastic modelling, data analysis, and their applications in business, finance, insurance, management, and production. In 2007 ASMBI became the official journal of the International Society for Business and Industrial Statistics (www.isbis.org). The main objective is to publish papers, both technical and practical, presenting new results that solve real-life problems or have great potential to do so. Mathematical rigour, innovative stochastic modelling, and sound applications are the key ingredients of papers to be published, after a very selective review process.
The journal is very open to new ideas, like Data Science and Big Data stemming from problems in business and industry or uncertainty quantification in engineering, as well as more traditional ones, like reliability, quality control, design of experiments, managerial processes, supply chains and inventories, insurance, econometrics, and financial modelling (provided the papers are related to real problems). The journal is also interested in papers addressing the effects of business and industrial decisions on the environment, healthcare, and social life. State-of-the-art computational methods are very welcome as well, when combined with sound applications and innovative models.