HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Treleaven
arXiv:2409.11579, arXiv - CS - Computation and Language, 17 September 2024

Stereotypes are generalised assumptions about societal groups, and even
state-of-the-art LLMs using in-context learning struggle to identify them
accurately. Due to the subjective nature of stereotypes, where what constitutes
a stereotype can vary widely depending on cultural, social, and individual
perspectives, robust explainability is crucial. Explainable models ensure that
these nuanced judgments can be understood and validated by human users,
promoting trust and accountability. We address these challenges by introducing
HEARTS (Holistic Framework for Explainable, Sustainable, and Robust Text
Stereotype Detection), a framework that enhances model performance, minimises
carbon footprint, and provides transparent, interpretable explanations. We
establish the Expanded Multi-Grain Stereotype Dataset (EMGSD), comprising
57,201 labelled texts across six groups, including under-represented
demographics like LGBTQ+ and regional stereotypes. Ablation studies confirm
that BERT models fine-tuned on EMGSD outperform those trained on individual
components. We then analyse a fine-tuned, carbon-efficient ALBERT-V2 model
using SHAP to generate token-level importance values, ensuring alignment with
human understanding, and calculate explainability confidence scores by
comparing SHAP and LIME outputs. Finally, HEARTS is applied to assess
stereotypical bias in 12 LLM outputs, revealing a gradual reduction in bias
over time within model families.
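
To make the sustainability and explainability steps described above concrete, the sketches below illustrate how such a pipeline could be assembled in Python. First, a minimal sketch of fine-tuning a compact ALBERT-V2 classifier while tracking the training carbon footprint with the CodeCarbon library. The abstract does not name a specific measurement tool, and the checkpoint name, dummy dataset, output directory, and hyperparameters here are illustrative assumptions rather than the authors' setup.

# Minimal sketch: fine-tune a compact ALBERT-V2 stereotype classifier while
# measuring the training footprint with CodeCarbon. The two-row in-memory
# dataset is a stand-in for EMGSD so the sketch stays self-contained.
from codecarbon import EmissionsTracker
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_NAME = "albert-base-v2"  # compact backbone; small parameter count keeps energy use down
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Placeholder rows; in practice these would be the EMGSD training split.
train_data = Dataset.from_dict({
    "text": ["an example stereotypical sentence", "an example neutral sentence"],
    "label": [1, 0],
}).map(lambda row: tokenizer(row["text"], truncation=True,
                             padding="max_length", max_length=64))

args = TrainingArguments(output_dir="albert-emgsd", num_train_epochs=1,
                         per_device_train_batch_size=2, report_to="none")

tracker = EmissionsTracker(project_name="hearts-finetune")  # estimates kg CO2-eq
tracker.start()
Trainer(model=model, args=args, train_dataset=train_data).train()
emissions_kg = tracker.stop()
print(f"Estimated training emissions: {emissions_kg:.6f} kg CO2-eq")

The same tracker can also wrap inference runs, so the energy cost of producing explanations can be reported alongside classification performance.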
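
Second, a sketch of the explanation step: token-level SHAP attributions for the classifier's prediction, a LIME explanation of the same text, and a simple agreement score between the two. The abstract does not give the formula for the explainability confidence score; the Spearman correlation over overlapping words below is only an illustrative proxy, and the public base checkpoint stands in for the EMGSD fine-tuned model.

# Minimal sketch: SHAP and LIME token attributions for one prediction, plus a
# rough agreement score. Spearman correlation over overlapping words is an
# assumed proxy for the paper's explainability confidence score.
import numpy as np
import shap
from lime.lime_text import LimeTextExplainer
from scipy.stats import spearmanr
from transformers import pipeline

# Substitute the EMGSD fine-tuned ALBERT-V2 checkpoint; the public base model
# (with a randomly initialised head) is used here only so the sketch runs.
clf = pipeline("text-classification", model="albert-base-v2", top_k=None)

def predict_proba(texts):
    # LIME expects an (n_samples, n_classes) probability matrix.
    outputs = clf(list(texts))
    return np.array([[d["score"] for d in sorted(out, key=lambda d: d["label"])]
                     for out in outputs])

text = "An example sentence whose prediction we want to explain."

# SHAP recognises the transformers pipeline and returns per-token values for
# every class; column 1 is taken as the 'stereotype' class here.
shap_values = shap.Explainer(clf)([text])
shap_tokens = [str(t).strip().lower() for t in shap_values.data[0]]
shap_weights = shap_values.values[0][:, 1]

# LIME perturbs the input text and fits a local linear surrogate model.
lime_exp = LimeTextExplainer().explain_instance(
    text, predict_proba, labels=(1,), num_features=10, num_samples=500)
lime_weights = {w.lower(): v for w, v in lime_exp.as_list(label=1)}

# Crude word-level alignment (subword effects ignored) followed by a rank
# correlation; low overlap simply yields no score rather than an error.
shared = [w for w in lime_weights if w in shap_tokens]
if len(shared) >= 2:
    rho, _ = spearmanr([float(shap_weights[shap_tokens.index(w)]) for w in shared],
                       [lime_weights[w] for w in shared])
    print(f"SHAP-LIME agreement (Spearman rho): {rho:.2f}")
else:
    print("Too few overlapping tokens to compute an agreement score.")

In HEARTS, comparisons of this kind between SHAP and LIME outputs feed the explainability confidence score mentioned in the abstract; the exact aggregation used by the authors is given in the paper itself.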