Evaluating the impact of model scale and prompting strategies on corruption allegation classification using Thai-specialized Typhoon2 language models

Impact Factor: 4.9
Patipan Sriphon, Pattrawut Khunwipusit, Bamisaye Mayowa Emmanuel, Issara Sereewatthanawut
{"title":"Evaluating the impact of model scale and prompting strategies on corruption allegation classification using thai-specialized typhoon2 language models","authors":"Patipan Sriphon ,&nbsp;Pattrawut Khunwipusit ,&nbsp;Bamisaye Mayowa Emmanuel ,&nbsp;Issara Sereewatthanawut","doi":"10.1016/j.mlwa.2025.100743","DOIUrl":null,"url":null,"abstract":"<div><div>Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"22 ","pages":"Article 100743"},"PeriodicalIF":4.9000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025001264","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.
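The abstract describes the benchmarking setup only at a high level. As a rough illustration of the traditional baseline it refers to, the sketch below pairs TF-IDF features with a Random Forest and reports macro F1 using scikit-learn. This is a hypothetical reconstruction, not the authors' code: the complaint texts, labels, and hyperparameters are placeholders, and the NACC dataset and any Thai-specific preprocessing used in the study are not reproduced here.

```python
# Minimal, hypothetical sketch (not the authors' pipeline) of the traditional
# baseline the abstract benchmarks against: TF-IDF features fed to a
# Random Forest classifier, evaluated with macro F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder complaints and allegation-type labels. The study works on Thai
# text, which would additionally need a Thai-aware tokenizer passed to
# TfidfVectorizer; English placeholders are used here only so the sketch runs.
texts = [
    "official demanded payment to approve a construction permit",
    "inspector requested cash before signing the safety certificate",
    "clerk asked for a fee to speed up the license renewal",
    "officer took money to overlook a zoning violation",
    "procurement contract awarded to a relative without bidding",
    "tender specifications written to favor a single supplier",
    "purchase order split to stay under the bidding threshold",
    "winning bid leaked to a connected company in advance",
    "director used an agency vehicle for private business",
    "manager assigned staff to work on a personal property",
    "supervisor hired family members into vacant positions",
    "executive charged personal travel to the agency budget",
]
labels = ["bribery"] * 4 + ["procurement_fraud"] * 4 + ["abuse_of_power"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF unigrams/bigrams feeding a class-weighted Random Forest, mirroring
# the kind of tree-based baseline described in the abstract.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(
        n_estimators=300, class_weight="balanced", random_state=42
    )),
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

# Macro F1 averages per-class F1 scores equally, so minority allegation
# classes count as much as the majority class -- the metric the abstract
# reports (e.g., Macro F1 = 0.514 for two-shot Typhoon2-7B).
print("Macro F1:", round(f1_score(y_test, preds, average="macro"), 3))
```

The few-shot prompting side of the comparison is not shown here; conceptually, it replaces this pipeline with a Typhoon2 model that receives zero, one, or two labeled complaint examples in the prompt before the complaint to be classified.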
Source journal: Machine learning with applications
Subject areas: Management Science and Operations Research, Artificial Intelligence, Computer Science Applications