Evaluating the impact of model scale and prompting strategies on corruption allegation classification using Thai-specialized Typhoon2 language models

Impact Factor: 4.9
Patipan Sriphon, Pattrawut Khunwipusit, Bamisaye Mayowa Emmanuel, Issara Sereewatthanawut
{"title":"Evaluating the impact of model scale and prompting strategies on corruption allegation classification using thai-specialized typhoon2 language models","authors":"Patipan Sriphon ,&nbsp;Pattrawut Khunwipusit ,&nbsp;Bamisaye Mayowa Emmanuel ,&nbsp;Issara Sereewatthanawut","doi":"10.1016/j.mlwa.2025.100743","DOIUrl":null,"url":null,"abstract":"<div><div>Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"22 ","pages":"Article 100743"},"PeriodicalIF":4.9000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025001264","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.
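The abstract describes the benchmarking setup only at a high level. As a rough illustration of the traditional baseline it refers to, the sketch below pairs TF-IDF features with a Random Forest and reports macro F1 using scikit-learn. This is a hypothetical reconstruction, not the authors' code: the complaint texts, labels, and hyperparameters are placeholders, and the NACC dataset and any Thai-specific preprocessing used in the study are not reproduced here.

```python
# Minimal, hypothetical sketch (not the authors' pipeline) of the traditional
# baseline the abstract benchmarks against: TF-IDF features fed to a
# Random Forest classifier, evaluated with macro F1.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Placeholder complaints and allegation-type labels. The study works on Thai
# text, which would additionally need a Thai-aware tokenizer passed to
# TfidfVectorizer; English placeholders are used here only so the sketch runs.
texts = [
    "official demanded payment to approve a construction permit",
    "inspector requested cash before signing the safety certificate",
    "clerk asked for a fee to speed up the license renewal",
    "officer took money to overlook a zoning violation",
    "procurement contract awarded to a relative without bidding",
    "tender specifications written to favor a single supplier",
    "purchase order split to stay under the bidding threshold",
    "winning bid leaked to a connected company in advance",
    "director used an agency vehicle for private business",
    "manager assigned staff to work on a personal property",
    "supervisor hired family members into vacant positions",
    "executive charged personal travel to the agency budget",
]
labels = ["bribery"] * 4 + ["procurement_fraud"] * 4 + ["abuse_of_power"] * 4

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF unigrams/bigrams feeding a class-weighted Random Forest, mirroring
# the kind of tree-based baseline described in the abstract.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", RandomForestClassifier(
        n_estimators=300, class_weight="balanced", random_state=42
    )),
])
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

# Macro F1 averages per-class F1 scores equally, so minority allegation
# classes count as much as the majority class -- the metric the abstract
# reports (e.g., Macro F1 = 0.514 for two-shot Typhoon2-7B).
print("Macro F1:", round(f1_score(y_test, preds, average="macro"), 3))
```

The few-shot prompting side of the comparison is not shown here; conceptually, it replaces this pipeline with a Typhoon2 model that receives zero, one, or two labeled complaint examples in the prompt before the complaint to be classified.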
Source journal: Machine learning with applications
Subject areas: Management Science and Operations Research, Artificial Intelligence, Computer Science Applications