{"title":"使用泰国专门的台风2语言模型评估模型规模和提示策略对腐败指控分类的影响","authors":"Patipan Sriphon , Pattrawut Khunwipusit , Bamisaye Mayowa Emmanuel , Issara Sereewatthanawut","doi":"10.1016/j.mlwa.2025.100743","DOIUrl":null,"url":null,"abstract":"<div><div>Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. 
The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.</div></div>","PeriodicalId":74093,"journal":{"name":"Machine learning with applications","volume":"22 ","pages":"Article 100743"},"PeriodicalIF":4.9000,"publicationDate":"2025-09-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Evaluating the impact of model scale and prompting strategies on corruption allegation classification using thai-specialized typhoon2 language models\",\"authors\":\"Patipan Sriphon , Pattrawut Khunwipusit , Bamisaye Mayowa Emmanuel , Issara Sereewatthanawut\",\"doi\":\"10.1016/j.mlwa.2025.100743\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. 
By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.</div></div>\",\"PeriodicalId\":74093,\"journal\":{\"name\":\"Machine learning with applications\",\"volume\":\"22 \",\"pages\":\"Article 100743\"},\"PeriodicalIF\":4.9000,\"publicationDate\":\"2025-09-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Machine learning with applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2666827025001264\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine learning with applications","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666827025001264","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Evaluating the impact of model scale and prompting strategies on corruption allegation classification using thai-specialized typhoon2 language models
Corruption complaint classification is a critical yet resource-intensive task in public sector governance, particularly in low-resource linguistic environments. This study assesses the capacity of Thai-specialized large language models (LLMs) from the Typhoon2 family to automate the classification of corruption complaints submitted to Thailand’s National Anti-Corruption Commission (NACC). Three variants—Typhoon2–3B (base), Typhoon2–3B (fine-tuned), and Typhoon2–7B (base)—were evaluated under zero-shot, one-shot, and two-shot prompting strategies and benchmarked against strong traditional machine learning models (Random Forest, XGBoost) trained on TF-IDF features. Results reaffirm the competitiveness of tree-based classifiers, which delivered consistently high and stable performance. Among the LLMs, the Typhoon2–7B model with two-shot prompting achieved the most balanced performance (Macro F1 = 0.514), highlighting emergent few-shot reasoning capabilities and improved handling of class imbalance. By contrast, fine-tuning the smaller 3B model induced severe overfitting and significant degradation on minority classes. These outcomes emphasize that model scale and prompt design are more reliable drivers of performance than direct fine-tuning in small, imbalanced settings. The study contributes practical guidance for deploying scalable and ethically aligned AI in governance, demonstrating that while traditional models remain robust benchmarks, large-scale prompted LLMs represent a promising complement for future public sector innovation.
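The zero-, one-, and two-shot prompting strategies described above can be sketched as a prompt builder. This is a minimal illustration, not the paper's actual setup: the real prompts would be in Thai, and the category labels, example complaints, and template wording here are all hypothetical stand-ins, since the abstract does not disclose them.

```python
# Hypothetical label set; the NACC's actual complaint taxonomy is not
# given in the abstract.
CATEGORIES = ["bribery", "embezzlement", "nepotism"]

def build_prompt(complaint: str, examples: list[tuple[str, str]]) -> str:
    """Assemble a k-shot classification prompt: k labelled demonstrations
    (an empty list gives the zero-shot case) followed by the target
    complaint, ending with an open "Category:" slot for the model."""
    lines = [
        "Classify the corruption complaint into one of: "
        + ", ".join(CATEGORIES) + "."
    ]
    for text, label in examples:  # the k in-context demonstrations
        lines.append(f"Complaint: {text}\nCategory: {label}")
    lines.append(f"Complaint: {complaint}\nCategory:")
    return "\n\n".join(lines)

# Two-shot variant: two demonstrations precede the target complaint.
two_shot = build_prompt(
    "Budget for the village road vanished without records.",
    examples=[
        ("Official asked for cash to approve a permit.", "bribery"),
        ("Contract given to the director's brother.", "nepotism"),
    ],
)
print(two_shot)
```

Under this scheme the only difference between the three strategies evaluated in the study is the number of demonstrations supplied, which isolates the effect of in-context examples from the model itself.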