Title: How do LLMs perform on Turkish? A multi-faceted multi-prompt evaluation
Authors: Mustafa Burak Topal, Aysun Bozanta, Ayşe Başar
Journal: Expert Systems with Applications, Volume 279, Article 127421
Publication date: 2025-04-02
Publication type: Journal Article
DOI: 10.1016/j.eswa.2025.127421
URL: https://www.sciencedirect.com/science/article/pii/S0957417425010437
Impact factor: 7.5 (JCR Q1, Computer Science, Artificial Intelligence)
Citation count: 0
Abstract
Turkish is a resourced language, but it remains under-researched, which has left it lagging behind recent advances in NLP focused on large language models (LLMs). Comprehensive evaluations and standardized benchmarks are crucial for advancing Turkish LLMs because they help identify strengths and weaknesses. This study evaluates LLMs in Turkish, focusing on their performance in understanding and trustworthiness tasks. Our analysis examines the models’ prompt robustness and compares fine-tuned LLMs with their chat-based counterparts. We evaluated 10 open-source models on 11 different tasks using 17 datasets, which comprised original Turkish sources and translations from English. We also included Turkish and multilingual pre-trained language models (PLMs) as baselines for certain tasks. The gemma-2-9b-it model outperformed other chat LLMs in both understanding and trustworthiness tasks. However, in fine-tuning experiments, no single model emerged as the best, and the top PLM achieved results comparable to the best LLM. Significant performance variations across paraphrased prompts highlight the need for improved robustness, which our results suggest can be achieved by fine-tuning. Models such as Trendyol-8B-chat-v2.0 and wiroai-turkish-llm-8b, adapted to Turkish via instruction tuning, often surpassed the LLMs they are based on. This suggests that adapting gemma-2-9b-it to Turkish might yield a model stronger than gemma-2-9b-it itself, the best model in our evaluation. This study evaluates Turkish LLMs and shares key insights from the experiments.
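The abstract reports performance variation across paraphrased prompts but does not say how that variation is measured. As a rough illustration only, the following Python sketch scores one model on the same examples under several paraphrases of a prompt and summarizes the spread in accuracy. The function names, the choice of accuracy as the metric, and the toy data are all assumptions made for this example and are not taken from the study.

```python
from statistics import mean, pstdev


def accuracy(predictions, labels):
    """Fraction of predictions that match the gold labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, labels)) / len(labels)


def prompt_robustness(predictions_by_prompt, labels):
    """Summarize how accuracy varies across paraphrased prompts.

    predictions_by_prompt maps a (hypothetical) prompt id to the model's
    predictions for the same examples, in the same order as labels.
    """
    scores = {
        prompt_id: accuracy(preds, labels)
        for prompt_id, preds in predictions_by_prompt.items()
    }
    values = list(scores.values())
    return {
        "per_prompt": scores,
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation over prompts
        "range": max(values) - min(values),
    }


# Hypothetical toy data: three paraphrases of one sentiment prompt.
labels = ["pos", "neg", "pos", "neg"]
predictions_by_prompt = {
    "paraphrase_1": ["pos", "neg", "pos", "neg"],
    "paraphrase_2": ["pos", "neg", "neg", "neg"],
    "paraphrase_3": ["neg", "neg", "pos", "pos"],
}

summary = prompt_robustness(predictions_by_prompt, labels)
print(summary["per_prompt"])
print(summary["mean"], summary["std"], summary["range"])
```

A small standard deviation or range would indicate that the model is relatively insensitive to how the prompt is worded; a large one would indicate the kind of variation the abstract describes. Other summaries, such as worst-case accuracy, would serve equally well under these assumptions.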
About the journal:
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.