Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs
Authors: Guillermo Marco, Luz Rello, Julio Gonzalo
Venue: arXiv - CS - Computation and Language
Publication date: 2024-09-17
DOI: arxiv-2409.11547 (https://doi.org/arxiv-2409.11547)
Citations: 0
Abstract
In this paper, we evaluate the creative fiction writing abilities of a
fine-tuned small language model (SLM), BART Large, and compare its performance
to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our
evaluation consists of two experiments: (i) a human evaluation where readers
assess the stories generated by the SLM compared to human-written stories, and
(ii) a qualitative linguistic analysis comparing the textual characteristics of
the stories generated by the different models. In the first experiment, we
asked 68 participants to rate short stories generated by the models and humans
along dimensions such as grammaticality, relevance, creativity, and
attractiveness. BART Large outperformed human writers in most aspects, except
creativity, with an overall score of 2.11 compared to 1.85 for human-written
texts -- a 14% improvement. In the second experiment, the qualitative analysis
revealed that, while GPT-4o exhibited near-perfect internal and external
coherence, it tended to produce more predictable narratives, with only 3% of
its stories seen as novel. In contrast, 15% of BART's stories were considered
novel, indicating a higher degree of creativity despite its smaller model size.
This study provides both quantitative and qualitative insights into how model
size and fine-tuning influence the balance between creativity, fluency, and
coherence in creative writing tasks.
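
The 14% figure quoted above is consistent with a relative improvement computed from the two overall scores. The abstract does not state the formula, so the following is only a minimal sketch, assuming the percentage is (BART score - human score) / human score. The helper name `relative_improvement` is hypothetical and is not taken from the paper.

```python
def relative_improvement(new_score: float, baseline: float) -> float:
    """Return the relative change of new_score over baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new_score - baseline) / baseline * 100.0


# Overall scores reported in the abstract.
bart_overall = 2.11
human_overall = 1.85

# Assumed formula: relative change with the human score as the baseline.
improvement = relative_improvement(bart_overall, human_overall)
print(f"{improvement:.1f}%")  # roughly 14.1%, which rounds to the reported 14%
```

Under this assumption the result rounds to the 14% reported in the abstract; a different baseline or rounding convention would give a slightly different number.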