{"title":"Small Language Models can Outperform Humans in Short Creative Writing: A Study Comparing SLMs with Humans and LLMs","authors":"Guillermo Marco, Luz Rello, Julio Gonzalo","doi":"arxiv-2409.11547","DOIUrl":"https://doi.org/arxiv-2409.11547","url":null,"abstract":"In this paper, we evaluate the creative fiction writing abilities of a\u0000fine-tuned small language model (SLM), BART Large, and compare its performance\u0000to humans and two large language models (LLMs): GPT-3.5 and GPT-4o. Our\u0000evaluation consists of two experiments: (i) a human evaluation where readers\u0000assess the stories generated by the SLM compared to human-written stories, and\u0000(ii) a qualitative linguistic analysis comparing the textual characteristics of\u0000the stories generated by the different models. In the first experiment, we\u0000asked 68 participants to rate short stories generated by the models and humans\u0000along dimensions such as grammaticality, relevance, creativity, and\u0000attractiveness. BART Large outperformed human writers in most aspects, except\u0000creativity, with an overall score of 2.11 compared to 1.85 for human-written\u0000texts -- a 14% improvement. In the second experiment, the qualitative analysis\u0000revealed that, while GPT-4o exhibited near-perfect internal and external\u0000coherence, it tended to produce more predictable narratives, with only 3% of\u0000its stories seen as novel. In contrast, 15% of BART's stories were considered\u0000novel, indicating a higher degree of creativity despite its smaller model size.\u0000This study provides both quantitative and qualitative insights into how model\u0000size and fine-tuning influence the balance between creativity, fluency, and\u0000coherence in creative writing tasks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SAGED: A Holistic Bias-Benchmarking Pipeline for Language Models with Customisable Fairness Calibration","authors":"Xin Guan, Nathaniel Demchak, Saloni Gupta, Ze Wang, Ediz Ertekin Jr., Adriano Koshiyama, Emre Kazim, Zekun Wu","doi":"arxiv-2409.11149","DOIUrl":"https://doi.org/arxiv-2409.11149","url":null,"abstract":"The development of unbiased large language models is widely recognized as\u0000crucial, yet existing benchmarks fall short in detecting biases due to limited\u0000scope, contamination, and lack of a fairness baseline. SAGED(-Bias) is the\u0000first holistic benchmarking pipeline to address these problems. The pipeline\u0000encompasses five core stages: scraping materials, assembling benchmarks,\u0000generating responses, extracting numeric features, and diagnosing with\u0000disparity metrics. SAGED includes metrics for max disparity, such as impact\u0000ratio, and bias concentration, such as Max Z-scores. Noticing that assessment\u0000tool bias and contextual bias in prompts can distort evaluation, SAGED\u0000implements counterfactual branching and baseline calibration for mitigation.\u0000For demonstration, we use SAGED on G20 Countries with popular 8b-level models\u0000including Gemma2, Llama3.1, Mistral, and Qwen2. With sentiment analysis, we\u0000find that while Mistral and Qwen2 show lower max disparity and higher bias\u0000concentration than Gemma2 and Llama3.1, all models are notably biased against\u0000countries like Russia and (except for Qwen2) China. With further experiments to\u0000have models role-playing U.S. (vice-/former-) presidents, we see bias amplifies\u0000and shifts in heterogeneous directions. Moreover, we see Qwen2 and Mistral not\u0000engage in role-playing, while Llama3.1 and Gemma2 role-play Trump notably more\u0000intensively than Biden and Harris, indicating role-playing performance bias in\u0000these models.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262357","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Norm of Mean Contextualized Embeddings Determines their Variance","authors":"Hiroaki Yamagiwa, Hidetoshi Shimodaira","doi":"arxiv-2409.11253","DOIUrl":"https://doi.org/arxiv-2409.11253","url":null,"abstract":"Contextualized embeddings vary by context, even for the same token, and form\u0000a distribution in the embedding space. To analyze this distribution, we focus\u0000on the norm of the mean embedding and the variance of the embeddings. In this\u0000study, we first demonstrate that these values follow the well-known formula for\u0000variance in statistics and provide an efficient sequential computation method.\u0000Then, by observing embeddings from intermediate layers of several Transformer\u0000models, we found a strong trade-off relationship between the norm and the\u0000variance: as the mean embedding becomes closer to the origin, the variance\u0000increases. This trade-off is likely influenced by the layer normalization\u0000mechanism used in Transformer models. Furthermore, when the sets of token\u0000embeddings are treated as clusters, we show that the variance of the entire\u0000embedding set can theoretically be decomposed into the within-cluster variance\u0000and the between-cluster variance. We found experimentally that as the layers of\u0000Transformer models deepen, the embeddings move farther from the origin, the\u0000between-cluster variance relatively decreases, and the within-cluster variance\u0000relatively increases. These results are consistent with existing studies on the\u0000anisotropy of the embedding spaces across layers.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Task Arithmetic for Language Expansion in Speech Translation","authors":"Yao-Fei Cheng, Hayato Futami, Yosuke Kashiwagi, Emiru Tsunoo, Wen Shen Teo, Siddhant Arora, Shinji Watanabe","doi":"arxiv-2409.11274","DOIUrl":"https://doi.org/arxiv-2409.11274","url":null,"abstract":"Recent advances in large language models (LLMs) have gained interest in\u0000speech-text multimodal foundation models, achieving strong performance on\u0000instruction-based speech translation (ST). However, expanding language pairs\u0000from an existing instruction-tuned ST system is costly due to the necessity of\u0000re-training on a combination of new and previous datasets. We propose to expand\u0000new language pairs by merging the model trained on new language pairs and the\u0000existing model, using task arithmetic. We find that the direct application of\u0000task arithmetic for ST causes the merged model to fail to follow instructions;\u0000thus, generating translation in incorrect languages. To eliminate language\u0000confusion, we propose an augmented task arithmetic method that merges an\u0000additional language control model. It is trained to generate the correct target\u0000language token following the instructions. Our experiments demonstrate that our\u0000proposed language control model can achieve language expansion by eliminating\u0000language confusion. In our MuST-C and CoVoST-2 experiments, it shows up to 4.66\u0000and 4.92 BLEU scores improvement, respectively. In addition, we demonstrate the\u0000use of our task arithmetic framework can expand to a language pair where\u0000neither paired ST training data nor a pre-trained ST model is available. We\u0000first synthesize the ST system from machine translation (MT) systems via task\u0000analogy, then merge the synthesized ST system to the existing ST model.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Says Who? Effective Zero-Shot Annotation of Focalization","authors":"Rebecca M. M. Hicke, Yuri Bizzoni, Pascale Feldkamp, Ross Deans Kristensen-McLachlan","doi":"arxiv-2409.11390","DOIUrl":"https://doi.org/arxiv-2409.11390","url":null,"abstract":"Focalization, the perspective through which narrative is presented, is\u0000encoded via a wide range of lexico-grammatical features and is subject to\u0000reader interpretation. Moreover, trained readers regularly disagree on\u0000interpretations, suggesting that this problem may be computationally\u0000intractable. In this paper, we provide experiments to test how well\u0000contemporary Large Language Models (LLMs) perform when annotating literary\u0000texts for focalization mode. Despite the challenging nature of the task, LLMs\u0000show comparable performance to trained human annotators in our experiments. We\u0000provide a case study working with the novels of Stephen King to demonstrate the\u0000usefulness of this approach for computational literary studies, illustrating\u0000how focalization can be studied at scale.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration","authors":"Jiahui Gao, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li","doi":"arxiv-2409.11365","DOIUrl":"https://doi.org/arxiv-2409.11365","url":null,"abstract":"The deployment of multimodal large language models (MLLMs) has demonstrated\u0000remarkable success in engaging in conversations involving visual inputs, thanks\u0000to the superior power of large language models (LLMs). Those MLLMs are\u0000typically built based on the LLMs, with an image encoder to process images into\u0000the token embedding space of the LLMs. However, the integration of visual\u0000modality has introduced a unique vulnerability: the MLLM becomes susceptible to\u0000malicious visual inputs and prone to generating sensitive or harmful responses,\u0000even though the LLM has been trained on textual dataset to align with human\u0000value. In this paper, we first raise the question: ``Do the MLLMs possess\u0000safety-awareness against malicious image inputs?\". We find that after adding a\u0000principle that specifies the safety requirement into the input of the MLLM, the\u0000model's safety awareness becomes boosted. This phenomenon verifies the\u0000existence of MLLM's safety-awareness against image inputs, it is only weakened\u0000by the modality gap. We then introduce a simple yet effective technique termed\u0000CoCA, which amplifies the safety-awareness of the MLLM by calibrating its\u0000output distribution. Our proposed strategy helps the model reclaim its original\u0000safety awareness without losing its original capabilities. We verify the\u0000effectiveness of our approach on both multimodal safety and understanding\u0000benchmarks.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement","authors":"Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee","doi":"arxiv-2409.11378","DOIUrl":"https://doi.org/arxiv-2409.11378","url":null,"abstract":"Finetuning large language models on instruction data is crucial for enhancing\u0000pre-trained knowledge and improving instruction-following capabilities. As\u0000instruction datasets proliferate, selecting optimal data for effective training\u0000becomes increasingly important. This work addresses the question: How can we\u0000determine the optimal subset of data for effective training? While existing\u0000research often emphasizes local criteria like instance quality for subset\u0000selection, we argue that a global approach focused on data diversity is more\u0000critical. Our method employs k-means clustering to ensure the selected subset\u0000effectively represents the full dataset. We propose an iterative refinement\u0000method inspired by active learning techniques to resample instances from\u0000clusters, reassessing each cluster's importance and sampling weight in every\u0000training iteration. This approach reduces the effect of outliers and\u0000automatically filters out clusters containing low-quality data. Through\u0000extensive evaluation across natural language reasoning, general world\u0000knowledge, code and math reasoning tasks, and by fine-tuning models from\u0000various families, we observe consistent improvements, achieving a 7% increase\u0000over random selection and a 3.8% improvement over state-of-the-art sampling\u0000methods. Our work highlights the significance of diversity-first sampling when\u0000finetuning LLMs to enhance performance across a broad array of evaluation\u0000tasks. Our code is available at\u0000https://github.com/for-ai/iterative-data-selection.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging Distillation Techniques for Document Understanding: A Case Study with FLAN-T5","authors":"Marcel Lamott, Muhammad Armaghan Shakir","doi":"arxiv-2409.11282","DOIUrl":"https://doi.org/arxiv-2409.11282","url":null,"abstract":"The surge of digital documents in various formats, including less\u0000standardized documents such as business reports and environmental assessments,\u0000underscores the growing importance of Document Understanding. While Large\u0000Language Models (LLMs) have showcased prowess across diverse natural language\u0000processing tasks, their direct application to Document Understanding remains a\u0000challenge. Previous research has demonstrated the utility of LLMs in this\u0000domain, yet their significant computational demands make them challenging to\u0000deploy effectively. Additionally, proprietary Blackbox LLMs often outperform\u0000their open-source counterparts, posing a barrier to widespread accessibility.\u0000In this paper, we delve into the realm of document understanding, leveraging\u0000distillation methods to harness the power of large LLMs while accommodating\u0000computational limitations. Specifically, we present a novel approach wherein we\u0000distill document understanding knowledge from the proprietary LLM ChatGPT into\u0000FLAN-T5. Our methodology integrates labeling and curriculum-learning mechanisms\u0000to facilitate efficient knowledge transfer. This work contributes to the\u0000advancement of document understanding methodologies by offering a scalable\u0000solution that bridges the gap between resource-intensive LLMs and practical\u0000applications. Our findings underscore the potential of distillation techniques\u0000in facilitating the deployment of sophisticated language models in real-world\u0000scenarios, thereby fostering advancements in natural language processing and\u0000document comprehension domains.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"12 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SpMis: An Investigation of Synthetic Spoken Misinformation Detection","authors":"Peizhuo Liu, Li Wang, Renqiang He, Haorui He, Lei Wang, Huadi Zheng, Jie Shi, Tong Xiao, Zhizheng Wu","doi":"arxiv-2409.11308","DOIUrl":"https://doi.org/arxiv-2409.11308","url":null,"abstract":"In recent years, speech generation technology has advanced rapidly, fueled by\u0000generative models and large-scale training techniques. While these developments\u0000have enabled the production of high-quality synthetic speech, they have also\u0000raised concerns about the misuse of this technology, particularly for\u0000generating synthetic misinformation. Current research primarily focuses on\u0000distinguishing machine-generated speech from human-produced speech, but the\u0000more urgent challenge is detecting misinformation within spoken content. This\u0000task requires a thorough analysis of factors such as speaker identity, topic,\u0000and synthesis. To address this need, we conduct an initial investigation into\u0000synthetic spoken misinformation detection by introducing an open-source\u0000dataset, SpMis. SpMis includes speech synthesized from over 1,000 speakers\u0000across five common topics, utilizing state-of-the-art text-to-speech systems.\u0000Although our results show promising detection capabilities, they also reveal\u0000substantial challenges for practical implementation, underscoring the\u0000importance of ongoing research in this critical area.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse","authors":"Maojia Song, Shang Hong Sim, Rishabh Bhardwaj, Hai Leong Chieu, Navonil Majumder, Soujanya Poria","doi":"arxiv-2409.11242","DOIUrl":"https://doi.org/arxiv-2409.11242","url":null,"abstract":"LLMs are an integral part of retrieval-augmented generation (RAG) systems.\u0000While many studies focus on evaluating the quality of end-to-end RAG systems,\u0000there is a lack of research on understanding the appropriateness of an LLM for\u0000the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a\u0000holistic evaluation of the trustworthiness of LLMs in an RAG framework. We show\u0000that various prompting methods, such as in-context learning, fail to adapt LLMs\u0000effectively to the RAG task. Thus, we propose Trust-Align, a framework to align\u0000LLMs for higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly\u0000outperforms open-source LLMs of comparable sizes on ASQA (up 10.7), QAMPARI (up\u000029.2) and ELI5 (up 14.9). We release our code at:\u0000https://github.com/declare-lab/trust-align.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142262352","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}