T-LLaMA: a Tibetan large language model based on LLaMA2

IF 5.0 · CAS Zone 2 (Computer Science) · JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen
{"title":"T-LLaMA:基于LLaMA2的藏文大语言模型","authors":"Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen","doi":"10.1007/s40747-024-01641-7","DOIUrl":null,"url":null,"abstract":"<p>The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts like Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology for three downstream tasks: text classification, news text generation and automatic text summarization. To address the lack of corpus, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from META AI by expanding the Tibetan vocabulary using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on a publicly available dataset Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both news text generation and text summarization tasks. To our knowledge, T-LLaMA stands as the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also serves as foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.</p>","PeriodicalId":10524,"journal":{"name":"Complex & Intelligent Systems","volume":"10 1","pages":""},"PeriodicalIF":5.0000,"publicationDate":"2024-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"T-LLaMA: a Tibetan large language model based on LLaMA2\",\"authors\":\"Hui Lv, Chi Pu, La Duo, Yan Li, Qingguo Zhou, Jun Shen\",\"doi\":\"10.1007/s40747-024-01641-7\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages, particularly in academic research contexts like Tibetan. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology for three downstream tasks: text classification, news text generation and automatic text summarization. To address the lack of corpus, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from META AI by expanding the Tibetan vocabulary using SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on a publicly available dataset Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both news text generation and text summarization tasks. 
To our knowledge, T-LLaMA stands as the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also serves as foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.</p>\",\"PeriodicalId\":10524,\"journal\":{\"name\":\"Complex & Intelligent Systems\",\"volume\":\"10 1\",\"pages\":\"\"},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-12-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Complex & Intelligent Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s40747-024-01641-7\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Complex & Intelligent Systems","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s40747-024-01641-7","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract


The advent of ChatGPT and GPT-4 has generated substantial interest in large language model (LLM) research, showcasing remarkable performance in various applications such as conversation systems, machine translation, and research paper summarization. However, their efficacy diminishes when applied to low-resource languages such as Tibetan, particularly in academic research contexts. In this study, we trained Tibetan LLaMA (T-LLaMA), a model based on efficient pre-training technology, for three downstream tasks: text classification, news text generation, and automatic text summarization. To address the lack of a Tibetan corpus, we constructed a Tibetan dataset comprising 2.2 billion characters. Furthermore, we augmented the vocabulary of LLaMA2 from Meta AI by adding Tibetan tokens learned with SentencePiece. Notably, the text classification task attains a state-of-the-art (SOTA) accuracy of 79.8% on the publicly available Tibetan News Classification Corpus. In addition, manual review of 500 generated samples indicates satisfactory results in both the news text generation and text summarization tasks. To our knowledge, T-LLaMA stands as the first large-scale language model in Tibetan natural language processing (NLP) with parameters in the billion range. We openly provide our trained models, anticipating that this contribution not only fills gaps in the Tibetan large-scale language model domain but also provides foundational models for researchers with limited computational resources in the Tibetan NLP community. The T-LLaMA model is available at https://huggingface.co/Pagewood/T-LLaMA.
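Two short Python sketches may help make the pipeline described in the abstract concrete. The first illustrates the kind of SentencePiece-based vocabulary expansion mentioned above: training a Tibetan tokenizer on a raw corpus and adding its pieces to the LLaMA2 vocabulary. This is a minimal sketch under stated assumptions, not the authors' actual code; the corpus path, vocabulary size, and output names are hypothetical placeholders, not the settings reported in the paper.

```python
# Hypothetical sketch of SentencePiece-based vocabulary expansion for LLaMA2.
# Corpus path, vocab_size, and output names are illustrative assumptions.
import sentencepiece as spm
from transformers import LlamaTokenizer

# 1) Train a Tibetan SentencePiece model on a raw-text corpus
#    (assumed format: one sentence per line).
spm.SentencePieceTrainer.train(
    input="tibetan_corpus.txt",     # hypothetical corpus file
    model_prefix="tibetan_sp",
    vocab_size=20000,               # illustrative size, not the paper's value
    character_coverage=0.9995,
    model_type="bpe",
)

# 2) Add the learned Tibetan pieces to the LLaMA2 tokenizer (simplified:
#    appended as extra tokens rather than merged into the underlying
#    SentencePiece model).
base_tok = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo
sp = spm.SentencePieceProcessor(model_file="tibetan_sp.model")

tibetan_pieces = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]
existing = set(base_tok.get_vocab())
num_added = base_tok.add_tokens([p for p in tibetan_pieces if p not in existing])
print(f"Added {num_added} Tibetan tokens")

# The model's embedding matrix must then be resized to match, e.g.
# model.resize_token_embeddings(len(base_tok)), before continued pre-training.
base_tok.save_pretrained("t-llama-tokenizer")
```

The second sketch shows how the released checkpoint could be loaded for generation with the Hugging Face transformers library, assuming the repository follows the standard LLaMA2-style causal-LM layout; the prompt and decoding parameters are illustrative.

```python
# Hedged usage sketch for the released T-LLaMA checkpoint; assumes a standard
# causal-LM repo layout and uses illustrative decoding settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Pagewood/T-LLaMA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to fit a single GPU
    device_map="auto",          # requires the accelerate package
)

prompt = "..."  # replace with a Tibetan prompt, e.g. a news headline to continue
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```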

Source journal
Complex & Intelligent Systems (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)
CiteScore: 9.60 · Self-citation rate: 10.30% · Annual publications: 297
Journal overview: Complex & Intelligent Systems aims to provide a forum for presenting and discussing novel approaches, tools and techniques meant for attaining a cross-fertilization between the broad fields of complex systems, computational simulation, and intelligent analytics and visualization. The transdisciplinary research that the journal focuses on will expand the boundaries of our understanding by investigating the principles and processes that underlie many of the most profound problems facing society today.