基于Transformer的葡萄牙语新闻文本分类模型

Isabel N. Santana, R. S. Oliveira, E. G. S. Nascimento
{"title":"基于Transformer的葡萄牙语新闻文本分类模型","authors":"Isabel N. Santana, R. S. Oliveira, E. G. S. Nascimento","doi":"10.54808/jsci.20.05.33","DOIUrl":null,"url":null,"abstract":"This work proposes the use of a fine-tuned Transformers-based Natural Language Processing (NLP) model called BERTimbau to generate the word embeddings from texts published in a Brazilian newspaper, to create a robust NLP model to classify news in Portuguese, a task that is costly for humans to perform for big amounts of data. To assess this approach, besides the generation of embeddings by the fine-tuned BERTimbau, a comparative analysis was conducted using the Word2Vec technique. The first step of the work was to rearrange news from nineteen to ten categories to reduce the existence of class imbalance in the corpus, using the K-means and TF-IDF techniques. In the Word2Vec step, the CBOW and Skip-gram architectures were applied. In BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news as a unique embedding, generating a document embedding for each news. Metrics accuracy, weighted accuracy, precision, recall, F1-Score, AUC ROC and AUC PRC were applied to evaluate the results. It was noticed that the fine-tuned BERTimbau captured distinctions in the texts of the different categories, showing that the classification model based on this model has a superior performance than the other explored techniques.","PeriodicalId":30249,"journal":{"name":"Journal of Systemics Cybernetics and Informatics","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Text Classification of News Using Transformer-based Models for Portuguese\",\"authors\":\"Isabel N. Santana, R. S. Oliveira, E. G. S. Nascimento\",\"doi\":\"10.54808/jsci.20.05.33\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This work proposes the use of a fine-tuned Transformers-based Natural Language Processing (NLP) model called BERTimbau to generate the word embeddings from texts published in a Brazilian newspaper, to create a robust NLP model to classify news in Portuguese, a task that is costly for humans to perform for big amounts of data. To assess this approach, besides the generation of embeddings by the fine-tuned BERTimbau, a comparative analysis was conducted using the Word2Vec technique. The first step of the work was to rearrange news from nineteen to ten categories to reduce the existence of class imbalance in the corpus, using the K-means and TF-IDF techniques. In the Word2Vec step, the CBOW and Skip-gram architectures were applied. In BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news as a unique embedding, generating a document embedding for each news. Metrics accuracy, weighted accuracy, precision, recall, F1-Score, AUC ROC and AUC PRC were applied to evaluate the results. It was noticed that the fine-tuned BERTimbau captured distinctions in the texts of the different categories, showing that the classification model based on this model has a superior performance than the other explored techniques.\",\"PeriodicalId\":30249,\"journal\":{\"name\":\"Journal of Systemics Cybernetics and Informatics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Systemics Cybernetics and Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.54808/jsci.20.05.33\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Systemics Cybernetics and Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.54808/jsci.20.05.33","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

摘要

这项工作建议使用一种名为BERTimbau的基于变压器的微调自然语言处理(NLP)模型,从巴西报纸上发布的文本中生成单词嵌入,创建一个强大的NLP模型来对葡萄牙语新闻进行分类,这项任务对人类来说是一项成本高昂的任务。为了评估这种方法,除了通过微调的BERTimbau生成嵌入之外,还使用Word2Vec技术进行了比较分析。这项工作的第一步是使用K-means和TF-IDF技术,将新闻从19个类别重新排列到10个类别,以减少语料库中类别失衡的存在。在Word2Vec步骤中,应用了CBOW和Skip gram架构。在BERTimbau和Word2Vec步骤中,使用Doc2Vec方法将每条新闻表示为唯一的嵌入,为每条新闻生成文档嵌入。采用指标准确度、加权准确度、准确度、召回率、F1评分、AUC ROC和AUC PRC来评估结果。值得注意的是,经过微调的BERTimbau捕捉到了不同类别文本中的差异,表明基于该模型的分类模型比其他探索技术具有更好的性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Text Classification of News Using Transformer-based Models for Portuguese
This work proposes the use of a fine-tuned Transformers-based Natural Language Processing (NLP) model called BERTimbau to generate the word embeddings from texts published in a Brazilian newspaper, to create a robust NLP model to classify news in Portuguese, a task that is costly for humans to perform for big amounts of data. To assess this approach, besides the generation of embeddings by the fine-tuned BERTimbau, a comparative analysis was conducted using the Word2Vec technique. The first step of the work was to rearrange news from nineteen to ten categories to reduce the existence of class imbalance in the corpus, using the K-means and TF-IDF techniques. In the Word2Vec step, the CBOW and Skip-gram architectures were applied. In BERTimbau and Word2Vec steps, the Doc2Vec method was used to represent each news as a unique embedding, generating a document embedding for each news. Metrics accuracy, weighted accuracy, precision, recall, F1-Score, AUC ROC and AUC PRC were applied to evaluate the results. It was noticed that the fine-tuned BERTimbau captured distinctions in the texts of the different categories, showing that the classification model based on this model has a superior performance than the other explored techniques.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
44
审稿时长
12 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信