Multilingual Text Classifier Using Pre-Trained Universal Sentence Encoder Model

IF 0.3 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Radio Electronics Computer Science Control. Pub Date: 2022-10-16. DOI: 10.15588/1607-3274-2022-3-10

O. V. Orlovskiy, K. Sohrab, S. Ostapov, K. P. Hazdyuk, L. Shumylyak


Citations: 0

Abstract


MULTILINGUAL TEXT CLASSIFIER USING PRE-TRAINED UNIVERSAL SENTENCE ENCODER MODEL

Context. Online platforms and environments generate an ever-increasing volume of content, so automating the moderation of user-generated content remains a relevant task. Of particular note are cases in which, for one reason or another, very little data is available to train the classifier. To achieve good results under such conditions, it is important to build the classifier on pre-trained models that were themselves trained on a large and diverse body of data. This paper deals with the use of the pre-trained multilingual Universal Sentence Encoder (USE) model as a component of the developed classifier, and with the effect of hyperparameters on classification accuracy when training on a small amount of data (~0.05% of the dataset).

Objective. The goal of this paper is to investigate how the pre-trained multilingual model and the choice of hyperparameters used to train the text classifier influence the classification result.

Method. To solve this problem, a relatively new approach is used: few-shot learning, that is, learning from a relatively small number of examples. Since text is still the dominant way of transmitting information, studying whether a text classifier can be built from a small number of examples (~0.002–0.05% of the dataset) is a relevant problem.

Results. It is shown that even with a small number of training examples (36 per class), the use of USE together with an optimal training configuration makes it possible to achieve high classification accuracy on English and Russian data, which is extremely important when it is impossible to collect a large dataset of one's own. The influence of the USE-based approach and of a set of different hyperparameter configurations on the classifier's results is evaluated on English and Russian datasets.

Conclusions. The experiments show that the correct selection of hyperparameters matters significantly. In particular, this paper considered the batch size, the optimizer, the number of training epochs, and the percentage of the dataset used to train the classifier. Through experimentation, an optimal hyperparameter configuration was selected with which a classification accuracy of 86.46% on the Russian-language dataset and 91.13% on the English-language dataset can be achieved in ten seconds of training (training time can be significantly affected by the technical means used).
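The experimental setup summarized in the abstract (a fixed number of training examples per class, plus a sweep over batch size, optimizer, epochs, and training fraction) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names `sample_few_shot` and `hyperparameter_grid`, the toy labels, and all grid values except the two training-fraction endpoints are hypothetical and are not taken from the paper. The step that embeds text with USE and trains the classifier is deliberately left out.

```python
import itertools
import random
from collections import defaultdict


def sample_few_shot(labels, per_class, seed=0):
    """Return sorted indices holding `per_class` randomly chosen examples per label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        if len(indices) < per_class:
            raise ValueError(f"class {label!r} has only {len(indices)} examples")
        chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)


def hyperparameter_grid(space):
    """Yield one dict per combination of the values listed in `space`."""
    names = list(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


# The abstract names these four hyperparameters. Only the train_fraction
# endpoints (~0.002% and ~0.05%) come from the abstract; every other value
# here is a hypothetical placeholder.
SEARCH_SPACE = {
    "batch_size": [8, 16, 32],
    "optimizer": ["sgd", "adam"],
    "epochs": [5, 10, 20],
    "train_fraction": [0.00002, 0.0005],
}

if __name__ == "__main__":
    # Toy labels standing in for a real two-class dataset (assumption).
    labels = ["toxic" if i % 4 == 0 else "neutral" for i in range(1000)]
    subset = sample_few_shot(labels, per_class=36)
    share = len(subset) / len(labels)
    print(f"{len(subset)} training examples ({share:.2%} of the toy data)")
    configs = list(hyperparameter_grid(SEARCH_SPACE))
    print(f"{len(configs)} configurations to evaluate")
```

Sampling a fixed count per class keeps the few-shot training set balanced even when the full dataset is not, and the fixed seed makes each configuration in the sweep see the same examples, so differences in accuracy can be attributed to the hyperparameters rather than to the draw.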

