Sensitive Data Classification of Imbalanced Short Text Based on Probability Distribution BERT in Electric power industry

2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA) Pub Date : 2023-03-01 DOI:10.1109/PRMVIA58252.2023.00034

Wensi Zhang, Xiao Liang, Yifang Zhang, Hanchen Su



Abstract

The exploitation of big data in industrial fields faces several challenges, including data privacy and security, data integration and interoperability, and data analysis and visualization. Data privacy and security is a major concern because data collected in industrial settings often contain sensitive information. The industrial setting poses two particular difficulties: (1) the distribution of data across categories is extremely uneven; and (2) the short texts that constitute the metadata contain many industry terms, which makes semantic representation difficult. Both difficulties have a large impact on the performance of existing models. To address them, this paper proposes a pre-training model based on probability distribution for classifying sensitive data in the power industry. The model consists of three modules: (1) a data enhancement module, which uses synonym expansion and noise introduction so that the model can extract classification features for sensitive data even though such data make up only a small proportion of the samples; (2) a pre-training module, which adopts BERT to capture the semantics of industry terms in short texts; and (3) a probability prediction module, which regularizes the distribution of the test data to match that of the training data. Compared with a traditional classification model and a deep-learning-based classification model, the proposed model improves the F1-score by 36.68% and 6.39%, respectively.
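The abstract does not specify how synonym expansion and noise introduction are implemented, so the following is only a minimal sketch of one plausible reading: minority-class short texts are oversampled by swapping in synonyms and randomly dropping tokens until every class matches the largest one. The synonym table, the probabilities, and the function names (`synonym_expand`, `add_noise`, `augment_minority`) are all hypothetical and are not taken from the paper.

```python
import random
from collections import Counter

# Hypothetical synonym table (assumption); a real system would rely on a
# domain lexicon of power-industry terms.
SYNONYMS = {
    "customer": ["client", "consumer"],
    "address": ["location"],
    "meter": ["gauge"],
}


def synonym_expand(tokens, rng, p=0.3):
    """Replace each token that has a known synonym with probability p."""
    return [
        rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
        for t in tokens
    ]


def add_noise(tokens, rng, p=0.1):
    """Drop each token with probability p, always keeping at least one."""
    if not tokens:
        return tokens
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


def augment_minority(samples, seed=0):
    """Oversample every class up to the size of the largest class.

    samples is a list of (text, label) pairs. The originals are kept
    unchanged; augmented copies are appended after them.
    """
    rng = random.Random(seed)
    counts = Counter(label for _, label in samples)
    target = max(counts.values())
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)

    out = list(samples)
    for label, texts in by_label.items():
        for _ in range(target - counts[label]):
            tokens = rng.choice(texts).split()
            tokens = add_noise(synonym_expand(tokens, rng), rng)
            out.append((" ".join(tokens), label))
    return out
```

Calling `augment_minority` on a list with three "public" texts and one "sensitive" text would return six pairs, with the two extra "sensitive" texts derived from the original one. Balancing to the majority class size is a design choice made here for simplicity; the paper may use a different target ratio.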


