Sensitive Data Classification of Imbalanced Short Text Based on Probability Distribution BERT in Electric power industry
Authors: Wensi Zhang, Xiao Liang, Yifang Zhang, Hanchen Su
Venue: 2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms (PRMVIA)
Publication date: 2023-03-01
DOI: 10.1109/PRMVIA58252.2023.00034
URL: https://doi.org/10.1109/PRMVIA58252.2023.00034
Citations: 0
Abstract
Exploiting big data in industrial fields faces several challenges, including data privacy and security, data integration and interoperability, and data analysis and visualization. Data privacy and security is a major concern because data collected from industrial fields often contain sensitive information. The particular nature of the industrial field raises two further challenges in utilizing big data: (1) the distribution of data across categories is extremely uneven; and (2) the short texts that constitute the metadata contain many industry terms, which makes semantic representation difficult. Both challenges substantially degrade the performance of existing models. To address them, this paper proposes a pre-training model based on probability distribution for classifying sensitive data in the power industry. The model consists of three modules: (1) a data enhancement module that uses synonym expansion and noise introduction so that the model can extract classification features of sensitive data that make up only a small proportion of the dataset; (2) a pre-training module that adopts the BERT model to capture the semantics of industry terms in short texts; and (3) a probability prediction module that regularizes the distribution of the test data to match that of the training data. Compared with the traditional classification model and the deep-learning-based classification model, the F1-score improves by 36.68% and 6.39%, respectively.
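The data enhancement idea in the abstract (synonym expansion plus noise introduction for under-represented classes) can be illustrated with a minimal sketch. This is not the authors' implementation: the `SYNONYMS` table, the function names, the replacement and drop probabilities, and the choice of token dropping as the noise operation are all assumptions made for illustration.

```python
import random

# Hypothetical synonym table; a real system would use a domain lexicon.
SYNONYMS = {
    "customer": ["client", "consumer"],
    "address": ["location"],
    "salary": ["wage", "pay"],
}


def synonym_expand(tokens, rng, p=0.5):
    """Replace each token that has a known synonym with probability p."""
    out = []
    for tok in tokens:
        choices = SYNONYMS.get(tok)
        if choices and rng.random() < p:
            out.append(rng.choice(choices))
        else:
            out.append(tok)
    return out


def add_noise(tokens, rng, p=0.1):
    """Drop each token with probability p, keeping at least one token.

    Token dropping is an assumed noise operation, not one named in the paper.
    """
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def augment_minority(samples, target_per_class, seed=0):
    """Add augmented copies to every class below target_per_class.

    samples is a list of (text, label) pairs. The result contains the
    original pairs first, followed by the augmented copies.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)

    augmented = list(samples)
    for label, texts in by_label.items():
        for _ in range(max(0, target_per_class - len(texts))):
            tokens = rng.choice(texts).split()
            tokens = add_noise(synonym_expand(tokens, rng), rng)
            augmented.append((" ".join(tokens), label))
    return augmented
```

Calling `augment_minority(samples, target_per_class=n)` leaves classes that already have `n` or more samples unchanged and tops up smaller classes with perturbed copies, which is one simple way to reduce the class imbalance before fine-tuning a classifier.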