Few-shot Learning Named Entity Recognition of Pressure Sensor Patent Text Based on MLM
Yue Deng, Honghui Li, Xueliang Fu
2021 IEEE Conference on Telecommunications, Optics and Computer Science (TOCS), 2021-12-10
DOI: 10.1109/TOCS53301.2021.9688929 (https://doi.org/10.1109/TOCS53301.2021.9688929)
Abstract
Patent abstracts, an important support for intellectual property protection, are an ideal data source for technology mining. Named entity recognition (NER) on patent text can reduce the workload of patent analysis, improve efficiency, and provide effective technical means for patent discovery, patent promotion, patent infringement, and related tasks. However, the technical terms in patent texts are difficult to mine, extract, and label. This paper therefore proposes a few-shot named entity recognition method to address the lack of sufficient annotated data for pressure sensor patent text. The method uses the MLM (Masked Language Model) pretraining objective of the BERT model: a small fraction of tokens is masked each time and the model is trained repeatedly on the same sample, yielding embeddings that fuse bidirectional information learned from a massive continuous corpus. A CRF layer then decodes these embeddings into the predicted tag sequence. In experiments on 55 patent abstracts and 34 patent abstracts in the field of pressure sensor preparation, the results show that the proposed method improves recognition accuracy by about 10% over traditional machine learning models (HMM, CRF) in the small-sample setting. Compared with deep learning models (BiLSTM and BiLSTM+CRF), accuracy improves by about 30%, reaching 93%.
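The two mechanisms named in the abstract, masking a small fraction of tokens and decoding a tag sequence with a CRF, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `mask_tokens` and `viterbi_decode`, the 15% masking rate (BERT's usual default; the abstract only says "a small part"), and the plain replacement with `[MASK]` (BERT also substitutes random or unchanged tokens at some positions) are all assumptions made for illustration. The decoder is the standard Viterbi search over emission and transition scores, which is how a linear-chain CRF is normally decoded.

```python
import random

MASK = "[MASK]"  # placeholder token; assumed, following BERT's convention


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Mask a small fraction of tokens (hypothetical helper).

    Returns (masked, labels): labels hold the original token at masked
    positions and None elsewhere. Calling this repeatedly on the same
    sample yields a different mask each time.
    """
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return masked, labels


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag path for a linear-chain CRF.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for j in range(n_tags):
            # Best previous tag for reaching tag j at this position.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        backptr.append(prev_best)
        score = new_score
    # Follow back-pointers from the best final tag.
    best = max(range(n_tags), key=lambda j: score[j])
    path = [best]
    for prev_best in reversed(backptr):
        best = prev_best[best]
        path.append(best)
    path.reverse()
    return path
```

In a full model of the kind the abstract describes, the emission scores would come from the BERT encoder and the transition scores would be learned CRF parameters; here both are plain lists supplied by the caller.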