A Construction Method of Electric Power Professional Domain Corpus Based on Multi-model Collaboration
Ke Zhang, Wengan Chen, Hongtao Zai, Hongying He, XiHao Yin, Diansheng Luo
2022 4th Asia Energy and Electrical Engineering Symposium (AEEES), published 2022-03-25
DOI: 10.1109/AEEES54426.2022.9759620 (https://doi.org/10.1109/AEEES54426.2022.9759620)
Citations: 1
Abstract
This paper proposes a method for constructing an electric power domain corpus through multi-method collaboration. The Jieba word segmentation method produces results of excessively fine granularity, which splits domain terms incorrectly. To counter this, the TF-IDF method is used to extract keywords from the Jieba segmentation results. An improved information entropy word combination method and the TextRank method are then applied to associate words in the Jieba segmentation results and form new phrases. Because the information entropy word segmentation method uses a strict phrase-forming rule, which may reduce the number of words formed, the results of all the above methods are collected to establish a relatively complete set of candidate words. An improved word2vec clustering algorithm is then presented to cluster electric power professional words and remove non-electric-power words. Through this multi-method collaborative algorithm, a more comprehensive electric power domain corpus is finally established. In comparison with the Jieba word segmentation method, the information entropy word combination algorithm (IEWCA), and the information entropy word segmentation algorithm (IEWSA), the experimental results show that the corpus constructed by the presented method is more accurate and has a richer vocabulary.
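The candidate-collection stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the "improved" entropy rule, so the code assumes a generic left/right branching-entropy test on adjacent token pairs and a plain smoothed TF-IDF score. The function names `tfidf_keywords` and `combine_adjacent`, the thresholds, and the toy documents are all hypothetical. The input is assumed to be already segmented (for example by Jieba), and the TextRank and word2vec clustering steps are left out.

```python
import math
from collections import Counter, defaultdict


def tfidf_keywords(docs, top_k=3):
    """Return the top_k tokens ranked by TF-IDF summed over all documents.

    docs is a list of token lists, assumed to come from a segmenter.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))
    scores = Counter()
    for doc in docs:
        for tok, count in Counter(doc).items():
            # Smoothed IDF (an assumption; the paper's exact formula is unknown).
            idf = math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0
            scores[tok] += (count / len(doc)) * idf
    return [tok for tok, _ in scores.most_common(top_k)]


def _entropy(counts):
    """Shannon entropy (natural log) of a non-empty Counter of neighbours."""
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())


def combine_adjacent(docs, min_count=2, min_entropy=0.5):
    """Merge adjacent token pairs that are frequent and have varied contexts.

    A pair is kept when it occurs at least min_count times and both its
    left-neighbour and right-neighbour entropies reach min_entropy.
    Thresholds are illustrative, not taken from the paper.
    """
    pair_count = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for doc in docs:
        for i in range(len(doc) - 1):
            pair = (doc[i], doc[i + 1])
            pair_count[pair] += 1
            # Sentence boundaries count as their own context symbols.
            left[pair][doc[i - 1] if i > 0 else "<s>"] += 1
            right[pair][doc[i + 2] if i + 2 < len(doc) else "</s>"] += 1
    phrases = set()
    for pair, count in pair_count.items():
        boundary_entropy = min(_entropy(left[pair]), _entropy(right[pair]))
        if count >= min_count and boundary_entropy >= min_entropy:
            phrases.add("".join(pair))
    return phrases


# Toy, hand-segmented sentences standing in for Jieba output.
docs = [
    ["高压", "断路器", "发生", "故障"],
    ["检修", "高压", "断路器", "设备"],
    ["更换", "高压", "断路器", "部件"],
]
keywords = tfidf_keywords(docs, top_k=2)
phrases = combine_adjacent(docs)
# Union of all methods' outputs forms the candidate word set.
candidates = set(keywords) | phrases
print(sorted(candidates))
```

In this toy input the over-segmented pair "高压" + "断路器" is recombined into a single candidate term because it recurs with different neighbours on both sides. Taking the union rather than the intersection mirrors the abstract's point that a strict phrase-forming rule alone would yield too few words. A later filtering step, such as the clustering the paper describes, would then be needed to discard non-domain candidates.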