A Construction Method of Electric Power Professional Domain Corpus Based on Multi-model Collaboration

Ke Zhang, Wengan Chen, Hongtao Zai, Hongying He, XiHao Yin, Diansheng Luo
DOI: 10.1109/AEEES54426.2022.9759620 (https://doi.org/10.1109/AEEES54426.2022.9759620)
Published in: 2022 4th Asia Energy and Electrical Engineering Symposium (AEEES), 2022-03-25
Citations: 1

Abstract

This paper proposes a method for constructing an electric power domain corpus based on multi-method collaboration. Jieba word segmentation tends to produce overly fine-grained results that split domain terms incorrectly. To mitigate this, the TF-IDF method is used to extract keywords from the Jieba segmentation results, and an improved information entropy word combination method and the TextRank method are applied to associate the segmented words into new phrases. Because the information entropy word segmentation method uses a strict phrase-forming rule, which may reduce the number of words formed, the results of all the above methods are collected to establish a relatively complete set of candidate words. An improved word2vec clustering algorithm is then presented to cluster electric power professional words and remove non-electric-power words. With this multi-method collaborative algorithm, a more comprehensive electric power domain corpus is established. Compared with the Jieba word segmentation method, the information entropy word combination algorithm (IEWCA), and the information entropy word segmentation algorithm (IEWSA), experimental results show that the corpus constructed by the proposed method is more accurate and has a richer vocabulary.
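The candidate-collection step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the "improved" entropy rule, the thresholds, or the TextRank and word2vec stages, so those stages are omitted here. The function names (`tfidf_keywords`, `branch_entropy`, `combine_by_entropy`), the parameters `min_count` and `min_entropy`, and the toy token lists (standing in for Jieba output) are all assumptions made for illustration. The entropy rule shown is the common left/right neighbour entropy heuristic: an adjacent token pair is merged into a phrase when it recurs and appears in varied contexts on both sides.

```python
import math
from collections import Counter, defaultdict


def tfidf_keywords(docs, top_k=2):
    """Return the top_k tokens of each document ranked by plain TF-IDF."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    result = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n_docs / df[t])
                  for t, c in tf.items()}
        # Sort by descending score, then by token to make ties deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:top_k])
    return result


def branch_entropy(neighbours):
    """Shannon entropy (natural log) of a neighbour-count distribution."""
    total = sum(neighbours.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbours.values())


def combine_by_entropy(docs, min_count=2, min_entropy=0.5):
    """Merge adjacent token pairs that recur in varied left/right contexts.

    min_count and min_entropy are hypothetical thresholds, not values
    taken from the paper.
    """
    pair_count = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for doc in docs:
        padded = ["<s>"] + list(doc) + ["</s>"]  # sentence boundary markers
        for i in range(1, len(padded) - 2):
            pair = (padded[i], padded[i + 1])
            pair_count[pair] += 1
            left[pair][padded[i - 1]] += 1
            right[pair][padded[i + 2]] += 1
    phrases = [
        "".join(pair)
        for pair, count in pair_count.items()
        if count >= min_count
        and min(branch_entropy(left[pair]),
                branch_entropy(right[pair])) >= min_entropy
    ]
    return sorted(phrases)


# Toy input: each document is a token list, standing in for Jieba output.
docs = [
    ["电力", "变压器", "发生", "故障"],
    ["检修", "电力", "变压器", "绕组"],
    ["该", "电力", "变压器", "已", "投运"],
]

keywords = {tok for doc_kw in tfidf_keywords(docs) for tok in doc_kw}
phrases = combine_by_entropy(docs)
# Union of the outputs of both methods forms the candidate word set.
candidates = keywords | set(phrases)
print(sorted(candidates))
```

In this toy input the over-split pair "电力" + "变压器" (power + transformer) is recombined into one phrase, while TF-IDF contributes document-specific tokens. Taking the union of both outputs reflects the abstract's point that a strict phrase-forming rule alone would yield too few candidates. A later filtering stage, such as the word2vec clustering the paper describes, would then remove non-domain words.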