Automatic keyword extraction based on dependency parsing and BERT semantic weighting

Huixin Liu
DOI: 10.1117/12.2667242
Journal: Third International Seminar on Artificial Intelligence, Networking, and Information Technology
Published: 2023-02-22
Citations: 0

Abstract

The classic TextRank algorithm struggles to differentiate the degree of association between candidate keyword nodes. Furthermore, it tends to ignore long-distance syntactic relations and topic-level semantic information between words when extracting keywords from a document. To address this problem, we propose an improved TextRank algorithm that uses lexical, grammatical, and semantic features to extract objective keywords from Chinese academic text. First, we construct a word graph of candidate keywords after text preprocessing. Second, we integrate multidimensional features of the candidate words into the initial calculation of the transition probability matrix. To this end, our approach mines the full text to extract a collection of grammatical and morphological features, such as part of speech, word position, long-distance dependencies, and dynamic BERT semantic information. By introducing dependency parsing of long sentences, the algorithm's ability to identify low-frequency topic keywords improves markedly. In addition, external semantic information is imported through the word embedding model. The merged feature-based matrix is then used to compute the influence of all candidate keyword nodes with the iterative PageRank formula: candidate nodes are ranked by their comprehensive influence scores, and the top N are selected as the final keywords. We verify the effectiveness of the proposed algorithm on public data sets. Our approach achieves a 5.5% F-score improvement (at 4 keywords) over the classic TextRank. The experimental results demonstrate that mining combined long-text features better widens the differentiation of association between nodes. The results also show that the proposed algorithm is more promising, and its extraction is more robust, than previously studied ensemble methods.
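The core pipeline the abstract describes — a candidate-word graph whose edge weights carry extra features, ranked by PageRank-style iteration — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `sim` hook stands in for the paper's combined weighting (POS, word position, dependency relations, BERT cosine similarity), and here edges default to plain co-occurrence counts.

```python
from collections import defaultdict

def textrank(words, window=2, sim=None, d=0.85, iters=50, top_n=4):
    """Rank candidate keywords by PageRank iteration over a word graph.

    `sim(u, v)` is an optional edge-weighting hook; the paper fuses
    syntactic and BERT semantic features here, while this sketch
    defaults every co-occurrence to weight 1.0.
    """
    # Build an undirected co-occurrence graph within a sliding window.
    weights = defaultdict(float)
    for i, u in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            v = words[j]
            if u == v:
                continue
            w = sim(u, v) if sim else 1.0
            weights[(u, v)] += w
            weights[(v, u)] += w

    nodes = sorted(set(words))
    out_sum = {u: sum(weights[(u, v)] for v in nodes) for u in nodes}
    score = {u: 1.0 / len(nodes) for u in nodes}

    # PageRank-style update: influence flows along feature-weighted,
    # normalized transition probabilities.
    for _ in range(iters):
        score = {
            u: (1 - d) + d * sum(
                score[v] * weights[(v, u)] / out_sum[v]
                for v in nodes
                if out_sum[v] > 0 and weights[(v, u)] > 0
            )
            for u in nodes
        }

    # Select the final top-N keywords by comprehensive influence score.
    return sorted(score, key=score.get, reverse=True)[:top_n]

# Toy usage on a hypothetical token stream (after preprocessing).
tokens = "graph ranking keyword graph semantic keyword ranking graph".split()
top = textrank(tokens, top_n=2)
```

Swapping the default `sim` for, e.g., cosine similarity between BERT embeddings is how semantic weighting would enter this sketch; the paper additionally folds dependency-parse links into the same transition matrix.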