An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus

IF 2.7 3区 计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Liang-Ching Chen
{"title":"An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus","authors":"Liang-Ching Chen","doi":"10.1016/j.datak.2024.102322","DOIUrl":null,"url":null,"abstract":"<div><p>Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning’s Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.</p></div>","PeriodicalId":55184,"journal":{"name":"Data & Knowledge Engineering","volume":"153 ","pages":"Article 102322"},"PeriodicalIF":2.7000,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Data & Knowledge Engineering","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0169023X24000466","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Keyword extraction involves the application of Natural Language Processing (NLP) algorithms or models developed in the realm of text mining. Keyword extraction is a common technique used to explore linguistic patterns in the corpus linguistic field, and Dunning’s Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistic-based NLP model. While prior research has confirmed the widespread applicability of keyword extraction in corpus-based research, LLT has certain limitations that may impact the accuracy of keyword extraction in such research. This paper summarized the limitations of LLT, which include benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address these limitations, this paper proposed an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify the applicability of the proposed method, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and a comparison was conducted with the traditional method. The experimental results indicated that the proposed method could effectively overcome the limitations of the traditional method and demonstrated the feasibility and practicality of incorporating the TF-IDF algorithm into relevant corpus-based research.

改进传统语料库研究中关键词提取的扩展 TF-IDF 方法:以气候变化语料库为例
关键词提取涉及应用文本挖掘领域开发的自然语言处理(NLP)算法或模型。关键词提取是语料库语言学领域探索语言模式的常用技术,而邓宁对数似然检验(LLT)作为一种基于统计的 NLP 模型,早已被整合到语料库软件中。虽然之前的研究已经证实了关键词提取在基于语料库的研究中的广泛适用性,但 LLT 存在一定的局限性,可能会影响此类研究中关键词提取的准确性。本文总结了 LLT 的局限性,其中包括基准语料干扰、消除语法词和通用词、考虑子语料相关性、特征选择的灵活性以及对不同研究目标的适应性。针对这些局限性,本文提出了一种扩展的词频-反向文档频率(TF-IDF)方法。为了验证该方法的适用性,本文使用了 Web of Science(WOS)数据库中 20 篇关于气候变化的高被引研究文章作为目标语料,并与传统方法进行了比较。实验结果表明,所提出的方法能有效克服传统方法的局限性,并证明了将 TF-IDF 算法纳入基于语料库的相关研究的可行性和实用性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Data & Knowledge Engineering
Data & Knowledge Engineering 工程技术-计算机:人工智能
CiteScore
5.00
自引率
0.00%
发文量
66
审稿时长
6 months
期刊介绍: Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between these two related fields of interest. DKE reaches a world-wide audience of researchers, designers, managers and users. The major aim of the journal is to identify, investigate and analyze the underlying principles in the design and effective use of these systems.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信