An extended TF-IDF method for improving keyword extraction in traditional corpus-based research: An example of a climate change corpus
Author: Liang-Ching Chen
Journal: Data & Knowledge Engineering, Volume 153, Article 102322
Publication date: 2024-05-30
Publication type: Journal Article
DOI: 10.1016/j.datak.2024.102322
URL: https://www.sciencedirect.com/science/article/pii/S0169023X24000466
Impact factor: 2.7
JCR: Q3 (Computer Science, Artificial Intelligence)
Open access: no
Citations: 0
Abstract
Keyword extraction applies Natural Language Processing (NLP) algorithms or models developed in text mining. It is a common technique for exploring linguistic patterns in corpus linguistics, and Dunning’s Log-Likelihood Test (LLT) has long been integrated into corpus software as a statistics-based NLP model. Although prior research has confirmed the wide applicability of keyword extraction in corpus-based research, LLT has limitations that may reduce the accuracy of keyword extraction in such research. This paper summarizes those limitations, which concern benchmark corpus interference, elimination of grammatical and generic words, consideration of sub-corpus relevance, flexibility in feature selection, and adaptability to different research goals. To address them, the paper proposes an extended Term Frequency-Inverse Document Frequency (TF-IDF) method. To verify its applicability, 20 highly cited research articles on climate change from the Web of Science (WOS) database were used as the target corpus, and the proposed method was compared with the traditional method. The experimental results indicate that the proposed method can effectively overcome the limitations of the traditional method, and they demonstrate the feasibility and practicality of incorporating the TF-IDF algorithm into corpus-based research.
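To make the two families of methods named in the abstract concrete, the following is a minimal sketch of each. It is not the paper's extended method, whose details the abstract does not give. It assumes the two-cell log-likelihood keyness formula commonly used in corpus software and the textbook TF-IDF with a natural-log IDF. The function names, the toy documents, and the whitespace tokenization are illustrative assumptions.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Two-cell log-likelihood keyness score (assumed formulation).

    a, b: frequency of the word in the target / reference (benchmark) corpus
    c, d: total number of tokens in the target / reference corpus
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def tf_idf(docs):
    """Textbook TF-IDF. docs is a list of token lists (one per sub-corpus).

    Returns one {term: score} dict per document.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append(
            {t: (k / total) * math.log(n / df[t]) for t, k in counts.items()}
        )
    return scores


# Hypothetical toy documents standing in for sub-corpora.
docs = [
    "the climate is warming and the ice is melting".split(),
    "the emissions of carbon drive the warming".split(),
    "the policy on carbon is the focus".split(),
]
scores = tf_idf(docs)

# "the" occurs in every document, so its IDF is ln(3/3) = 0 and its score is 0.
print(scores[0]["the"])
# Top-ranked terms for the first document.
print(sorted(scores[0], key=scores[0].get, reverse=True)[:3])
```

The sketch shows one contrast the abstract points to: the log-likelihood score needs a reference corpus (`b` and `d`), whereas TF-IDF uses only the documents of the target corpus and pushes words that appear in every document toward zero.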
About the journal:
Data & Knowledge Engineering (DKE) stimulates the exchange of ideas and interaction between the two related fields named in its title, data engineering and knowledge engineering. DKE reaches a worldwide audience of researchers, designers, managers, and users. The journal's major aim is to identify, investigate, and analyze the principles underlying the design and effective use of data and knowledge systems.