Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Journal of Data and Information Science (Warsaw, Poland). Pub Date: 2021-06-01. DOI: 10.2478/jdis-2021-0024

Sahand Vahidnia, A. Abbasi, H. Abbass


Citations: 3

Abstract

Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions about establishing scientific fields. It also supports better collaboration with governments and businesses. This study investigates the development of research fields over time by framing it as a topic detection problem.

Design/methodology/approach: We propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches transform the documents into vector-based representations. The proposed method is evaluated on a benchmark dataset against combinations of different embedding and clustering approaches and against classical topic modeling algorithms (i.e., LDA). A case study also explores the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in AI-related publications.

Findings: Clustering performance indicators show that the proposed method outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, relying on a keyword extraction method for cluster tagging and labeling to demonstrate the context of the topics.

Research limitations: We found that no single solution generalizes to all downstream tasks. Hence, solutions must be fine-tuned or optimized for each task and even for each dataset. In addition, the interpretation of cluster labels can be subjective and vary with readers' opinions. Labeling techniques are also very difficult to evaluate, which further limits the explanation of the clusters.

Practical implications: The case study shows, in a real-world example, how the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. This helps the scientific community and related organizations analyze fields quickly and effectively by establishing and explaining the topics.

Originality/value: We introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction. We also use a concept extraction method as a labeling approach. The effectiveness of the method is evaluated in a case study of AI publications, in which we analyze AI topics over the past three decades.
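
The abstract does not spell out how the modified deep clustering works, so the following is only a minimal sketch of the soft-assignment step in standard deep embedding clustering (DEC) as commonly described in the literature, not the authors' modified variant. It assumes the document embeddings already exist; the toy 2-D vectors stand in for Doc2Vec vectors, and the function names (`soft_assign`, `target_distribution`, `kl_divergence`) are hypothetical names chosen for illustration.

```python
import math

def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment: q[i][j] is the probability that
    document i belongs to cluster j (rows sum to 1)."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            d2 = sum((a - b) ** 2 for a, b in zip(z, mu))  # squared distance
            row.append((1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q

def target_distribution(q):
    """Sharpened target p: square q, divide by the soft cluster
    frequency, then renormalize each row."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        w = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(w)
        p.append([v / total for v in w])
    return p

def kl_divergence(p, q):
    """KL(P || Q), the clustering loss that DEC training minimizes."""
    return sum(
        pij * math.log(pij / qij)
        for prow, qrow in zip(p, q)
        for pij, qij in zip(prow, qrow)
        if pij > 0
    )

# Toy stand-ins for document embeddings (assumption: 2-D for readability).
embeddings = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
# Initial centroids, e.g. as k-means would provide them (assumed values).
centroids = [[0.0, 0.0], [5.0, 5.0]]

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
labels = [row.index(max(row)) for row in q]  # hard topic label per document
print(labels, loss)
```

In a full DEC setup the loss would be backpropagated to update both the encoder and the centroids; here the step is evaluated once to show how soft assignments turn into topic labels.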
