Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph.

IF 2 3区工程技术 Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Journal of Biomedical Semantics Pub Date : 2023-11-28 DOI:10.1186/s13326-023-00298-4

Gollam Rabby, Jennifer D'Souza, Allard Oelen, Lucie Dvorackova, Vojtěch Svátek, Sören Auer

{"title":"Impact of COVID-19 research: a study on predicting influential scholarly documents using machine learning and a domain-independent knowledge graph.","authors":"Gollam Rabby, Jennifer D'Souza, Allard Oelen, Lucie Dvorackova, Vojtěch Svátek, Sören Auer","doi":"10.1186/s13326-023-00298-4","DOIUrl":null,"url":null,"abstract":"<p><p>Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe our work that attempts to go beyond bibliometric metadata to predict influential scholarly documents. Furthermore, this work also examines the influential scholarly document prediction task over categorized scholarly documents. We also introduce a new approach to enhance the document representation method with a domain-independent knowledge graph to find the influential scholarly document using categorized scholarly content. As the input collection, we use the WHO corpus with scholarly documents on the theme of COVID-19. This study examines different document representation methods for machine learning, including TF-IDF, BOW, and embedding-based language models (BERT). The TF-IDF document representation method works better than others. From various machine learning methods tested, logistic regression outperformed the other for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction, with the help of a domain-independent knowledge graph, specifically DBpedia, to enhance the document representation method for predicting influential scholarly documents with categorical scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation method. We also enhance the BOW document representation with the direct type (RDF type) and unqualified relation from DBpedia. From this experiment, we did not find any impact of the enhanced document representation for the scholarly document category classification. We found an effect in the influential scholarly document prediction with categorical data.</p>","PeriodicalId":15055,"journal":{"name":"Journal of Biomedical Semantics","volume":"14 1","pages":"18"},"PeriodicalIF":2.0000,"publicationDate":"2023-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10683290/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Biomedical Semantics","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.1186/s13326-023-00298-4","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Multiple studies have investigated bibliometric features and uncategorized scholarly documents for the influential scholarly document prediction task. In this paper, we describe our work that attempts to go beyond bibliometric metadata to predict influential scholarly documents. Furthermore, this work also examines the influential scholarly document prediction task over categorized scholarly documents. We also introduce a new approach to enhance the document representation method with a domain-independent knowledge graph to find the influential scholarly document using categorized scholarly content. As the input collection, we use the WHO corpus with scholarly documents on the theme of COVID-19. This study examines different document representation methods for machine learning, including TF-IDF, BOW, and embedding-based language models (BERT). The TF-IDF document representation method works better than others. From various machine learning methods tested, logistic regression outperformed the other for scholarly document category classification, and the random forest algorithm obtained the best results for influential scholarly document prediction, with the help of a domain-independent knowledge graph, specifically DBpedia, to enhance the document representation method for predicting influential scholarly documents with categorical scholarly content. In this case, our study combines state-of-the-art machine learning methods with the BOW document representation method. We also enhance the BOW document representation with the direct type (RDF type) and unqualified relation from DBpedia. From this experiment, we did not find any impact of the enhanced document representation for the scholarly document category classification. We found an effect in the influential scholarly document prediction with categorical data.

Abstract Image

查看原文本刊更多论文

COVID-19研究的影响:使用机器学习和领域独立知识图预测有影响力的学术文献的研究。

许多研究调查了文献计量学特征和未分类的学术文献，以进行有影响力的学术文献预测任务。在本文中，我们描述了我们的工作，试图超越文献计量元数据来预测有影响力的学术文献。此外，本工作还研究了对分类学术文献有影响的学术文献预测任务。我们还提出了一种新的方法，利用领域无关的知识图来增强文献表示方法，利用分类的学术内容来寻找有影响力的学术文献。作为输入库，我们使用了世卫组织关于COVID-19主题的学术文献语料库。本研究考察了机器学习的不同文档表示方法，包括TF-IDF、BOW和基于嵌入的语言模型(BERT)。TF-IDF文档表示方法比其他方法效果更好。从测试的各种机器学习方法中，逻辑回归在学术文献类别分类方面表现优于其他方法，随机森林算法在有影响力的学术文献预测方面取得了最好的结果，借助领域无关的知识图，特别是DBpedia，增强了预测具有分类学术内容的有影响力的学术文献的文档表示方法。在这种情况下，我们的研究结合了最先进的机器学习方法和BOW文档表示方法。我们还使用直接类型(RDF类型)和来自DBpedia的不限定关系增强了BOW文档表示。从这个实验中，我们没有发现增强的文档表示对学术文档类别分类有任何影响。我们发现在有影响力的学术文献预测中使用分类数据有一定的效果。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Journal of Biomedical Semantics MATHEMATICAL & COMPUTATIONAL BIOLOGY-

CiteScore

4.20

自引率

5.30%

发文量

审稿时长

30 weeks

期刊介绍： Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas: Infrastructure for biomedical semantics: focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability. Semantic mining, annotation, and analysis: focusing on approaches and applications of semantic resources; and tools for investigation, reasoning, prediction, and discoveries in biomedicine.