An automatic and association-based procedure for hierarchical publication subject categorization

IF 3.4; Zone 2 (Management); JCR Q2 (Computer Science, Interdisciplinary Applications)

Journal of Informetrics; Pub Date: 2023-11-11; DOI: 10.1016/j.joi.2023.101466

Cristina Urdiales, Eduardo Guzmán


Citations: 0

Abstract

Subject categorization of scientific publications, i.e., journals, book series or conference proceedings, has become a major concern in academia, as publication impact and ranking are considered a basic criterion for evaluating paper quality. Publishers usually propose their own categorization, but they often include only their own publications, and their categories might not be coherent with other proposals. Also, due to the dynamic nature of science, new categories may frequently appear. As traditional mechanisms for categorization have been questioned by many authors, a new research line has emerged to improve the category assignment process. Approaches usually rely on assessing publication similarity in terms of topics, co-citation, editorial boards, and/or shared author profiles.

In this work, we propose a novel procedure for hierarchical categorization of scientific publications based on the repetition or absence of relevant descriptors in association rules among publications. The key idea is that publication categories can be automatically defined by strong associations of nuclear (i.e., core) topics. Also, some very specific subcategories can be defined by exclusion from any set of rules. This process can be used to construct a data-driven hierarchy of scientific publication categories from scratch or to improve any existing categorization by discovering new fields.

In this paper, the proposed algorithm uses the SJR descriptors of all journals in the SCImago dataset and the three-level classification in the Scopus dataset (covering only 35% of the publications in the SCImago dataset) to discover new categories and assign every journal to the resulting enhanced hierarchy. We have focused on the field of “Physical Sciences and Engineering”, using the SCImago and Scopus datasets from 2019 (30,883 scientific publications). Our procedure combines data engineering techniques with association rules and, as a result, generates potential new categories and outlier subcategories. To evaluate the suitability of our proposal, we have analyzed classification results based on the original category list and our extended two-level categorization via the Jensen–Shannon divergence and supervised machine-learning techniques. Results reveal the consistency and suitability of our categorization procedure.
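The key idea in the abstract (categories defined by strong associations among descriptors, and outlier subcategories defined by exclusion from every rule) can be illustrated with a minimal sketch. This is not the authors' implementation: it only mines pairwise rules using the standard support and confidence measures, and the function name `pairwise_rules`, the thresholds, and the toy journals and descriptors are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Mine single-descriptor rules lhs -> rhs as (lhs, rhs, support, confidence).

    support    = fraction of publications containing both descriptors
    confidence = count(lhs and rhs) / count(lhs)
    """
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for descriptors in transactions:
        unique = sorted(set(descriptors))  # sort so each pair has one canonical order
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical toy data: publication name -> list of descriptors.
publications = {
    "Journal A": ["robotics", "control", "artificial intelligence"],
    "Journal B": ["robotics", "control"],
    "Journal C": ["robotics", "control", "mechanical engineering"],
    "Journal D": ["optics", "photonics"],
    "Journal E": ["marine geology"],
}

rules = pairwise_rules(publications.values(), min_support=0.4, min_confidence=0.7)

# Descriptors that appear in at least one strong rule hint at a candidate category.
covered = {d for lhs, rhs, _, _ in rules for d in (lhs, rhs)}

# Publications sharing no descriptor with any rule are outlier candidates.
outliers = sorted(
    name for name, descs in publications.items() if not covered & set(descs)
)

for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
print("Outlier candidates:", outliers)
```

In this toy data, the strongly associated descriptors suggest one candidate category, while the publications left uncovered by any rule are the kind of case the abstract describes as subcategories defined by exclusion. A fuller treatment would mine multi-item rules with an Apriori-style algorithm and tune the thresholds on real data.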


Source journal

Journal of Informetrics (Social Sciences: Library and Information Sciences)

CiteScore: 6.40

Self-citation rate: 16.20%

Journal description: Journal of Informetrics (JOI) publishes rigorous high-quality research on quantitative aspects of information science. The main focus of the journal is on topics in bibliometrics, scientometrics, webometrics, patentometrics, altmetrics and research evaluation. Contributions studying informetric problems using methods from other quantitative fields, such as mathematics, statistics, computer science, economics and econometrics, and network science, are especially encouraged. JOI publishes both theoretical and empirical work. In general, case studies, for instance a bibliometric analysis focusing on a specific research field or a specific country, are not considered suitable for publication in JOI unless they contain innovative methodological elements.