A clustering approach for data quality results of research information systems

IF 2.6 Q2 INFORMATION SCIENCE & LIBRARY SCIENCE

Information Discovery and Delivery, Pub Date: 2022-11-03, DOI: 10.1108/idd-07-2022-0063

Reza Edris Abadi, M. Ershadi, S. T. A. Niaki


Citations: 1

Abstract

Purpose
The overall goal of data mining is to extract information from a large data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, the quality of the data must be examined and the information divided into logical groupings before it is analyzed. Data quality results are also a valuable resource for defining the quality excellence programs of any information system. Hence, the purpose of this study is to discover and extract knowledge that can be used to evaluate and improve data quality in research information systems.

Design/methodology/approach
Clustering data and exploiting the outputs gives practitioners an in-depth and extensive look at their information and lets them form logical structures based on what they find. In this study, data extracted from an information system are used in the first stage. Then, the data quality results are classified into an organized structure based on data quality dimension standards. Next, K-Means clustering, density-based clustering (density-based spatial clustering of applications with noise [DBSCAN]) and hierarchical clustering (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied and compared to find the most appropriate clustering algorithm for the research information system.

Findings
The paper shows that the quality control results of an information system can be categorized through well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Furthermore, among the well-known clustering approaches compared, the hierarchical BIRCH algorithm performs best and gives the highest silhouette coefficient value. DBSCAN comes next and performs better than K-Means.

Research limitations/implications
In the data quality assessment process, the discrepancies identified and the lack of a proper classification for inconsistent data have led to unstructured reports. This makes statistical analysis of qualitative metadata problems difficult and rooting out the observed errors impossible. Therefore, in this study, the data quality evaluation results are categorized into data quality dimensions, on which multiple data mining analyses are then performed.

Originality/value
Although several studies have assessed the data quality results of research information systems, knowledge extraction from the obtained data quality scores is a crucial task that has rarely been studied in the literature. In addition, clustering data quality results and exploiting the outputs gives practitioners an in-depth and extensive look at their information, from which they can form logical structures.
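The abstract ranks the three algorithms by silhouette coefficient without defining it. As a rough illustration only, the sketch below computes the standard mean silhouette, s = (b - a) / max(a, b) per point, where a is the mean distance to the other members of the point's own cluster and b is the smallest mean distance to any other cluster. Euclidean distance, the function name `silhouette_coefficient` and the toy two-dimensional "quality score" points are assumptions made for this example and are not taken from the paper.

```python
from math import dist
from statistics import mean


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; labels[i] is the cluster of points[i]."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            # Common convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical data: four records scored on two quality dimensions.
points = [(0.9, 0.8), (1.0, 0.9), (0.1, 0.2), (0.2, 0.1)]

good = silhouette_coefficient(points, [0, 0, 1, 1])  # groups nearby points
bad = silhouette_coefficient(points, [0, 1, 0, 1])   # mixes distant points

print(f"compact labelling: {good:.3f}")
print(f"mixed labelling:   {bad:.3f}")
```

Values close to 1 indicate compact, well-separated clusters, and negative values indicate points that sit closer to another cluster than to their own. A comparison like the one in the abstract would compute this score once per algorithm on the same data and prefer the highest value. For real data sets, a library implementation such as scikit-learn's `silhouette_score` is likely a better choice than this quadratic-time sketch.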




Source journal

Information Discovery and Delivery (INFORMATION SCIENCE & LIBRARY SCIENCE)

CiteScore: 5.40

Self-citation rate: 4.80%

About the journal: Information Discovery and Delivery covers information discovery and access for digital information researchers. This includes educators, knowledge professionals in education and cultural organisations, knowledge managers in media, health care and government, as well as librarians. The journal publishes research and practice that explores the digital information supply chain, i.e. transport, flows, tracking, exchange and sharing, including within and between libraries. It is also interested in digital information capture, packaging and storage by ‘collectors’ of all kinds. Information is widely defined, including but not limited to: records, documents, learning objects, visual and sound files, data and metadata, and user-generated content.