Yrjö Lappalainen, Matti Lassila, Tanja Heikkilä, Jani Nieminen, Tapani Lehtilä
Migrating 120,000 Legacy Publications from Several Systems into a Current Research Information System Using Advanced Data Wrangling Techniques
Publications (journal article), published 2023-11-14. DOI: 10.3390/publications11040049 (https://doi.org/10.3390/publications11040049)
Impact factor: 4.6. JCR: Q1 (Information Science & Library Science).
Citations: 0
Abstract
This article describes a complex CRIS (current research information system) implementation project involving the migration of around 120,000 legacy publication records from three different systems. The project, undertaken by Tampere University, encountered several challenges in data diversity, data quality, and resource allocation. To handle the extensive and heterogeneous dataset, innovative approaches such as machine learning techniques and various data wrangling tools were used to process data, correct errors, and merge information from different sources. Despite significant delays and unforeseen obstacles, the project was ultimately successful in achieving its goals. The project served as a valuable learning experience, highlighting the importance of data quality and standardized practices, and the need for dedicated resources in handling complex data migration projects in research organizations. This study stands out for its comprehensive documentation of the data wrangling and migration process, which has been less explored in the context of CRIS literature.
Publications (Social Sciences: Library and Information Sciences)
CiteScore: 6.50
Self-citation rate: 1.90%
Articles published: 40
Review time: 11 weeks
About the journal:
The scope of Publications includes: theory and practice of scholarly communication; digitisation and innovations in scholarly publishing technologies; metadata, infrastructure, and linking the scholarly record; publishing policies and editorial/peer-review workflows; financial models for scholarly publishing; copyright, licensing and legal issues in publishing; research integrity and publication ethics; issues and best practices in the publication of non-traditional research outputs (e.g., data, software/code, protocols, data management plans, grant proposals, etc.); issues in the transition to open access and open science; inclusion and participation of traditionally excluded actors; language issues in publication processes and products; traditional and alternative models of peer review; traditional and alternative means of assessment and evaluation of research and its impact, including bibliometrics and scientometrics; and the place of research libraries, scholarly societies, funders and others in scholarly communication.