Data Intelligence. Pub Date: 2022-11-01. DOI: 10.1109/ICKG55886.2022.00049
Xindong Wu, Hao Chen, Chenyang Bu, Shengwei Ji, Zan Zhang, Victor S. Sheng
{"title":"HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets","authors":"Xindong Wu, Hao Chen, Chenyang Bu, Shengwei Ji, Zan Zhang, Victor S. Sheng","doi":"10.1109/ICKG55886.2022.00049","DOIUrl":"https://doi.org/10.1109/ICKG55886.2022.00049","url":null,"abstract":"ABSTRACT Spreadsheets contain a lot of valuable data and have many practical applications. The key technology of these practical applications is how to make machines understand the semantic structure of spreadsheets, e.g., identifying cell function types and discovering relationships between cell pairs. Most existing methods for understanding the semantic structure of spreadsheets do not make use of the semantic information of cells. A few studies do, but they ignore the layout structure information of spreadsheets, which affects the performance of cell function classification and the discovery of different relationship types of cell pairs. In this paper, we propose a Heuristic algorithm for Understanding the Semantic Structure of spreadsheets (HUSS). Specifically, for improving the cell function classification, we propose an error correction mechanism (ECM) based on an existing cell function classification model [11] and the layout features of spreadsheets. For improving the table structure analysis, we propose five types of heuristic rules to extract four different types of cell pairs, based on the cell style and spatial location information. 
Our experimental results on five real-world datasets demonstrate that HUSS can effectively understand the semantic structure of spreadsheets and outperforms corresponding baselines.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"537-559"},"PeriodicalIF":3.9,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43443624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingfang Wu, S. Richard, C. Verhey, L. J. Castro, Baptiste Cecconi, N. Juty
{"title":"An Analysis of Crosswalks from Research Data Schemas to Schema.org","authors":"Mingfang Wu, S. Richard, C. Verhey, L. J. Castro, Baptiste Cecconi, N. Juty","doi":"10.1162/dint_a_00186","DOIUrl":"https://doi.org/10.1162/dint_a_00186","url":null,"abstract":"ABSTRACT The increased number of data repositories has greatly increased the availability of open data. To enable broad discovery of and access to research datasets, some data repositories have begun leveraging the web architecture by embedding structured metadata markup in dataset web landing pages using vocabularies from Schema.org and extensions. This paper aims to examine metadata interoperability for supporting global data discovery. Specifically, the paper reports a survey on which metadata schemas have been adopted by participating data repositories, and presents an analysis of crosswalks from fourteen research data schemas to Schema.org. The analysis indicates that most descriptive metadata are interoperable among the schemas, that the most inconsistent mapping is the rights metadata, and that a large gap exists in the structural metadata and controlled vocabularies to specify various property values. The analysis and collated crosswalks can serve as a reference for data repositories when they develop crosswalks from their own schemas to Schema.org, and provide the research data community a benchmark of structured metadata implementation.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"100-121"},"PeriodicalIF":3.9,"publicationDate":"2022-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49610991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FAIR Equivalency in Indonesia's Digital Health Framework","authors":"Putu Hadi Purnama Jati","doi":"10.1162/dint_a_00171","DOIUrl":"https://doi.org/10.1162/dint_a_00171","url":null,"abstract":"Abstract The objective of this study was to assess the regulatory framework for health data in Indonesia in order to understand the policy context and explore the possibility of expanding the adoption and implementation of the FAIR Guidelines, which state that data should be Findable, Accessible, Interoperable and Reusable (FAIR), in Indonesia. Although the FAIR Guidelines were not explicitly mentioned in any of the policy documents relevant to the Indonesian digital health sector, six out of the eight documents analysed contained FAIR Equivalent principles. In particular, Indonesia's Population Identification Number (NIK) has the potential, as a unique identifier, to support the integration and interoperability (findability) of data, which is crucial to all other aspects of the FAIR Guidelines. There is also a plan to build standards and protocols into the implementation of information systems in each ministry and government agency to improve data accessibility (accessibility), the integration of the various information systems is planned/ongoing (interoperability), and the need for a standardised arrangement for health information systems related to health data following the community standard is recognised (reusability). The documents at the core of Indonesia's digital health/eHealth policy have the highest FAIR Equivalency Score (FE-Score), showing some degree of alignment between the Indonesian digital health implementation vision and the FAIR Guidelines. 
This indicates that Indonesia's digital health sector is open to using the FAIR Guidelines.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"798-812"},"PeriodicalIF":3.9,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64532083","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. d’Aquin, Fabian Kirstein, Daniela Oliveira, Sonja Schimmler, Sebastian Urbanek
{"title":"FAIREST: A Framework for Assessing Research Repositories","authors":"M. d’Aquin, Fabian Kirstein, Daniela Oliveira, Sonja Schimmler, Sebastian Urbanek","doi":"10.1162/dint_a_00159","DOIUrl":"https://doi.org/10.1162/dint_a_00159","url":null,"abstract":"ABSTRACT The open science movement has gained significant momentum within the last few years. This comes along with the need to store and share research artefacts, such as publications and research data. For this purpose, research repositories need to be established. A variety of solutions exist for implementing such repositories, covering diverse features, ranging from custom depositing workflows to social media-like functions. In this article, we introduce the FAIREST principles, a framework inspired by the well-known FAIR principles, but designed to provide a set of metrics for assessing and selecting solutions for creating digital repositories for research artefacts. The goal is to support decision makers in choosing such a solution when planning for a repository, especially at an institutional level. The metrics included are therefore based on two pillars: (1) an analysis of established features and functionalities, drawn from existing dedicated, general purpose and commonly used solutions, and (2) a literature review on general requirements for digital repositories for research artefacts and related systems. 
We further describe an assessment of 11 widespread solutions, with the goal to provide an overview of the current landscape of research data repository solutions, identifying gaps and research challenges to be addressed.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"202-241"},"PeriodicalIF":3.9,"publicationDate":"2022-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47618941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ousamma Mohammed Benhamed, K. Burger, R. Kaliyaperumal, Luiz Olavo Bonino da Silva Santos, M. Suchánek, Jan Slifka, Mark D. Wilkinson
{"title":"The FAIR Data Point: Interfaces and Tooling","authors":"Ousamma Mohammed Benhamed, K. Burger, R. Kaliyaperumal, Luiz Olavo Bonino da Silva Santos, M. Suchánek, Jan Slifka, Mark D. Wilkinson","doi":"10.1162/dint_a_00161","DOIUrl":"https://doi.org/10.1162/dint_a_00161","url":null,"abstract":"ABSTRACT While the FAIR Principles do not specify a technical solution for ‘FAIRness’, it was clear from the outset of the FAIR initiative that it would be useful to have commodity software and tooling that would simplify the creation of FAIR-compliant resources. The FAIR Data Point is a metadata repository that follows the DCAT(2) schema, and utilizes the Linked Data Platform to manage the hierarchical metadata layers as LDP Containers. There has been a recent flurry of development activity around the FAIR Data Point that has significantly improved its power and ease-of-use. Here we describe five specific tools—an installer, a loader, two Web-based interfaces, and an indexer—aimed at maximizing the uptake and utility of the FAIR Data Point.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"184-201"},"PeriodicalIF":3.9,"publicationDate":"2022-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49453130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
I. Ivánová, R. Keenan, Christopher Marshall, Lori Mancell, E. Rubinov, R. Ruddick, Nicholas Brown, Graeme Kernich
{"title":"FAIR data and metadata: GNSS precise positioning user perspective","authors":"I. Ivánová, R. Keenan, Christopher Marshall, Lori Mancell, E. Rubinov, R. Ruddick, Nicholas Brown, Graeme Kernich","doi":"10.1162/dint_a_00185","DOIUrl":"https://doi.org/10.1162/dint_a_00185","url":null,"abstract":"ABSTRACT The FAIR principles of Wilkinson et al. [1] are finding their way from research into application domains, one of which is precise positioning with global satellite navigation systems (GNSS). Current GNSS users demand that data and services are findable online, accessible via open protocols (by both machines and humans), interoperable with their legacy systems and reusable in various settings. Comprehensive metadata are essential for seamless communication between GNSS data and service providers and their users, and, for decades, geodetic and geospatial standards have been efficiently implemented to support this. However, the GNSS user community is shifting from highly specialised precise positioning by geodetic professionals to everyday precise positioning by autonomous vehicles or wellness-obsessed citizens. Moreover, rapid technological developments allow alternative ways of offering data and services to their users. These changing circumstances warrant a review of whether metadata defined in the generic geospatial and geodetic standards in use still support FAIR use of modern GNSS data and services across its novel user spectrum. This paper reports the results of current GNSS users’ requirements in various application sectors on the way data, metadata and services are provided. We engaged with GNSS stakeholders to validate our findings and to gain an understanding of their perception of the FAIR principles. Our results confirm that offering FAIR GNSS data and services is fundamental, but for a confident use of these, there is a need to review the way metadata are offered to the community. 
Defining a standard-compliant GNSS community metadata profile and providing relevant metadata with data on demand, the approach outlined in this paper, is a way to manage current GNSS users’ expectations and to improve FAIR GNSS data and service delivery for both humans and machines.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"43-74"},"PeriodicalIF":3.9,"publicationDate":"2022-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45440698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mingfang Wu, Hans Brandhorst, M. Marinescu, J. M. López, Marjorie M. K. Hlava, J. Busch
{"title":"Automated metadata annotation: What is and is not possible with machine learning","authors":"Mingfang Wu, Hans Brandhorst, M. Marinescu, J. M. López, Marjorie M. K. Hlava, J. Busch","doi":"10.1162/dint_a_00162","DOIUrl":"https://doi.org/10.1162/dint_a_00162","url":null,"abstract":"ABSTRACT Automated metadata annotation is only as good as the training dataset or rules that are available for the domain. It's important to learn what type of data content a pre-trained machine learning algorithm has been trained on to understand its limitations and potential biases. Consider what type of content is readily available to train an algorithm: what's popular and what's available. However, scholarly and historical content is often not available in consumable, homogenized, and interoperable formats at the large volume that is required for machine learning. There are exceptions such as science and medicine, where large, well documented collections are available. This paper presents the current state of automated metadata annotation in cultural heritage and research data, discusses challenges identified from use cases, and proposes solutions.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"5 1","pages":"122-138"},"PeriodicalIF":3.9,"publicationDate":"2022-09-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47658196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruduan Plug, Yan Liang, Aliya Aktau, Mariam Basajja, Francisca Onaolapo Oladipo, M. van Reisen
{"title":"Terminology for a FAIR Framework for the Virus Outbreak Data Network-Africa","authors":"Ruduan Plug, Yan Liang, Aliya Aktau, Mariam Basajja, Francisca Onaolapo Oladipo, M. van Reisen","doi":"10.1162/dint_a_00167","DOIUrl":"https://doi.org/10.1162/dint_a_00167","url":null,"abstract":"Abstract The field of health data management poses unique challenges in relation to data ownership, the privacy of data subjects, and the reusability of data. The FAIR Guidelines have been developed to address these challenges. The Virus Outbreak Data Network (VODAN) architecture builds on these principles, using the European Union's General Data Protection Regulation (GDPR) framework to ensure compliance with local data regulations, while using information knowledge management concepts to further improve data provenance and interoperability. In this article we provide an overview of the terminology used in the field of FAIR data management, with a specific focus on FAIR compliant health information management, as implemented in the VODAN architecture.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"698-723"},"PeriodicalIF":3.9,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47351158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sakinat Folorunso, E. Ogundepo, Mariam Basajja, Joseph Awotunde, A. Kawu, Francisca Onaolapo Oladipo, Ibrahim Abdullahi
{"title":"FAIR Machine Learning Model Pipeline Implementation of COVID-19 Data","authors":"Sakinat Folorunso, E. Ogundepo, Mariam Basajja, Joseph Awotunde, A. Kawu, Francisca Onaolapo Oladipo, Ibrahim Abdullahi","doi":"10.1162/dint_a_00182","DOIUrl":"https://doi.org/10.1162/dint_a_00182","url":null,"abstract":"Abstract Research and development are gradually becoming data-driven and the implementation of the FAIR Guidelines (that data should be Findable, Accessible, Interoperable, and Reusable) for scientific data administration and stewardship has the potential to remarkably enhance the framework for the reuse of research data. In this way, FAIR is aiding digital transformation. The ‘FAIRification’ of data increases the interoperability and (re)usability of data, so that new and robust analytical tools, such as machine learning (ML) models, can access the data to deduce meaningful insights, extract actionable information, and identify hidden patterns. This article aims to build a FAIR ML model pipeline using the generic FAIRification workflow to make the whole ML analytics process FAIR. Accordingly, FAIR input data was modelled using a FAIR ML model. The output data from the FAIR ML model was also made FAIR. For this, a hybrid hierarchical k-means (HHK) clustering ML algorithm was applied to group the data into homogeneous subgroups and ascertain the underlying structure of the data using a Nigerian-based FAIR dataset that contains data on economic factors, healthcare facilities, and coronavirus occurrences in all the 36 states of Nigeria. 
The model showed that research data and the ML pipeline can be FAIRified, shared, and reused by following the proposed FAIRification workflow and implementing technical architecture.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"971-990"},"PeriodicalIF":3.9,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"45554526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Francisca Onaolapo Oladipo, Sakinat Folorunso, E. Ogundepo, Obinna Osigwe, A. Akindele
{"title":"Curriculum Development for FAIR Data Stewardship","authors":"Francisca Onaolapo Oladipo, Sakinat Folorunso, E. Ogundepo, Obinna Osigwe, A. Akindele","doi":"10.1162/dint_a_00183","DOIUrl":"https://doi.org/10.1162/dint_a_00183","url":null,"abstract":"Abstract The FAIR Guidelines attempt to make digital data Findable, Accessible, Interoperable, and Reusable (FAIR). To prepare FAIR data, a new data science discipline known as data stewardship is emerging and, as the FAIR Guidelines gain more acceptance, an increase in the demand for data stewards is expected. Consequently, there is a need to develop curricula to foster professional skills in data stewardship through effective knowledge communication. There have been a number of initiatives aimed at bridging the gap in FAIR data management training through both formal and informal programmes. This article describes the experience of developing a digital initiative for FAIR data management training under the Digital Innovations and Skills Hub (DISH) project. The FAIR Data Management course offers 6 short on-demand certificate modules over 12 weeks. The modules are divided into two sets: FAIR data and data science. The core subjects cover elementary topics in data science, regulatory frameworks, FAIR data management, intermediate to advanced topics in FAIR Data Point installation, and FAIR data in the management of healthcare and semantic data. Each week, participants are required to devote 7–8 hours of self-study to the modules, based on the resources provided. Once they have satisfied all requirements, students are certified as FAIR data scientists and qualified to serve as both FAIR data stewards and analysts. 
It is expected that in-depth and focused curricula development with diverse participants will build a core of FAIR data scientists for Data Competence Centres and encourage the rapid adoption of the FAIR Guidelines for research and development.","PeriodicalId":34023,"journal":{"name":"Data Intelligence","volume":"4 1","pages":"991-1012"},"PeriodicalIF":3.9,"publicationDate":"2022-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43285441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}