Reviewing technical approaches for sharing and preservation of experimental data

International Symposium on Empirical Software Engineering and Measurement
Pub Date: 2014-09-18
DOI: 10.1145/2652524.2652600

C. Efraín R. Fonseca, Óscar Dieste Tubío, Natalia Juristo Juzgado, Estefanía Serral, S. Biffl



Abstract

Context: Empirical Software Engineering (ESE) replication researchers need to store and manipulate experimental data for several purposes, in particular analysis and reporting. Current research needs also call for sharing and preserving experimental data. In previous work, we analyzed Replication Data Management (RDM) needs and proposed a novel concept, the Empirical Ecosystem, to address current deficiencies in RDM approaches. The Empirical Ecosystem provides replication researchers with a common framework that transparently integrates local heterogeneous data sources. A typical situation where it applies is when several members of a research group, or several collaborating research groups, need to share and access each other's experimental results. However, to apply the Empirical Ecosystem concept and deliver all its promised benefits, it is necessary to analyze the software architectures and tools that can properly support it.

Goal: Identify the most appropriate technologies for implementing the Empirical Ecosystem concept.

Method: For technology identification, four features are particularly relevant: volume of data, architecture, data semantics, and manipulation facilities. These features were surveyed, by means of a systematic literature review, in repositories and in data sharing and preservation tools used in the sciences.

Results: Seventeen sharing and preservation tools reported in the literature were identified. The fields of Genomics and Proteomics, and secondarily Biology, stand out. Given the importance of those disciplines in today's science and economy, it would not be surprising if many other proprietary tools had gone unnoticed. Regarding repositories, hundreds are available on the Internet, with either public or restricted access. Typically, they aim at benchmarking, or at reanalysis and synthesis of existing empirical studies. Most repositories (both in number and importance) belong to the "hard sciences" (e.g. biology, physics), but virtually every research area is represented, including ESE.

Most tools and repositories use relational databases for data storage, with very few exceptions. When the amount of stored data is very large (e.g. Genomics), relational databases are being replaced by big data management infrastructures such as Apache™ Hadoop®. Relational databases are also used when data are distributed. Global conceptual models guarantee interoperability among the different data sources. When data are heterogeneous, the situation is more complex. Standard conceptual schemas may not be useful, because the semantics of the local data do not necessarily agree with the meaning assigned by the global schema. Likewise, large parts of the conceptual schema may not be applicable to local data sources, and the links among local models may not be easy to define. The current trend is to abandon classical conceptual schemas (e.g. entity-relationship) and to standardize the vocabulary of the domain using ontologies.

Manipulation facilities are almost invariably offered through web portals. In some cases, repositories also provide web services that give access to data for e-science purposes.

Conclusions: The review of the technologies used to implement repositories and sharing and preservation tools in the sciences shows that common, well-known technologies (particularly relational databases) can be used to implement the Empirical Ecosystem concept. The only exception is the semantic integration of local models: instead of comprehensive, global conceptual schemas, ontologies are increasingly being used for semantic integration.
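
The contrast drawn in the abstract between a single global schema and a shared domain vocabulary can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes two hypothetical local data sources (`group_a`, `group_b`) whose column names differ, and a small invented vocabulary (`ese:Subject`, `ese:Treatment`, `ese:Effectiveness`) standing in for ontology terms. Each source maps only the columns whose meaning it shares, so no source has to fit a comprehensive global schema.

```python
# Hypothetical shared vocabulary; the term names are invented for illustration
# and stand in for concepts that a real domain ontology would define.
SHARED_TERMS = {"ese:Subject", "ese:Treatment", "ese:Effectiveness"}

# Each (hypothetical) local source maps its own column names to shared terms.
# A column mapped to None has no agreed meaning and is left out of the
# integrated view instead of being forced into a global schema.
LOCAL_MAPPINGS = {
    "group_a": {
        "participant": "ese:Subject",
        "technique": "ese:Treatment",
        "defects_found_pct": "ese:Effectiveness",
    },
    "group_b": {
        "subj_id": "ese:Subject",
        "method": "ese:Treatment",
        "score": "ese:Effectiveness",
        "room": None,
    },
}


def to_shared(source, record):
    """Re-express one local record using the shared vocabulary."""
    mapping = LOCAL_MAPPINGS[source]
    shared = {}
    for column, value in record.items():
        term = mapping.get(column)
        if term is None:
            continue  # unmapped or purely local column: skip it
        if term not in SHARED_TERMS:
            raise ValueError(f"{source}.{column} maps to unknown term {term}")
        shared[term] = value
    return shared


def integrate(sources):
    """Merge records from all local sources into one list of shared-term rows."""
    merged = []
    for source, records in sources.items():
        for record in records:
            row = to_shared(source, record)
            row["source"] = source  # keep provenance of every row
            merged.append(row)
    return merged


if __name__ == "__main__":
    # Made-up sample records, used only to exercise the sketch.
    sample = {
        "group_a": [{"participant": "P1", "technique": "EP", "defects_found_pct": 40}],
        "group_b": [{"subj_id": "S9", "method": "BT", "score": 55, "room": "B12"}],
    }
    for row in integrate(sample):
        print(row)
```

In this sketch the local sources keep their own structure, and only the mapping tables need to change when a group adds or renames a column. A real system would typically express the vocabulary in an ontology language and keep the local data in relational databases, as the abstract describes.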

