Schema-independent scientific data cataloging framework

2015 Moratuwa Engineering Research Conference (MERCon) Pub Date : 2015-04-07 DOI:10.1109/MERCON.2015.7112361

Supun Nakandala, S. Withana, D. Kumarasiri, H. Jayawardena, H. D. Dilum Bandara, S. Perera, S. Marru, Sudhakar Pamidighantam


Citations: 1

Abstract

Modern scientific experiments generate vast volumes of data that are hard to keep track of. Consequently, scientists find it difficult to reuse and share these data sets. We address this problem by developing a schema-independent data cataloging framework for efficient management of scientific data. The proposed solution consists of an agent, which automatically identifies new data products and extracts metadata from them, and a server, which indexes the metadata using a NoSQL database and provides a REST API for querying, sharing, and reusing the data sets. The novelty of our solution lies in the pluggable metadata extraction logic, extensible data product generation monitors, use of a NoSQL database, and the ability to dynamically add new metadata fields. The use of Apache Solr as the backend database enables the proposed solution to index and search data products much faster than a solution based on relational databases. For example, our Apache Solr-based implementation can resolve full-text, sub-string, prefix, and suffix queries 91% to 99% faster than a MySQL-based implementation.
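The idea of pluggable metadata extraction combined with schema-free indexing can be illustrated with a minimal sketch. This is not the authors' implementation: the names `register`, `extract_log`, and `Catalog` are hypothetical, the "key = value" log format is an assumption, and a plain in-memory list stands in for the Apache Solr backend, so it says nothing about the reported performance.

```python
import os
from typing import Callable, Dict, List

# Hypothetical plugin registry: file extension -> metadata extractor.
Extractor = Callable[[str], Dict[str, str]]
EXTRACTORS: Dict[str, Extractor] = {}


def register(extension: str) -> Callable[[Extractor], Extractor]:
    """Decorator that plugs an extractor in for one file extension."""
    def decorator(func: Extractor) -> Extractor:
        EXTRACTORS[extension] = func
        return func
    return decorator


@register(".log")
def extract_log(text: str) -> Dict[str, str]:
    """Assumed format: every 'key = value' line becomes a metadata field."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields


class Catalog:
    """In-memory stand-in for the index; documents need no fixed schema."""

    def __init__(self) -> None:
        self.docs: List[Dict[str, str]] = []

    def index(self, name: str, text: str) -> Dict[str, str]:
        # Pick the extractor by extension; unknown types get no extra fields.
        extractor = EXTRACTORS.get(os.path.splitext(name)[1])
        doc = {"name": name, **(extractor(text) if extractor else {})}
        self.docs.append(doc)
        return doc

    def search(self, field: str, term: str, mode: str = "substring") -> List[str]:
        # Linear scan over all documents; a real index would avoid this.
        matchers = {
            "substring": lambda value: term in value,
            "prefix": lambda value: value.startswith(term),
            "suffix": lambda value: value.endswith(term),
        }
        match = matchers[mode]
        return [d["name"] for d in self.docs if field in d and match(d[field])]


catalog = Catalog()
catalog.index("run1.log", "molecule = benzene\nbasis = 6-31G")
catalog.index("run2.log", "molecule = toluene\nsolvent = water")
print(catalog.search("molecule", "ene", mode="suffix"))
print(catalog.search("solvent", "wat", mode="prefix"))
```

In this sketch the second document introduces a `solvent` field that the first one lacks, and no schema change is needed to index or query it, which is the property the abstract refers to as dynamically adding new metadata fields.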

