Efficient computation of comprehensive statistical information of large OWL datasets: a scalable approach
Heba Mohamed, S. Fathalla, Jens Lehmann, Hajira Jabeen
Enterprise Information Systems (impact factor 4.4; JCR Q1, Computer Science, Information Systems), published 2022-04-24
DOI: https://doi.org/10.1080/17517575.2022.2062683
Citations: 1
Abstract
Computing dataset statistics is crucial for exploring a dataset's structure; however, it becomes challenging for large-scale datasets. Such statistics have several key benefits, supporting link target identification, vocabulary reuse, quality analysis, big data analytics, and coverage analysis. In this paper, we present the first attempt to develop a distributed approach (OWLStats) for collecting comprehensive statistics over large-scale OWL datasets. OWLStats is a distributed in-memory approach that uses Apache Spark to compute 50 statistical criteria for OWL datasets. We have successfully integrated OWLStats into the SANSA framework. Experimental results show that OWLStats scales linearly with both the number of nodes and the data size.
About the journal:
Enterprise Information Systems (EIS) focusses on both the technical and applications aspects of EIS technology, and the complex and cross-disciplinary problems of enterprise integration that arise in integrating extended enterprises in a contemporary global supply chain environment. Techniques developed in mathematical science, computer science, manufacturing engineering, and operations management used in the design or operation of EIS will also be considered.