Scalable Analysis of Open Data Graphs
Andrei Stoica, Michael Valdron, K. Pu
2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI), published 2019-07-01
DOI: https://doi.org/10.1109/IRI.2019.00059
Citation count: 0
Abstract
We study Open Data as a connected graph. Each data package is treated as a vertex, and we examine the similarity graph induced by each of several similarity measures. We analyze the resulting similarity graphs with different metrics to estimate their quality and informativeness. To cope with the size of the open data graph (over 6 billion edges), graph construction and analysis are carried out on a distributed computation framework, Apache Spark. The algorithms were implemented using the Spark resilient distributed data algebra and executed on the Google Cloud Platform (GCP).
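As a rough illustration of the idea of a similarity graph over data packages, the following single-machine Python sketch links two packages whenever their similarity reaches a threshold. The abstract does not name the similarity measures or metrics used, so the Jaccard measure, the keyword-set representation, the threshold value, the degree metric, and all package names here are assumptions chosen for illustration only. At the scale described above, the pairwise comparison would be distributed with Spark rather than run in a local loop.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_graph(packages, threshold):
    """Return weighted edges (u, v, score) for package pairs scoring at least `threshold`."""
    edges = []
    # Compare every unordered pair of packages once, in sorted name order.
    for (u, feats_u), (v, feats_v) in combinations(sorted(packages.items()), 2):
        score = jaccard(feats_u, feats_v)
        if score >= threshold:
            edges.append((u, v, score))
    return edges


def degrees(edges):
    """Count incident edges per vertex, one simple metric on the resulting graph."""
    counts = {}
    for u, v, _ in edges:
        counts[u] = counts.get(u, 0) + 1
        counts[v] = counts.get(v, 0) + 1
    return counts


# Hypothetical toy data: each package is described by a set of keywords.
packages = {
    "transit-routes": {"transit", "bus", "route", "city"},
    "bus-stops": {"transit", "bus", "stop", "city"},
    "air-quality": {"air", "pollution", "city"},
}

# Assumed threshold; only pairs with similarity >= 0.3 become edges.
edges = similarity_graph(packages, threshold=0.3)
```

Swapping `jaccard` for another function of two packages yields a different induced graph over the same vertices, which is the sense in which each similarity measure defines its own similarity graph.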