Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha
{"title":"Collaborative analytics for data silos","authors":"Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha","doi":"10.1109/ICDE.2016.7498286","DOIUrl":null,"url":null,"abstract":"As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001-0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.","PeriodicalId":6883,"journal":{"name":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","volume":"152 1","pages":"743-754"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE 32nd International Conference on Data Engineering (ICDE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDE.2016.7498286","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12
Abstract
As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001-0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.