Collaborative analytics for data silos

2016 IEEE 32nd International Conference on Data Engineering (ICDE), Pub Date: 2016-05-16, DOI: 10.1109/ICDE.2016.7498286

Jinkyu Kim, Heonseok Ha, Byung-Gon Chun, Sungroh Yoon, S. Cha

Pages: 743-754

Citations: 12

Abstract

As a great deal of data has been accumulated in various disciplines, the need for the integrative analysis of separate but relevant data sources is becoming more important. Combining data sources can provide global insight that is otherwise difficult to obtain from individual sources. Because of privacy, regulations, and other issues, many large-scale data repositories remain closed off from the outside, raising what has been termed the data silo issue. The huge volume of today's big data often leads to computational challenges, adding another layer of complexity to the solution. In this paper, we propose a novel method called collaborative analytics by ensemble learning (CABEL), which attempts to resolve the main hurdles regarding the silo issue: accuracy, privacy, and computational efficiency. CABEL represents the data stored in each silo as a compact aggregate of samples called the silo signature. The compact representation provides computational efficiency and privacy preservation but makes it challenging to produce accurate analytics. To resolve this challenge, we formulate the problem of attribute domain sampling and reconstruction, and propose a solution called the Chebyshev subset. To model collaborative efforts to analyze semantically linked but structurally disconnected databases, CABEL utilizes a new ensemble learning technique termed the weighted bagging of base classifiers. We demonstrate the effectiveness of CABEL by testing with a nationwide health-insurance data set containing approximately 4,182,000,000 records collected from the entire population of an Organisation for Economic Co-operation and Development (OECD) country in 2012. In our binary classification tests, CABEL achieved median recall, precision, and F-measure values of 89%, 64%, and 76%, respectively, although only 0.001-0.00001% of the original data was used for model construction, while maintaining data privacy and computational efficiency.
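The abstract names a "weighted bagging of base classifiers" but does not say how the weights are chosen or how the votes are combined. As a rough illustration of the general idea only, the sketch below combines binary predictions from one classifier per silo by a weighted vote. The function `weighted_vote`, the three toy threshold classifiers, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Sequence

# A base classifier maps a feature vector to a binary label (0 or 1).
Classifier = Callable[[Sequence[float]], int]


def weighted_vote(
    classifiers: Sequence[Classifier],
    weights: Sequence[float],
    x: Sequence[float],
) -> int:
    """Return 1 if the weighted share of positive votes is at least one half.

    Illustrative assumption: a plain weighted majority vote. The actual
    combination rule used by CABEL is not described in the abstract.
    """
    if len(classifiers) != len(weights):
        raise ValueError("need exactly one weight per classifier")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Sum the weights of the classifiers that vote for the positive class.
    positive = sum(w for clf, w in zip(classifiers, weights) if clf(x) == 1)
    return 1 if positive / total >= 0.5 else 0


# Hypothetical stand-ins for classifiers trained separately in three silos.
def silo_a(x: Sequence[float]) -> int:
    return int(x[0] > 0.5)


def silo_b(x: Sequence[float]) -> int:
    return int(x[1] > 0.5)


def silo_c(x: Sequence[float]) -> int:
    return int(x[0] + x[1] > 1.5)


silo_classifiers = [silo_a, silo_b, silo_c]
silo_weights = [3, 1, 1]  # hypothetical: silo A is trusted most

label_first = weighted_vote(silo_classifiers, silo_weights, (0.9, 0.1))
label_second = weighted_vote(silo_classifiers, silo_weights, (0.1, 0.9))
```

With these toy weights, a single heavily weighted silo can outvote the other two, which an unweighted majority vote would not allow. Only predictions cross silo boundaries in this sketch, not raw records.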

