Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data: Latest Articles

Frequent Causal Pattern Mining: A Computationally Efficient Framework For Estimating Bias-Corrected Effects.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2019-12-01 Epub Date: 2020-02-24 DOI: 10.1109/bigdata47090.2019.9005977
Pranjul Yadav, Pedro J Caraballo, Michael Steinbach, Vipin Kumar, M Regina Castro, Gyorgy Simon
Abstract: Our aging population increasingly suffers from multiple chronic diseases simultaneously, necessitating the comprehensive treatment of these conditions. Finding the optimal set of drugs for a combinatorial set of diseases is a combinatorial pattern exploration problem. Association rule mining is a popular tool for such problems, but the requirement of health care for finding causal, rather than associative, patterns renders association rule mining unsuitable. To address this issue, we propose a novel framework based on the Rubin-Neyman causal model for extracting causal rules from observational data, correcting for a number of common biases. Specifically, given a set of interventions and a set of items that define subpopulations (e.g., diseases), we wish to find all subpopulations in which effective intervention combinations exist and in each such subpopulation, we wish to find all intervention combinations such that dropping any intervention from this combination will reduce the efficacy of the treatment. A key aspect of our framework is the concept of closed intervention sets which extend the concept of quantifying the effect of a single intervention to a set of concurrent interventions. Closed intervention sets also allow for a pruning strategy that is strictly more efficient than the traditional pruning strategy used by the Apriori algorithm. To implement our ideas, we introduce and compare five methods of estimating causal effect from observational data and rigorously evaluate them on synthetic data to mathematically prove (when possible) why they work. We also evaluated our causal rule mining framework on the Electronic Health Records (EHR) data of a large cohort of 152,000 patients from Mayo Clinic and showed that the patterns we extracted are sufficiently rich to explain the controversial findings in the medical literature regarding the effect of a class of cholesterol drugs on Type-II Diabetes Mellitus (T2DM).
Pages: 1981-1990
Citations: 2
Coarse Graining of Data via Inhomogeneous Diffusion Condensation.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2019-12-01 Epub Date: 2020-02-24 DOI: 10.1109/BigData47090.2019.9006013
Nathan Brugnone, Alex Gonopolskiy, Mark W Moyle, Manik Kuchroo, David van Dijk, Kevin R Moon, Daniel Colon-Ramos, Guy Wolf, Matthew J Hirn, Smita Krishnaswamy
Abstract: Big data often has emergent structure that exists at multiple levels of abstraction, which are useful for characterizing complex interactions and dynamics of the observations. Here, we consider multiple levels of abstraction via a multiresolution geometry of data points at different granularities. To construct this geometry we define a time-inhomogeneous diffusion process that effectively condenses data points together to uncover nested groupings at larger and larger granularities. This inhomogeneous process creates a deep cascade of intrinsic low pass filters on the data affinity graph that are applied in sequence to gradually eliminate local variability while adjusting the learned data geometry to increasingly coarser resolutions. We provide visualizations to exhibit our method as a "continuously-hierarchical" clustering with directions of eliminated variation highlighted at each step. The utility of our algorithm is demonstrated via neuronal data condensation, where the constructed multiresolution data geometry uncovers the organization, grouping, and connectivity between neurons.
Pages: 2624-2633
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7398322/pdf/nihms-1612101.pdf
Citations: 0
A Streaming model for Generalized Rayleigh with extension to Minimum Noise Fraction.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2019-12-01 Epub Date: 2020-02-24 DOI: 10.1109/BigData47090.2019.9006512
Soumyajit Gupta, Chandrajit Bajaj
Abstract: The Rayleigh quotient optimization is the maximization of a rational function, or a max-min problem, with simultaneous maximization of the numerator function and minimization of the denominator function. Here, we describe a low-rank, streaming solution for Rayleigh quotient optimization applicable to big-data scenarios where the data matrix is too large to be fully loaded into main memory. We apply this to maximizing the signal-to-noise ratio of big data, for both very large static and dynamic data. Our implementation is shown to achieve faster processing time compared to a standard data read into memory. We demonstrate the trade-offs with synthetic and real data, on different scales, to validate the approach in terms of accuracy, speed and storage.
Pages: 74-83
Citations: 1
Large-Scale Validation of Hypothesis Generation Systems via Candidate Ranking.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2018-12-01 Epub Date: 2019-01-24 DOI: 10.1109/bigdata.2018.8622637
Justin Sybrandt, Michael Shtutman, Ilya Safro
Abstract: The first step of many research projects is to define and rank a short list of candidates for study. Given the rapid pace of modern scientific progress, some turn to automated hypothesis generation (HG) systems to aid this process. These systems can identify implicit or overlooked connections within a large scientific corpus, and while their importance grows alongside the pace of science, they lack thorough validation. Without any standard numerical evaluation method, many validate general-purpose HG systems by rediscovering a handful of historical findings, and some wishing to be more thorough may run laboratory experiments based on automatic suggestions. These methods are expensive, time consuming, and cannot scale. Thus, we present a numerical evaluation framework for the purpose of validating HG systems that leverages thousands of validation hypotheses. This method evaluates an HG system by its ability to rank hypotheses by plausibility, a process reminiscent of human candidate selection. Because HG systems, specifically those that produce topic models, do not produce a ranking criterion, we additionally present novel metrics to quantify the plausibility of hypotheses given topic model system output. Finally, we demonstrate that our proposed validation method aligns with real-world research goals by deploying our method within MOLIERE, our recent topic-driven HG system, in order to automatically generate a set of candidate genes related to HIV-associated neurodegenerative disease (HAND). By performing laboratory experiments based on this candidate set, we discover a new connection between HAND and Dead Box RNA Helicase 3 (DDX3).
Reproducibility: code, validation data, and results can be found at sybrandt.com/2018/validation.
Pages: 1494-1503
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9248026/pdf/nihms-1819102.pdf
Citations: 0
Interactive Machine Learning by Visualization: A Small Data Solution.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2018-12-01 Epub Date: 2019-01-24 DOI: 10.1109/BigData.2018.8621952
Huang Li, Shiaofen Fang, Snehasis Mukhopadhyay, Andrew J Saykin, Li Shen
Abstract: Machine learning algorithms and traditional data mining processes usually require a large volume of data to train the algorithm-specific models, with little or no user feedback during the model building process. Such a "big data"-based automatic learning strategy is sometimes unrealistic for applications where data collection or processing is very expensive or difficult, such as in clinical trials. Furthermore, expert knowledge can be very valuable in the model building process in some fields such as biomedical sciences. In this paper, we propose a new visual analytics approach to interactive machine learning and visual data mining. In this approach, multi-dimensional data visualization techniques are employed to facilitate user interactions with the machine learning and mining process. This allows dynamic user feedback in different forms, such as data selection, data labeling, and data correction, to enhance the efficiency of model building. In particular, this approach can significantly reduce the amount of data required for training an accurate model, and therefore can be highly impactful for applications where a large amount of data is hard to obtain. The proposed approach is tested on two application problems: the handwriting recognition (classification) problem and the human cognitive score prediction (regression) problem. Both experiments show that visualization-supported interactive machine learning and data mining can achieve the same accuracy as an automatic process can with much smaller training data sets.
Volume: 2018; Pages: 3513-3521
Citations: 19
Twitter Health Surveillance (THS) System.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2018-12-01 Epub Date: 2019-01-24 DOI: 10.1109/BigData.2018.8622504
Manuel Rodríguez-Martínez, Cristian C Garzón-Alfonso
Abstract: We present the Twitter Health Surveillance (THS) application framework. THS is designed as an integrated platform to help health officials collect tweets, determine if they are related to a medical condition, extract metadata out of them, and create a big data warehouse that can be used to further analyze the data. THS is built atop open source tools and provides the following value-added services: Data Acquisition, Tweet Classification, and Big Data Warehousing. In order to validate THS, we have created a collection of roughly twelve thousand labelled tweets. These tweets contain one or more target medical terms, and the labels indicate if the tweet is related or not to a medical condition. We used this collection to test various models based on LSTM and GRU recurrent neural networks. Our experiments show that we can classify tweets with 96% precision, 92% recall, and 91% F1 score. These results compare favorably with recent research in this area, and show the promise of our THS system.
Volume: 2018; Pages: 1647-1654
Citations: 10
Knowledge-Guided Bayesian Support Vector Machine for High-Dimensional Data with Application to Analysis of Genomics Data.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2018-12-01 Epub Date: 2019-01-24 DOI: 10.1109/BigData.2018.8622484
Wenli Sun, Changgee Chang, Yize Zhao, Qi Long
Abstract: Support vector machine (SVM) is a popular classification method for the analysis of a wide range of data including big data. Many SVM methods with feature selection have been developed under frequentist regularization or Bayesian shrinkage frameworks. On the other hand, the importance of incorporating a priori known biological knowledge, such as gene pathway information which stems from the gene regulatory network, into the statistical analysis of genomic data has been recognized in recent years. In this article, we propose a new Bayesian SVM approach that enables the feature selection to be guided by knowledge of the graphical structure among predictors. The proposed method uses the spike-and-slab prior for feature selection, combined with the Ising prior that encourages group-wise selection of the predictors adjacent to each other on the known graph. A Gibbs sampling algorithm is used for Bayesian inference. The performance of our method is evaluated and compared with existing SVM methods in terms of prediction and feature selection in extensive simulation settings. In addition, our method is illustrated in the analysis of genomic data from a cancer study, demonstrating its advantage in generating biologically meaningful results and identifying potentially important features.
Volume: 2018; Pages: 1484-1493
Citations: 7
VIGAN: Missing View Imputation with Generative Adversarial Networks.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2017-01-01 Epub Date: 2018-01-15 DOI: 10.1109/BigData.2017.8257992
Chao Shang, Aaron Palmer, Jiangwen Sun, Ko-Shin Chen, Jin Lu, Jinbo Bi
Abstract: In an era when big data are becoming the norm, there is less concern with the quantity but more with the quality and completeness of the data. In many disciplines, data are collected from heterogeneous sources, resulting in multi-view or multi-modal datasets. The missing data problem has been challenging to address in multi-view data analysis. In particular, when certain samples lack an entire view of data, the missing view problem arises. Classic multiple imputation or matrix completion methods are hardly effective here, because the specific view contains no information on which to base the imputation for such samples. The commonly-used simple method of removing samples with a missing view can dramatically reduce sample size, thus diminishing the statistical power of a subsequent analysis. In this paper, we propose a novel approach for view imputation via generative adversarial networks (GANs), which we name VIGAN. This approach first treats each view as a separate domain and identifies domain-to-domain mappings via a GAN using randomly-sampled data from each view, and then employs a multi-modal denoising autoencoder (DAE) to reconstruct the missing view from the GAN outputs based on paired data across the views. Then, by optimizing the GAN and DAE jointly, our model enables the knowledge integration for domain mappings and view correspondences to effectively recover the missing view. Empirical results on benchmark datasets validate the VIGAN approach by comparing against the state of the art. The evaluation of VIGAN in a genetic study of substance use disorders further proves the effectiveness and usability of this approach in life science.
Volume: 2017; Pages: 766-775
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5813842/pdf/nihms918595.pdf
Citations: 0
An Interactive Learning Framework for Scalable Classification of Pathology Images.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2015-10-01 Epub Date: 2015-12-28 DOI: 10.1109/BigData.2015.7363841
Michael Nalisnik, David A Gutman, Jun Kong, Lee Ad Cooper
Abstract: Recent advances in microscopy imaging and genomics have created an explosion of patient data in the pathology domain. Whole-slide images (WSIs) of tissues can now capture disease processes as they unfold in high resolution, recording the visual cues that have been the basis of pathologic diagnosis for over a century. Each WSI contains billions of pixels and up to a million or more microanatomic objects whose appearances hold important prognostic information. Computational image analysis enables the mining of massive WSI datasets to extract quantitative morphologic features describing the visual qualities of patient tissues. When combined with genomic and clinical variables, this quantitative information provides scientists and clinicians with insights into disease biology and patient outcomes. To facilitate interaction with this rich resource, we have developed a web-based machine-learning framework that enables users to rapidly build classifiers using an intuitive active learning process that minimizes data labeling effort. In this paper we describe the architecture and design of this system, and demonstrate its effectiveness through quantification of glioma brain tumors.
Volume: 2015; Pages: 928-935
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5082843/pdf/
Citations: 0
Big Data Provenance: Challenges, State of the Art and Opportunities.
Proceedings : ... IEEE International Conference on Big Data. IEEE International Conference on Big Data Pub Date : 2015-10-01 Epub Date: 2015-12-28 DOI: 10.1109/BigData.2015.7364047
Jianwu Wang, Daniel Crawl, Shweta Purawat, Mai Nguyen, Ilkay Altintas
Abstract: The ability to track provenance is a key feature of scientific workflows to support data lineage and reproducibility. The challenges that are introduced by the volume, variety and velocity of Big Data also pose related challenges for provenance and quality of Big Data, defined as veracity. The increasing size and variety of distributed Big Data provenance information bring new technical challenges and opportunities throughout the provenance lifecycle including recording, querying, sharing and utilization. This paper discusses the challenges and opportunities of Big Data provenance related to the veracity of the datasets themselves and the provenance of the analytical processes that analyze these datasets. It also explains our current efforts towards tracking and utilizing Big Data provenance using workflows as a programming model to analyze Big Data.
Volume: 2015; Pages: 2509-2516
Citations: 78