Taming Big Data: Integrating diverse public data sources for economic competitiveness analytics

Data4U '14 Pub Date : 2014-09-01 DOI: 10.1145/2658840.2658845

R. Neamtu, Ramoza Ahsan, J. Stokes, Armend Hoxha, Jialiang Bao, Stefan Gvozdenovic, Ted Meyer, Nilesh Patel, Raghu Rangan, Yumou Wang, Dongyun Zhang, Elke A. Rundensteiner

In an era where Big Data can greatly impact a broad population, many novel opportunities arise, chief among them the ability to integrate data from diverse sources and "wrangle" it to extract novel insights. Conceived as a tool that can help both expert and non-expert users better understand public data, MATTERS was collaboratively developed by the Massachusetts High Tech Council, WPI and other institutions as an analytic platform offering dynamic modeling capabilities. MATTERS is an integrative data source on high-fidelity cost and talent competitiveness metrics. Its goal is to extract, integrate and model rich economic, financial, educational and technological information, all known to be critical factors influencing the economic competitiveness of states, from renowned heterogeneous web data sources ranging from the US Census Bureau and the Bureau of Labor Statistics to the Institute of Education Sciences.

This demonstration of MATTERS illustrates how we tackle the challenges of data acquisition, cleaning, integration and wrangling into appropriate representations, visualization and story-telling with data in the context of state competitiveness in the high-tech sector.

Citations: 0

A Paradigm for Learning Queries on Big Data

Data4U '14 Pub Date : 2014-09-01 DOI: 10.1145/2658840.2658842

A. Bonifati, Radu Ciucanu, Aurélien Lemay, S. Staworko

Citations: 19

DiNoDB: Efficient Large-Scale Raw Data Analytics

Data4U '14 Pub Date : 2014-09-01 DOI: 10.1145/2658840.2658841

Yongchao Tian, Ioannis Alagiannis, Erietta Liarou, A. Ailamaki, P. Michiardi, M. Vukolic

Modern big data workflows, found, e.g., in machine learning use cases, often involve iterations of cycles of batch analytics and interactive analytics on temporary data. Whereas batch analytics solutions for large volumes of raw data are well established (e.g., Hadoop, MapReduce), state-of-the-art interactive analytics solutions (e.g., distributed shared-nothing RDBMSs) require a data loading and/or transformation phase, which is inherently expensive for temporary data.

In this paper, we propose a novel scalable distributed solution for in-situ data analytics that offers both scalable batch and interactive data analytics on raw data, hence avoiding the loading phase bottleneck of RDBMSs. Our system combines a MapReduce-based platform with the recently proposed NoDB paradigm, which optimizes traditional centralized RDBMSs for in-situ queries of raw files. We revisit NoDB's centralized design and scale it out, supporting multiple clients and data processing nodes, to produce a new distributed data analytics system we call Distributed NoDB (DiNoDB).

DiNoDB leverages MapReduce batch queries to produce critical pieces of metadata (e.g., distributed positional maps and vertical indices) to speed up interactive queries without the overheads of the data loading and data movement phases, allowing users to quickly and efficiently exploit their data.

Our experimental analysis demonstrates that DiNoDB significantly reduces the data-to-query latency with respect to comparable state-of-the-art distributed query engines, like Shark, Hive and HadoopDB.
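The positional maps mentioned in the abstract above can be illustrated with a minimal single-node sketch. It is not the DiNoDB implementation: the function names `build_positional_map` and `read_field`, the in-memory `bytes` input, and the simple comma-separated format without quoted fields are all assumptions made for illustration. The idea shown is only that one scan of the raw file records where each field starts, so a later query can jump straight to a field without re-tokenizing the row or loading the data into a database.

```python
def build_positional_map(raw: bytes, delimiter: bytes = b","):
    """Scan the raw data once and record, per row, the byte offset of each field start.

    Assumption: a plain delimited format with a single-byte delimiter and no quoting.
    """
    positions = []
    offset = 0  # byte offset of the current line within the raw data
    for line in raw.splitlines(keepends=True):
        row_offsets = [offset]  # the first field starts at the line start
        start = 0
        while True:
            idx = line.find(delimiter, start)
            if idx == -1:
                break
            row_offsets.append(offset + idx + 1)  # next field starts after the delimiter
            start = idx + 1
        positions.append(row_offsets)
        offset += len(line)
    return positions


def read_field(raw: bytes, positions, row: int, col: int, delimiter: bytes = b","):
    """Fetch one field by jumping to its recorded offset (hypothetical helper)."""
    start = positions[row][col]
    stop_bytes = delimiter + b"\r\n"
    end = start
    while end < len(raw) and raw[end] not in stop_bytes:
        end += 1
    return raw[start:end].decode()


# Usage: a tiny raw "file" with a header row and two data rows.
RAW = b"id,price,qty\n1,9.5,3\n2,12.0,7\n"
PMAP = build_positional_map(RAW)

# Read the qty column (index 2) of every data row without parsing the other fields.
total_qty = sum(float(read_field(RAW, PMAP, r, 2)) for r in range(1, len(PMAP)))
```

According to the abstract, DiNoDB produces such metadata with MapReduce batch jobs and distributes it across nodes; this sketch keeps everything in one process to show only the lookup idea.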

Citations: 10

An Efficient Processing of k-Dominant Skyline Query in MapReduce

Data4U '14 Pub Date : 2014-09-01 DOI: 10.1145/2658840.2658846

Hao Tian, M. A. Siddique, Y. Morimoto

Citations: 9

Affordable Analytics on Expensive Data

Data4U '14 Pub Date : 2014-09-01 DOI: 10.1145/2658840.2658844

P. Upadhyaya, Martina Unutzer, M. Balazinska, Dan Suciu, Hakan Hacıgümüş

Citations: 2

Latest papers from Data4U '14