A. Wahl, Christian Sauerhammer, Peter K. Schwab, Sebastian Herbst, R. Lenz
{"title":"Query-Driven Data Profiling with OCEANProfile","authors":"A. Wahl, Christian Sauerhammer, Peter K. Schwab, Sebastian Herbst, R. Lenz","doi":"10.1145/3242153.3242154","DOIUrl":"https://doi.org/10.1145/3242153.3242154","url":null,"abstract":"Complex data analysis scenarios often require discovering and combining multiple data sources. Data scientists usually formulate a series of SQL queries building on each other, also called a session, to iteratively derive results. However, due to a lack of familiarity with data sources or the complexity of query results, it can be a hard task to decide on the next query iteration solely based on the results of the last one. While existing approaches provide mechanisms to assess the results of a specific query, support for analyzing results in the context of the respective session remains mostly absent. Such approaches do also not seamlessly integrate with established tools and workflows. To overcome these problems, we introduce OCEANProfile, a framework for session-based profiling of query results. Query results are intercepted at driver level and streamed into our framework for automated data profiling. Result profiles can be compared with those of previous queries and visualized in a companion app compatible with existing analysis tools. Visualizations are automatically ranked according to their usefulness in the context of the respective session.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130331395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Oracle TimesTen Scaleout: A New Scale-Out In-Memory Database Architecture for Extreme OLTP","authors":"Y. Chou, A. Raghavan, T. Lahiri","doi":"10.1145/3242153.3271881","DOIUrl":"https://doi.org/10.1145/3242153.3271881","url":null,"abstract":"Oracle TimesTen Scaleout is a shared-nothing scale-out inmemory database designed for extreme OLTP workloads, such as IoT, real-time fraud detection, telecommunications etc. TimesTen Scaleout features rich SQL including complex queries with joins, aggregations and analytic functions, transparent distributed execution, full ACID multi-statement transactions, global secondary indexes and sequences. The design features built-in high availability using a k-safe data duplication mechanism, and transparent failure handling in order to minimize application down time. All management functions such as installation, configuration, and monitoring are provided via a centralized management repository. We describe some of the challenges in developing such a scale-out in-memory database architecture and show extreme performance results exceeding 100 million transactions per second and 1 billion SQL selects per second.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"32 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114270739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","authors":"","doi":"10.1145/3242153","DOIUrl":"https://doi.org/10.1145/3242153","url":null,"abstract":"","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116248382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D. Foroni, Cristian Axenie, S. Bortoli, Mohamad Al Hajj Hassan, Ralph Acker, R. Tudoran, G. Brasche, Yannis Velegrakis
{"title":"Moira: A Goal-Oriented Incremental Machine Learning Approach to Dynamic Resource Cost Estimation in Distributed Stream Processing Systems","authors":"D. Foroni, Cristian Axenie, S. Bortoli, Mohamad Al Hajj Hassan, Ralph Acker, R. Tudoran, G. Brasche, Yannis Velegrakis","doi":"10.1145/3242153.3242160","DOIUrl":"https://doi.org/10.1145/3242153.3242160","url":null,"abstract":"The need for real-time analysis is still spreading and the number of available streaming sources is increasing. The recent literature has plenty of works on Data Stream Processing (DSP). In a streaming environment, the data incoming rate varies over time. The challenge is how to efficiently deploy these applications in a cluster. Several works have been conducted on improving the latency of the system or to minimize the allocated resources per application through time. However, to the best of our knowledge, none of the existing works takes into consideration the user needs for a specific application, which is different from one user to another. In this paper, we propose Moria, a goal-oriented framework for dynamically optimizing the resource allocation built on top of Apache Flink. The system takes actions based on the user application and on the incoming data characteristics (i.e., input rate and window size). Starting from an initial estimation of the resources needed for the user query, at each iteration we improve our cost function with the collected metrics from the monitored system about the incoming data, to fulfill the user needs. We present a series of experiments that show in which cases our dynamic estimation outperforms the baseline Apache Flink and the thumb rule estimation alone performed at the deployment of the applications.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124676771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matthias Sax, Guozhang Wang, M. Weidlich, J. Freytag
{"title":"Streams and Tables: Two Sides of the Same Coin","authors":"Matthias Sax, Guozhang Wang, M. Weidlich, J. Freytag","doi":"10.1145/3242153.3242155","DOIUrl":"https://doi.org/10.1145/3242153.3242155","url":null,"abstract":"Stream processing has emerged as a paradigm for applications that require low-latency evaluation of operators over unbounded sequences of data. Defining the semantics of stream processing is challenging in the presence of distributed data sources. The physical and logical order of data in a stream may become inconsistent in such a setting. Existing models either neglect these inconsistencies or handle them by means of data buffering and reordering techniques, thereby compromising processing latency. In this paper, we introduce the Dual Streaming Model to reason about physical and logical order in data stream processing. This model presents the result of an operator as a stream of successive updates, which induces a duality of results and streams. As such, it provides a natural way to cope with inconsistencies between the physical and logical order of streaming data in a continuous manner, without explicit buffering and reordering. We further discuss the trade-offs and challenges faced when implementing this model in terms of correctness, latency, and processing cost. A case study based on Apache Kafka illustrates the effectiveness of our model in the light of real-world requirements.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131120092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fei Song, Boyao Zhou, Quan Sun, Wang Sun, Shiwen Xia, Y. Diao
{"title":"Anomaly Detection and Explanation Discovery on Event Streams","authors":"Fei Song, Boyao Zhou, Quan Sun, Wang Sun, Shiwen Xia, Y. Diao","doi":"10.1145/3242153.3242158","DOIUrl":"https://doi.org/10.1145/3242153.3242158","url":null,"abstract":"As enterprise information systems are collecting event streams from various sources, the ability of a system to automatically detect anomalous events and further provide human readable explanations is of paramount importance. In this position paper, we argue for the need of a new type of data stream analytics that can address anomaly detection and explanation discovery in a single, integrated system, which not only offers increased business intelligence, but also opens up opportunities for improved solutions. In particular, we propose a two-pass approach to building such a system, highlight the challenges, and offer initial directions for solutions.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127831923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alok Pareek, Bhushan Khaladkar, Rajkumar Sen, Basar Onat, V. Nadimpalli, L. Mahadevan
{"title":"Real-time ETL in Striim","authors":"Alok Pareek, Bhushan Khaladkar, Rajkumar Sen, Basar Onat, V. Nadimpalli, L. Mahadevan","doi":"10.1145/3242153.3242157","DOIUrl":"https://doi.org/10.1145/3242153.3242157","url":null,"abstract":"In the new digital economy, on demand access of real time enterprise data is critical to modernize cross organizational, cross partner, and online consumer functions. In addition to on premise legacy data, enterprises are producing an enormous amount of real-time data through new hybrid cloud applications; these event streams need to be collected, transformed and analyzed in real-time to make critical business decision. Traditional Extract-Load-Transform (ETL) processes are no longer sufficient and need to be re-architected to account for streaming, heterogeneity, usability, extensibility (custom processing), and continuous validity. Striim is a novel end-to-end distributed streaming ETL and intelligence platform that enables rapid development and deployment of streaming applications. Striim's real-time ETL engine has been architected from ground-up to enable both business users and developers to build and deploy streaming applications. In this paper, we describe some of the core features of Striim's ETL engine (i) built-in adapters to extract and load data in real-time from legacy and new cloud sources/targets (ii) an extensible SQL-based transformation engine to transform events; users can inject custom logic via a component called Open Processor (iv) New primitives like MODIFY, BEFORE and AFTER and (v) built-in data validation that continuously checks if everything is continually making it to the destination.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116439741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Silverio Martínez-Fernández, P. Jovanovic, Xavier Franch, Andreas Jedlitschka
{"title":"Towards Automated Data Integration in Software Analytics","authors":"Silverio Martínez-Fernández, P. Jovanovic, Xavier Franch, Andreas Jedlitschka","doi":"10.1145/3242153.3242159","DOIUrl":"https://doi.org/10.1145/3242153.3242159","url":null,"abstract":"Software organizations want to be able to base their decisions on the latest set of available data and the real-time analytics derived from them. In order to support \"real-time enterprise\" for software organizations and provide information transparency for diverse stakeholders, we integrate heterogeneous data sources about software analytics, such as static code analysis, testing results, issue tracking systems, network monitoring systems, etc. To deal with the heterogeneity of the underlying data sources, we follow an ontology-based data integration approach in this paper and define an ontology that captures the semantics of relevant data for software analytics. Furthermore, we focus on the integration of such data sources by proposing two approaches: a static and a dynamic one. We first discuss the current static approach with a predefined set of analytic views representing software quality factors and further envision how this process could be automated in order to dynamically build custom user analysis using a semi-automatic platform for managing the lifecycle of analytics infrastructures.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116797021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhaoqiang Chen, Qun Chen, Boyi Hou, Ahmed Murtadha, Zhanhuai Li
{"title":"Improving Machine-based Entity Resolution with Limited Human Effort: A Risk Perspective","authors":"Zhaoqiang Chen, Qun Chen, Boyi Hou, Ahmed Murtadha, Zhanhuai Li","doi":"10.1145/3242153.3242156","DOIUrl":"https://doi.org/10.1145/3242153.3242156","url":null,"abstract":"Pure machine-based solutions usually struggle in the challenging classification tasks such as entity resolution (ER). To alleviate this problem, a recent trend is to involve the human in the resolution process, most notably the crowdsourcing approach. However, it remains very challenging to effectively improve machine-based entity resolution with limited human effort. In this paper, we investigate the problem of human and machine cooperation for ER from a risk perspective. We propose to select the machine-labeled instances at high risk of being mislabeled for manual verification. For this task, we present a risk model that takes into consideration the human-labeled instances as well as the output of machine resolution. Finally, we evaluate the performance of the proposed risk model on real data. Our experiments demonstrate that it can pick up the mislabeled instances with considerably higher accuracy than the existing alternatives. Provided with the same amount of human cost budget, it can also achieve better resolution quality than the state-of-the-art approach based on active learning.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128074684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performing OLAP over Graph Data: Query Language, Implementation, and a Case Study","authors":"Leticia I. Gómez, B. Kuijpers, A. Vaisman","doi":"10.1145/3129292.3129293","DOIUrl":"https://doi.org/10.1145/3129292.3129293","url":null,"abstract":"In current Big Data scenarios, traditional data warehousing and Online Analytical Processing (OLAP) operations on cubes are clearly not sufficient to address the current data analysis requirements. Nevertheless, OLAP operations and models can expand the possibilities of graph analysis beyond the traditional graph-based computation. In spite of this, there is not much work on the problem of taking OLAP analysis to the graph data model. In previous work we proposed a multidimensional (MD) data model for graph analysis, that considers not only the basic graph data, but background information in the form of dimension hierarchies as well. The graphs in our model are node- and edge-labelled directed multi-hypergraphs, called graphoids, defined at several different levels of granularity. In this paper we show how we implemented this proposal over the widely used Neo4J graph database, discuss implementation issues, and present a detailed case study to show how OLAP operations can be used on graphs.","PeriodicalId":407894,"journal":{"name":"Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114388126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}