{"title":"Interactive Data Cleaning for Real-Time Streaming Applications","authors":"Timo Räth, Ngozichukwuka Onah, K. Sattler","doi":"10.1145/3597465.3605229","DOIUrl":"https://doi.org/10.1145/3597465.3605229","url":null,"abstract":"The importance of data cleaning systems has continuously grown in recent years. Especially for real-time streaming applications, it is crucial to identify and possibly remove anomalies in the data on the fly before further processing. The main challenge, however, lies in the construction of an appropriate data cleaning pipeline, which is complicated by the dynamic nature of streaming applications. To simplify this process and help data scientists explore and understand the incoming data, we propose an interactive data cleaning system for streaming applications. In this paper, we list requirements for such a system and present our implementation to overcome the stated issues. Our demonstration shows how a data cleaning pipeline can be interactively created, executed, and monitored at runtime. We also present several different tools, such as the automated advisor and the adaptive visualizer, that engage the user in the data cleaning process and help them understand the behavior of the pipeline.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79395014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facilitating Dependency Exploration in Computational Notebooks","authors":"C. Brown, Hamed Alhoori, D. Koop","doi":"10.1145/3597465.3605222","DOIUrl":"https://doi.org/10.1145/3597465.3605222","url":null,"abstract":"Computational notebooks promote exploration by structuring code, output, and explanatory text, into cells. The input code and rich outputs help users iteratively investigate ideas as they explore or analyze data. The links between these cells--how the cells depend on each other--are important in understanding how analyses have been developed and how the results can be reproduced. Specifically, a code cell that uses a particular identifier depends on the cell where that identifier is defined or mutated. Because notebooks promote fluid editing where cells can be moved and run in any order, cell dependencies are not always clear or easy to follow. We examine different tools that seek to address this problem by extending Jupyter notebooks and evaluate how well they support users in accomplishing tasks that require understanding dependencies. We also evaluate visualization techniques that provide views of the dependencies to help users navigate cell dependencies.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"446 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86857275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overlay Spreadsheets","authors":"Oliver Kennedy, Boris Glavic, Mike Brachmann","doi":"10.1145/3597465.3605220","DOIUrl":"https://doi.org/10.1145/3597465.3605220","url":null,"abstract":"Efforts to scale spreadsheets either follow a 'virtual' strategy that layers a spreadsheet interface on top of an existing database engine or a 'materialized' strategy based on re-engineering a spreadsheet engine. Because databases are not optimized for spreadsheet access patterns, the materialized approach has better performance. However, the virtual approach offers several advantages that cannot be easily replicated in the materialized approach, including the ability to re-apply user interactions to an updated input dataset. We propose the overlay update model, a hybrid approach that overlays user updates on an existing dataset (as in the virtual approach) and indexes user updates (as in the materialized approach). A key feature of our approach is storing updates generated by bulk operations (e.g., copy/paste) as compact \"patterns\" that can be leveraged to reduce execution costs. We implement an overlay spreadsheet over Apache Spark and demonstrate that, compared to DataSpread (a materialized spreadsheet), it can significantly reduce execution costs.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"83 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78923408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camera-First Form Filling: Reducing the Friction in Climate Hazard Reporting","authors":"Kristina Wolf, Dominik Winecki, Arnab Nandi","doi":"10.1145/3597465.3605218","DOIUrl":"https://doi.org/10.1145/3597465.3605218","url":null,"abstract":"The effective reporting of climate hazards, such as flash floods, hurricanes, and earthquakes, is critical. To quickly and correctly assess the situation and deploy resources, emergency services often rely on citizen reports that must be timely, comprehensive, and accurate. The pervasive availability and use of smartphone cameras allow the transmission of dynamic incident information from citizens in near-real-time. While high-quality reporting is beneficial, generating such reports can place an additional burden on citizens who are already suffering from the stress of a climate-related disaster. Furthermore, reporting methods are often challenging to use, due to their length and complexity. In this paper, we explore reducing the friction of climate hazard reporting by automating parts of the form-filling process. By building on existing computer vision and natural language models, we demonstrate the automated generation of a full-form hazard impact assessment report from a single photograph. Our proposed data pipeline can be integrated with existing systems and used with geospatial data solutions, such as flood hazard maps.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"118 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84066372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DIG: The Data Interface Grammar","authors":"Yiru Chen, Jeffery Tao, Eugene Wu","doi":"10.1145/3597465.3605223","DOIUrl":"https://doi.org/10.1145/3597465.3605223","url":null,"abstract":"Building interactive data interfaces is hard because the design of an interface depends on the data processing needs for the underlying analysis task, yet we do not have a good representation for analysis tasks. To fill this gap, this paper advocates for a Data Interface Grammar (DIG) as an intermediate representation of analysis tasks. We show that DIG is compatible with existing data engineering practices, compact to represent any analysis, simple to translate into an interface design, and amenable to offline analysis. We further illustrate the potential benefits of this abstraction, such as automatic interface generation, automatic interface backend optimization, tutorial generation, and workload generation.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"38 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81172319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SliceLens","authors":"Daniel Kerrigan, Enrico Bertini","doi":"10.1145/3597465.3605217","DOIUrl":"https://doi.org/10.1145/3597465.3605217","url":null,"abstract":"SliceLens is a tool for exploring labeled, tabular, machine learning datasets. To explore a dataset, the user selects combinations of features in the dataset that they are interested in. The tool splits those features into bins and then visualizes the label distributions for the subsets of data created by the intersections of the bins. SliceLens guides the user in determining which feature combinations to explore. Guidance is based on a user-selected rating metric, which assigns a score to the subsets created by a given combination of features. The purpose of the metrics is to detect interesting patterns in the subsets, such as subsets that have high label purity or an uneven distribution of errors. SliceLens uses the metrics to guide the user towards combinations of features that create potentially interesting subsets in two ways. First, SliceLens assigns a rating to each feature based on the subsets that would be created by selecting that feature. This incremental guidance can help the user determine which feature to select next. Second, SliceLens can suggest combinations of features ranked according to the chosen metric, which the user can then cycle through.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"219 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74658116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Aggregation Consistency Errors in Semantic Layers and How to Avoid Them","authors":"Zezhou Huang, Pavan Kalyan Damalapati, Eugene Wu","doi":"10.1145/3597465.3605224","DOIUrl":"https://doi.org/10.1145/3597465.3605224","url":null,"abstract":"Analysts often struggle with analyzing data from multiple tables in a database due to their lack of knowledge on how to join and aggregate the data. To address this, data engineers pre-specify \"semantic layers\" which include the join conditions and \"metrics\" of interest with aggregation functions and expressions. However, joins can cause \"aggregation consistency issues\". For example, analysts may observe inflated total revenue caused by double counting from join fanouts. Existing BI tools rely on heuristics for deduplication, resulting in imprecise and challenging-to-understand outcomes. To overcome these challenges, we propose \"weighing\" as a core primitive to counteract join fanouts. \"Weighing\" has been used in various areas, such as market attribution and order management, ensuring metrics consistency (e.g., total revenue remains the same) even for many-to-many joins. The idea is to assign equal weight to each join key group (rather than each tuple) and then distribute the weights among tuples. Implementing weighing techniques necessitates user input; therefore, we recommend a human-in-the-loop framework that enables users to iteratively explore different strategies and visualize the results.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"39 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77290909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate Query Answering over Open Data","authors":"Mengqi Zhang, Pranay Mundra, Chukwubuikem Chikweze, F. Nargesian, G. Weikum","doi":"10.1145/3597465.3605227","DOIUrl":"https://doi.org/10.1145/3597465.3605227","url":null,"abstract":"Open knowledge, including open data and publicly available knowledge bases, offers a rich opportunity for data scientists for analysis and query answering, but comes with big obstacles due to the diverse, noisy, and incomplete nature of its data eco-system. This paper proposes a vision for enabling approximate QUery answering over Open Knowledge (Quok), with a focus on supporting analytic tasks that involve identifying relevant data and computing aggregations. We define the problem, outline a system architecture, and discuss challenges and approaches to taming the uncertainty and incompleteness of open knowledge.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"46 Suppl 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88839030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VALUE","authors":"Kaustav Bhattacharjee, Aritra Dasgupta","doi":"10.1145/3597465.3605225","DOIUrl":"https://doi.org/10.1145/3597465.3605225","url":null,"abstract":"The widespread adoption of open datasets across various domains has emphasized the significance of joining and computing their utility. However, the interplay between computation and human interaction is vital for informed decision-making. To address this issue, we first propose a utility metric to calibrate the usefulness of open datasets when joined with other such datasets. Further, we distill this utility metric through a visual analytic framework called VALUE, which empowers the researchers to identify joinable datasets, prioritize them based on their utility, and inspect the joined dataset. This transparent evaluation of the utility of the joined datasets is implemented through a human-in-the-loop approach where the researchers can adapt and refine the selection criteria according to their mental model of utility. Finally, we demonstrate the effectiveness of our approach through a usage scenario using real-world open datasets.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73107076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Visualizing a Tabular Data Repository to Facilitate Descriptive Tag Augmentation for New Tables","authors":"Jianhao Cao, T. Munzner, R. Pottinger","doi":"10.1145/3597465.3605226","DOIUrl":"https://doi.org/10.1145/3597465.3605226","url":null,"abstract":"Many online tabular datasets are maintained in centralized repositories and annotated with descriptive tags. These tags are helpful for data practitioners to search and understand tables. However, manually annotating descriptive tags for new tables added to a large repository is expensive and may be inconsistent. In this extended abstract, we propose tag inference methods and implement an interactive visual explainer prototype to visualize a table repository with respect to a new table and to help a human user examine whether a recommended tag is suitable for the new table.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"47 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2023-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89621706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}