抽干数据沼泽:基于相似性的方法

Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.) Pub Date : 2018-06-10 DOI:10.1145/3209900.3209911

Will Brackenbury, Rui Liu, Mainack Mondal, Aaron J. Elmore, Blase Ur, K. Chard, M. Franklin

{"title":"抽干数据沼泽:基于相似性的方法","authors":"Will Brackenbury, Rui Liu, Mainack Mondal, Aaron J. Elmore, Blase Ur, K. Chard, M. Franklin","doi":"10.1145/3209900.3209911","DOIUrl":null,"url":null,"abstract":"While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places increasing strain on users who wish to make use of the data. So called \"data lakes\" embrace the storage of data in its natural form, integrating and organizing in a Pay-as-you-go fashion. While this model defers the upfront cost of integration, the result is that data is unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management -- when this is neglected, \"data lakes\" become \"data swamps.\" Prior work suggests that pure computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the necessary features from repositories to provide useful information to end-user data scientists, or make effective data management decisions on their behalf. We argue that the combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions by which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making using real-world data repository and file systems, and suggest how an online user study could be developed to further validate this hypothesis.","PeriodicalId":92279,"journal":{"name":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","volume":"47 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2018-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"25","resultStr":"{\"title\":\"Draining the Data Swamp: A Similarity-based Approach\",\"authors\":\"Will Brackenbury, Rui Liu, Mainack Mondal, Aaron J. Elmore, Blase Ur, K. Chard, M. Franklin\",\"doi\":\"10.1145/3209900.3209911\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places increasing strain on users who wish to make use of the data. So called \\\"data lakes\\\" embrace the storage of data in its natural form, integrating and organizing in a Pay-as-you-go fashion. While this model defers the upfront cost of integration, the result is that data is unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management -- when this is neglected, \\\"data lakes\\\" become \\\"data swamps.\\\" Prior work suggests that pure computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the necessary features from repositories to provide useful information to end-user data scientists, or make effective data management decisions on their behalf. We argue that the combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions by which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making using real-world data repository and file systems, and suggest how an online user study could be developed to further validate this hypothesis.\",\"PeriodicalId\":92279,\"journal\":{\"name\":\"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)\",\"volume\":\"47 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"25\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3209900.3209911\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3209900.3209911","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 25

摘要

虽然诸如文件系统和存储库之类的分层名称空间长期以来一直用于组织数据，但数据生产的快速增长给希望使用数据的用户带来了越来越大的压力。所谓的“数据湖”以自然的形式存储数据，以现收现付的方式进行集成和组织。虽然该模型推迟了集成的前期成本，但结果是数据在处理之前无法用于发现或分析。因此，数据科学家被迫在数据发现、清理、集成和管理等日常任务上花费大量时间和精力——如果忽视这些，“数据湖”就会变成“数据沼泽”。先前的工作表明，纯计算方法解决数据发现和管理组件的问题是不够的。在这里，我们提供了证据来证实这一假设，表明自动文件聚类等方法无法从存储库中提取必要的特征，从而为最终用户数据科学家提供有用的信息，或者代表他们做出有效的数据管理决策。我们认为，需要将指定文件相似度的框架和人在循环中的交互框架结合起来，以帮助实现自动化组织。我们在这里提出了一个初步步骤，将项目可能被认为相似的几个维度进行分类:数据，其来源和当前特征。我们最初在确定可以集成或集体管理的数据的上下文中考虑这个模型。我们还探讨了如何使用当前的方法来使用现实世界的数据存储库和文件系统来自动化决策，并建议如何开发在线用户研究来进一步验证这一假设。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Draining the Data Swamp: A Similarity-based Approach

While hierarchical namespaces such as filesystems and repositories have long been used to organize data, the rapid increase in data production places increasing strain on users who wish to make use of the data. So called "data lakes" embrace the storage of data in its natural form, integrating and organizing in a Pay-as-you-go fashion. While this model defers the upfront cost of integration, the result is that data is unusable for discovery or analysis until it is processed. Thus, data scientists are forced to spend significant time and energy on mundane tasks such as data discovery, cleaning, integration, and management -- when this is neglected, "data lakes" become "data swamps." Prior work suggests that pure computational methods for resolving issues with the data discovery and management components are insufficient. Here, we provide evidence to confirm this hypothesis, showing that methods such as automated file clustering are unable to extract the necessary features from repositories to provide useful information to end-user data scientists, or make effective data management decisions on their behalf. We argue that the combination of frameworks for specifying file similarity and human-in-the-loop interaction is needed to aid automated organization. We propose an initial step here, classifying several dimensions by which items may be considered similar: the data, its origin, and its current characteristics. We initially consider this model in the context of identifying data that can be integrated or managed collectively. We additionally explore how current methods can be used to automate decision making using real-world data repository and file systems, and suggest how an online user study could be developed to further validate this hypothesis.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics. Workshop on Human-In-the-Loop Data Analytics (2nd : 2017 : Chicago, Ill.)

自引率

0.00%

发文量