The Data Problem in Data Mining

SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining. Pub Date: 2015-05-21. DOI: 10.1145/2783702.2783706

Albrecht Zimmermann

Citations: 14

Abstract

Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data. In this position paper, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking insight into three key aspects: 1) how pattern mining algorithms perform quantitatively, 2) how to choose parameter settings, and 3) how to relate found patterns to the processes that generated the data. I illustrate the issue by surveying existing work in light of these concerns and pointing to the (relatively few) papers that have attempted to fill in the gaps. I argue further that progress regarding those questions is held back by a lack of data with varying, controlled properties, and that this lack is unlikely to be remedied by the ever-increasing collection of real-life data. Instead, I am convinced that we will need to make a science of digital data generation, and use it to develop guidance for data practitioners.


