Data Mining and Applied Linear Algebra

International Conference on Informatics Education and Research for Knowledge-Circulating Society (icks 2008) Pub Date : 2008-01-17 DOI:10.1109/ICKS.2008.39

M. Chu


Citations: 1

Abstract

In this era of hyper-technological innovation, massive amounts of data are generated at almost every level of application in almost every discipline. Extracting interesting knowledge from raw data, or data mining in a broader sense, has become an indispensable task. However, data collected from complex phenomena often represent the integrated result of several interrelated variables, and these variables are themselves less precisely defined. The basic principle of data mining is to determine which variables are related to which, and how they are related. In many situations, the digitized information is gathered and stored as a data matrix. It is often the case, or at least assumed, that the exogenous variables depend linearly on the endogenous variables. Retrieving "useful" information can therefore often be characterized as finding a "suitable" matrix factorization. This paper offers a synopsis, from this perspective, of how linear algebra techniques can help carry out the task of data mining. Examples from factor analysis, cluster analysis, latent semantic indexing, and link analysis demonstrate how matrix factorization helps uncover hidden connections and speed up computation. Low-rank matrix approximation plays a fundamental role in cleaning and compressing the data. Other types of constraints, such as nonnegativity, are also briefly discussed.
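
As an illustration of the role the abstract assigns to low-rank approximation, the following is a minimal sketch that uses a truncated singular value decomposition, which is one standard way to obtain a best rank-k approximation in the Frobenius norm. It is not taken from the paper: the function name `low_rank_approx`, the matrix sizes, the noise level, and the synthetic data are all assumptions chosen for the example.

```python
import numpy as np


def low_rank_approx(A, k):
    """Rank-k approximation of A via truncated SVD (hypothetical helper)."""
    # Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the k leading singular triplets.
    return (U[:, :k] * s[:k]) @ Vt[:k, :]


# Synthetic data (an assumption for illustration): a rank-2 "signal"
# matrix of size 50 x 20, perturbed by small Gaussian noise.
rng = np.random.default_rng(0)
signal = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 20))
noisy = signal + 0.01 * rng.standard_normal((50, 20))

# "Cleaning": project the noisy data back onto its two dominant directions.
denoised = low_rank_approx(noisy, k=2)

# Distance of each matrix from the underlying signal (Frobenius norm).
err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)

# "Compressing": the rank-2 factors need 50*2 + 2 + 2*20 numbers
# instead of the 50*20 entries of the full matrix.
stored_full = noisy.size
stored_factors = 50 * 2 + 2 + 2 * 20
```

Comparing `err_denoised` with `err_noisy` shows the cleaning effect on this synthetic example, and comparing `stored_factors` with `stored_full` shows the compression. In practice the rank k is not known in advance and has to be chosen, for instance by inspecting how quickly the singular values decay.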

