Data Reduction Network

Proceedings of the Python in Science Conference. Pub Date: 1900-01-01. DOI: 10.25080/gerudo-f2bc6f59-012

Haoyin Xu, Haw-minn Lu, J. Unpingco

Abstract

Multidimensional categorical data is widespread but not easily visualized using standard methods. For example, questionnaire (e.g., survey) data generally consists of questions with categorical responses (e.g., yes/no, hate/dislike/neutral/like/love). A questionnaire with 10 questions, each with five mutually exclusive responses, therefore has 5^10 (about 9.8 million) possible observations, far more than could reasonably be collected. Hence, this type of dataset is necessarily sparse. Popular methods of handling categorical data include one-hot encoding, which exacerbates the dimensionality problem, and enumeration, which imposes an unwarranted and potentially misleading notional order on the data. To address this, we introduce a novel visualization method named Data Reduction Network (DRN). Using a network-graph structure, the DRN represents each categorical feature as a node and the interrelationships between features as weighted edges. The graph is statistically reduced to reveal the strongest or weakest path-wise relationships between features and to reduce visual clutter. A key advantage is that the DRN does not “lose” features: it represents interrelationships across the entire categorical feature set without eliminating weaker relationships or features. Indeed, the graph representation can be inverted so that the weakest interrelationships are surfaced instead of the strongest. The DRN is a powerful visualization tool for multidimensional categorical data, in particular data derived from surveys and questionnaires.
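
The abstract does not say which statistic weights the edges or how the graph is reduced, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes Cramér's V as the edge weight and a Kruskal spanning tree as the reduction; both choices, the function names, and the tiny survey data are hypothetical. It shows one node per feature, one weighted edge per feature pair, and a reduction that keeps every feature while dropping redundant edges.

```python
from collections import Counter
from itertools import combinations
import math


def cramers_v(xs, ys):
    """Cramér's V between two categorical sequences (0 = no association, 1 = perfect)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    k = min(len(px), len(py)) - 1
    if n == 0 or k == 0:
        return 0.0
    chi2 = 0.0
    for a in px:
        for b in py:
            expected = px[a] * py[b] / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


def feature_graph(columns):
    """Complete weighted graph: one node per feature, one edge per feature pair."""
    return {
        (f, g): cramers_v(columns[f], columns[g])
        for f, g in combinations(sorted(columns), 2)
    }


def reduce_graph(edges, strongest=True):
    """Kruskal spanning tree (assumed reduction): all nodes stay, n - 1 edges stay.

    strongest=True keeps the highest-weight edges; False inverts the view.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    kept = []
    ordered = sorted(edges.items(), key=lambda kv: kv[1], reverse=strongest)
    for (f, g), weight in ordered:
        root_f, root_g = find(f), find(g)
        if root_f != root_g:  # skip edges that would close a cycle
            parent[root_f] = root_g
            kept.append((f, g, weight))
    return kept


# Hypothetical survey: four questions, eight respondents.
columns = {
    "q1": ["yes", "no", "yes", "no", "yes", "no", "yes", "no"],
    "q2": ["like", "hate", "like", "hate", "like", "hate", "like", "hate"],
    "q3": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "q4": ["x", "x", "x", "y", "y", "y", "x", "y"],
}

edges = feature_graph(columns)
strongest = reduce_graph(edges, strongest=True)
weakest = reduce_graph(edges, strongest=False)

for f, g, weight in strongest:
    print(f"{f} -- {g}: {weight:.2f}")
```

Calling `reduce_graph(edges, strongest=False)` gives the inverted view, in which the weakest associations form the tree. In both views all four features remain as nodes, which mirrors the claim that no feature is lost.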
