Computing Mutual Information of Big Categorical Data and Its Application to Feature Grouping

2020 IEEE 36th International Conference on Data Engineering (ICDE). Pub Date: 2020-04-01. DOI: 10.1109/ICDE48307.2020.00210

Junli Li, Chaowei Zhang, Jifu Zhang, X. Qin

Citations: 2

Abstract

This paper develops MiCS, a parallel computing system for mutual information of big categorical data on the Spark computing platform. By applying a column-wise transformation scheme, the MiCS algorithm is conducive to processing the large number of highly repetitive mutual-information calculations among feature pairs. To improve the efficiency of MiCS and the utilization of Spark cluster resources, we adopt a virtual partitioning scheme that balances load while mitigating the data skew problem in the Spark Shuffle process.
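For readers unfamiliar with the quantity being computed, the sketch below shows the empirical mutual information of two categorical features, I(X;Y) = Σ p(x,y) log2( p(x,y) / (p(x) p(y)) ), evaluated for every feature pair of a small dataset stored column by column. It is a single-machine illustration in plain Python, not the MiCS implementation: the function names `mutual_information` and `pairwise_mutual_information` and the dictionary-of-columns layout are assumptions made for this example, and the abstract does not specify how MiCS performs its column-wise transformation or its virtual partitioning on Spark.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two categorical columns.

    Hypothetical helper for illustration only; not taken from MiCS.
    """
    if len(xs) != len(ys):
        raise ValueError("columns must have the same length")
    n = len(xs)
    if n == 0:
        return 0.0

    count_x = Counter(xs)            # marginal counts of X
    count_y = Counter(ys)            # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)

    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x[x] * count_y[y])
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def pairwise_mutual_information(columns):
    """Compute I(X;Y) for every unordered pair of features.

    `columns` maps a feature name to its list of values (an assumed
    column-wise layout), so each feature is read as one contiguous sequence.
    """
    return {
        (a, b): mutual_information(columns[a], columns[b])
        for a, b in combinations(sorted(columns), 2)
    }


# Toy data: "a" and "b" are identical, "c" is independent of both.
columns = {
    "a": [0, 0, 1, 1],
    "b": [0, 0, 1, 1],
    "c": [0, 1, 0, 1],
}
scores = pairwise_mutual_information(columns)
```

With d features there are d(d-1)/2 pairs, and every pair repeats the same counting pattern over the same columns. This is the kind of large, repetitive workload the abstract refers to when motivating a parallel system.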
