Information-based massive data retrieval method based on distributed decision tree algorithm

Bin Chen, Qingming Chen, Peishan Ye
Int. J. Model. Simul. Sci. Comput., volume 31 1, pages 2243002:1-2243002:20. Published 2022-06-18. DOI: 10.1142/s1793962322430024 (https://doi.org/10.1142/s1793962322430024)
Citations: 3

Abstract

Building on the distributed decision tree algorithm, this paper first proposes vertically partitioning datasets and synchronously updating a hash table to establish an information-based massive data retrieval method for heterogeneous distributed environments, and it improves the distributed decision tree algorithm with interval segmentation and interval filtering. The algorithm uses an attribute histogram data structure to merge the category list into each attribute list, which reduces the amount of data that must reside in memory. Second, under the strategy of vertically partitioning the dataset and synchronously updating the hash table, the hash table entries to update are selected according to the minimum Gini value, the corresponding entries are modified, and the hash table is used to record and control each sub-site; this gives a high accuracy rate when nodes are split. In addition, for classification problems that satisfy monotonic constraints in a distributed environment, the paper extends the idea of building a monotonic decision tree to the distributed setting: it supplements the distributed decision tree algorithm with a modification rule that converts the generated nonmonotonic decision tree into a monotonic one. To address the high load that the privacy-preserving data stream classification mining algorithm places on a single node, a Storm-based platform for the parallel algorithm PPFDT_P, which builds on the distributed decision tree algorithm, is designed and implemented.
At the same time, the word vector model improves the deep representation of features and resolves the high-dimensional sparseness of features, while the iterative decision tree algorithm GBDT is better suited to dense features that are not high-dimensional. The data retrieval application therefore combines GBDT with the word vector model, using the distributed representation of words (word vectors) to classify short messages with the GBDT model. Experimental results show that the distributed decision tree algorithm has high efficiency, good speed-up, and good scalability, so the number of datasets at each sub-site does not need to be increased at any time; only a small number of data items are inserted. A monotonic decision tree is obtained by splitting some leaf nodes, which adds only a small number of branches. The proposed system achieves a massive data ratio of 54.1% in comparison with other networks.
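The split-selection idea in the abstract (attribute histograms, a minimum Gini criterion, and a shared hash table under vertical partitioning) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the site and attribute names, the binary "value <= threshold" split form, and the use of a row-to-node dictionary as the hash table are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def gini(counts):
    """Gini impurity of a mapping from class label to count."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def build_histogram(column, labels):
    """Attribute histogram: attribute value -> Counter of class labels."""
    hist = defaultdict(Counter)
    for value, label in zip(column, labels):
        hist[value][label] += 1
    return hist


def best_threshold(hist):
    """Return (weighted_gini, threshold) for the best 'value <= threshold' split."""
    values = sorted(hist)
    total = Counter()
    for v in values:
        total.update(hist[v])
    n = sum(total.values())
    left = Counter()
    best = (float("inf"), None)
    for v in values[:-1]:  # the largest value would leave the right side empty
        left.update(hist[v])
        right = total - left
        n_left = sum(left.values())
        score = (n_left * gini(left) + (n - n_left) * gini(right)) / n
        if score < best[0]:
            best = (score, v)
    return best


def choose_split(sites, labels):
    """Each site scores only its own attributes; the global minimum Gini wins.

    sites: {site_name: {attribute_name: column}} (a vertical partition, assumed layout).
    """
    candidates = []
    for site, columns in sites.items():
        for attr, column in columns.items():
            score, threshold = best_threshold(build_histogram(column, labels))
            if threshold is not None:
                candidates.append((score, site, attr, threshold))
    return min(candidates, key=lambda c: c[0]) if candidates else None


def update_hash_table(table, column, threshold, node):
    """Record the child node of every row currently assigned to `node`."""
    for row_id, value in enumerate(column):
        if table.get(row_id) == node:
            table[row_id] = f"{node}.L" if value <= threshold else f"{node}.R"
    return table


# Hypothetical toy data: two sites, each holding one attribute of the same six rows.
labels = ["no", "no", "yes", "yes", "yes", "no"]
sites = {
    "site_a": {"age": [22, 25, 47, 52, 46, 30]},
    "site_b": {"income": [30, 80, 40, 90, 35, 85]},
}
score, site, attr, threshold = choose_split(sites, labels)
table = update_hash_table({i: "root" for i in range(6)}, sites[site][attr], threshold, "root")
```

In this sketch only per-attribute class counts and the final row-to-node table cross site boundaries, which is one plausible reading of how the hash table "records and controls" the sub-sites; the paper may synchronize differently.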