Interactive data science at scale

David A. Bader
Proceedings of the 18th ACM International Conference on Computing Frontiers, published 2021-05-11. DOI: 10.1145/3457388.3459985 (https://doi.org/10.1145/3457388.3459985)

Abstract

A real-world challenge in data science is to develop interactive methods for quickly analyzing new data sets that are potentially of massive scale. In this talk, we discuss our development of suffix array and graph algorithms in the context of Arkouda, a NumPy-like replacement for interactive data science on tens of terabytes of data. Many real-world applications in bioinformatics, web information search and analysis, and lossless compression can be abstracted as string analysis. A suffix array is an efficient data structure that supports fast search for arbitrary string patterns. We have integrated the suffix array data structure (including the Longest Common Prefix (LCP) array that enhances it) and the corresponding construction algorithm into Arkouda, thus giving Python users a powerful tool for different types of string analysis. Our novel approach integrates a suffix array algorithm library into Arkouda so that the Arkouda runtime can select the construction algorithm for a large suffix array dynamically, based on the properties of the dataset. Two of the methods implemented in the Arkouda back end are our novel O(n)-time skew algorithm in Chapel and the DivSufSort suffix array construction algorithm in C, which has a higher time complexity but is often faster in practice. Experimental results show that, with Arkouda, Python users can easily build the suffix array and LCP array of a large-scale string in a Jupyter notebook without losing performance compared with running the operation directly on the back end. In future work, we will extend our own algorithm to support multi-locale parallel execution so that it can handle large strings on distributed systems.

Graphs are widely used to abstract problems in domains such as social sciences, biological systems, and information systems.
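To make the suffix array and LCP array discussed above concrete, here is a minimal single-machine sketch in plain Python. It is not Arkouda's API and not the skew or DivSufSort algorithm: the construction simply sorts the suffixes, the LCP array uses Kasai's algorithm, and the function names (`build_suffix_array`, `build_lcp_array`, `find_occurrences`) are hypothetical names chosen for illustration.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Naive construction that sorts the suffixes directly; chosen for clarity,
    not a linear-time algorithm.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text: str, sa: List[int]) -> List[int]:
    """Kasai's algorithm: lcp[r] is the length of the longest common prefix
    of the suffixes at ranks r - 1 and r in the suffix array (lcp[0] = 0)."""
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0  # length of the match carried over from the previous suffix
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix that precedes suffix i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Binary-search the suffix array for every position where `pattern` occurs."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)                                 # [5, 3, 1, 0, 4, 2]
    print(build_lcp_array(text, sa))          # [0, 1, 3, 0, 0, 2]
    print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Once the suffix array is built, each pattern query needs only two binary searches over it, which is why the structure suits repeated interactive searches on the same string.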
To support real-world large graph analysis in Arkouda, we first developed an array-based graph data structure that can be used like an adjacency matrix or incidence matrix but requires much less memory, and that works naturally with Arkouda's array operators. Based on this succinct graph data structure, we have developed two typical graph algorithms, breadth-first search (BFS) and triangle counting. Both have been successfully integrated into Arkouda, and both are multi-locale algorithms, so they can handle very large graphs on distributed systems. Experimental results for BFS on a 32-node cluster show that our method can build large graphs in distributed memory and successfully execute the parallel BFS algorithm on typical sparse graph benchmarks and on graphs produced by the R-MAT generator. The performance results show that the distributed graph building time and the BFS time increase linearly with the total number of edges. In future work, we will further optimize these graph algorithms and investigate streaming versions in Arkouda.

We acknowledge Mike Merrill and Bill Reus, the founding developers of the open-source Arkouda framework. This is joint work with research scientist Dr. Zhihui Du and doctoral student Oliver Alvarado Rodriguez. Bader is supported in part by the National Science Foundation award CCF-2109988.
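The idea of an array-based graph with BFS on top of it can be sketched in plain Python as follows. The layout here is an assumption made for illustration, not necessarily the authors' exact structure: edges are stored in two flat arrays sorted by source vertex, with a per-vertex start index and out-degree. The names `build_edge_arrays` and `bfs_levels` are hypothetical, and the sketch runs on a single machine rather than across locales.

```python
from typing import List, Tuple


def build_edge_arrays(
    num_vertices: int, edges: List[Tuple[int, int]]
) -> Tuple[List[int], List[int], List[int], List[int]]:
    """Pack a directed edge list into four flat arrays.

    src/dst hold the edges sorted by (source, destination); start[v] is the
    index of v's first outgoing edge and degree[v] is its out-degree.
    """
    ordered = sorted(edges)
    src = [u for u, _ in ordered]
    dst = [v for _, v in ordered]
    degree = [0] * num_vertices
    for u in src:
        degree[u] += 1
    start = [0] * num_vertices
    for v in range(1, num_vertices):
        start[v] = start[v - 1] + degree[v - 1]
    return src, dst, start, degree


def bfs_levels(
    num_vertices: int, dst: List[int], start: List[int], degree: List[int], root: int
) -> List[int]:
    """Level-synchronous BFS; returns depth[v], or -1 if v is unreachable."""
    depth = [-1] * num_vertices
    depth[root] = 0
    frontier = [root]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for u in frontier:
            # The neighbors of u occupy one contiguous slice of dst.
            for k in range(start[u], start[u] + degree[u]):
                v = dst[k]
                if depth[v] == -1:
                    depth[v] = level
                    next_frontier.append(v)
        frontier = next_frontier
    return depth


if __name__ == "__main__":
    # Vertex 5 has no edges and is unreachable from the root.
    edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
    src, dst, start, degree = build_edge_arrays(6, edges)
    print(bfs_levels(6, dst, start, degree, root=0))  # [0, 1, 1, 2, 3, -1]
```

With V vertices and E edges, these arrays take O(V + E) space, whereas an adjacency matrix takes O(V^2), which is where the memory saving on sparse graphs comes from. For an undirected graph, each edge would be added in both directions before packing.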