Associative Operator Precedence Parsing: A Method To Increase Data Parsing Parallelism

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region. Pub Date: 2023-02-27. DOI: 10.1145/3578178.3578233

Le Li, K. Taura


Citations: 1

Abstract

Data often arrive in high volume in textual formats (JSON, XML, CSV). Because parsing can easily dominate data analysis time, researchers have been working on parallelizing it. Among candidate parsing methods, Operator Precedence Parsing (OPP) is amenable to parallelization, and a practical algorithm has already been proposed. Its “locally parsable” property allows the parser to deduce from limited context whether a reduction is safe. However, when the grammar has productions that tend to produce a highly skewed parse tree, OPP performs reductions mostly serially, and parsing still suffers from a long critical path. In practice, OPP achieves little or no speedup when parsing data, because data often contain a high percentage of parallel elements (e.g., JSON array elements separated by commas) produced by such productions, a situation that occurs frequently when processing big data. To address this issue and scale textual data parsing, we propose a parsing algorithm that lifts the restriction of deterministic parsing. For an ambiguous grammar, the parser non-deterministically produces a subtree for parallel elements. Such parsers can still produce deterministic semantics when the operator that connects these subtrees is considered associative for data analysis (e.g., map-union). We thus name the algorithm Associative OPP (AOPP): parsing a large sequence of parallel elements can enjoy much parallelism because reductions can happen in any order. We show through textual data parsing that AOPP is of practical use and scales in most cases.
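The role of associativity can be illustrated with a small sketch. This is not the AOPP algorithm from the paper, only a toy model of the idea the abstract describes: a comma-separated sequence is cut into chunks, each chunk is parsed independently into a partial result, and the partial results are joined with an associative operator. The function names (`parse_chunk`, `combine`, `split_at_commas`, `reduce_left`, `reduce_tree`) and the choice of list concatenation as the operator are assumptions made for illustration.

```python
from functools import reduce


def parse_chunk(chunk):
    """Parse one comma-separated chunk of integers into a partial result."""
    return [int(tok) for tok in chunk.split(",") if tok.strip()]


def combine(left, right):
    """Associative operator joining two partial results (list concatenation)."""
    return left + right


def split_at_commas(text, n_chunks):
    """Split text into at most n_chunks pieces, cutting only at commas."""
    items = text.split(",")
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [",".join(items[i:i + size]) for i in range(0, len(items), size)]


def reduce_left(parts):
    """Serial, left-skewed order: ((p0 + p1) + p2) + ..."""
    return reduce(combine, parts, [])


def reduce_tree(parts):
    """Balanced order: neighbours are combined pairwise, round by round."""
    if not parts:
        return []
    while len(parts) > 1:
        parts = [
            combine(parts[i], parts[i + 1]) if i + 1 < len(parts) else parts[i]
            for i in range(0, len(parts), 2)
        ]
    return parts[0]


text = "1,2,3,4,5,6,7"
parts = [parse_chunk(c) for c in split_at_commas(text, 3)]
# Both reduction orders yield the same value because combine is associative.
assert reduce_left(parts) == reduce_tree(parts)
```

In the left-skewed order every combination depends on the previous one, so the critical path grows with the number of chunks. In the balanced order the combinations within a round are independent of each other and could run concurrently, which is the kind of freedom an associative operator provides.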

