Associative Operator Precedence Parsing: A Method To Increase Data Parsing Parallelism
Le Li, K. Taura
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2023-02-27
DOI: https://doi.org/10.1145/3578178.3578233
Citations: 1
Abstract
Data often arrive in high volume in textual formats (JSON, XML, CSV). Because parsing can easily dominate data analysis time, researchers have been working on parallelizing it. Among candidate parsing methods, Operator Precedence Parsing (OPP) is amenable to parallelization, and a practical algorithm has been proposed. Its “locally parsable” property allows the parser to decide from limited context whether a reduction is safe. However, when the grammar has productions that tend to produce a highly skewed parse tree, OPP performs reductions mostly serially, and parsing still suffers from a long critical path. In practice, OPP achieves little or no speedup when parsing data, because data often contain a high percentage of parallel elements (e.g., JSON array elements separated by commas) produced from such productions, a situation that occurs frequently when processing big data. To address this issue and scale textual data parsing, we propose a parsing algorithm that lifts the restriction of deterministic parsing. For an ambiguous grammar, the parser non-deterministically produces a subtree for parallel elements. Such parsers can still produce deterministic semantics when the operator that connects these subtrees is considered associative for data analysis (e.g., map-union). We thus name the algorithm Associative OPP (AOPP), in which parsing a large sequence of parallel elements can enjoy much parallelism because reductions can happen in any order. We show through textual data parsing that AOPP is of practical use and scales in most cases.
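The role of associativity can be illustrated with a minimal Python sketch. This is not the AOPP algorithm from the paper; it only shows, under stated assumptions, why an associative connecting operator makes the reduction order irrelevant. The names `map_union`, `serial_reduce`, and `tree_reduce` are hypothetical, and each parsed element is assumed to have already been turned into a small dict.

```python
from functools import reduce


def map_union(left, right):
    """Right-biased union of two dicts (hypothetical stand-in for map-union).

    The operation is associative: (a | b) | c equals a | (b | c).
    """
    merged = dict(left)
    merged.update(right)
    return merged


def serial_reduce(items):
    """Left-to-right fold: a fully skewed tree, each merge waits for the previous one."""
    return reduce(map_union, items, {})


def tree_reduce(items):
    """Balanced reduction: the two halves are independent of each other,
    so in a parallel setting they could be reduced by separate workers."""
    if not items:
        return {}
    if len(items) == 1:
        return items[0]
    mid = len(items) // 2
    return map_union(tree_reduce(items[:mid]), tree_reduce(items[mid:]))


# Assumed input: eight already-parsed "parallel elements", one dict each.
elements = [{"k%d" % i: i} for i in range(8)]

# Both groupings yield the same map because map_union is associative.
assert serial_reduce(elements) == tree_reduce(elements)
```

The serial fold forms a chain of dependent merges whose length grows linearly with the number of elements, while the balanced version has a dependency depth that grows logarithmically. The sketch runs sequentially; it shows only that the grouping can change without changing the result.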