Plex: Scaling Parallel Lexing with Backtrack-Free Prescanning

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). Pub Date: 2021-05-01. DOI: 10.1109/IPDPS49936.2021.00079

Le Li, Shigeyuki Sato, Qiheng Liu, K. Taura


Citation count: 1

Abstract

Lexical analysis, which converts input text into a list of tokens, plays an important role in many applications, including compilation and data extraction from text. To recognize token patterns, a lexer is built around an automaton, an inherently sequential computation model. As a result, lexing is considered difficult to parallelize because of its inherent data dependency. Much work has been done to accelerate lexical analysis with parallel techniques. Unfortunately, existing approaches rely mainly on language-specific remedies for input segmentation, which makes them hard to extend to new languages and hard to use in automatic lexer generation. This paper presents Plex, an automated tool for generating parallel lexers from user-defined grammars. To overcome the inherent sequentiality, Plex applies a fast prescanning phase that collects context information before scanning. To reduce the overhead of prescanning, Plex adopts a special automaton, derived from that of the scanner, that avoids backtracking, and it exploits data-parallel techniques. An evaluation on several languages shows that the prescanning overhead is small, and consequently Plex scales well, achieving 9.8-11.5X speedups with 18 threads.
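
The abstract does not spell out how the prescanner works, so the following is only a generic illustration of why context information collected ahead of time removes the sequential dependency. It is a minimal sketch of a well-known data-parallel automaton technique (summarize each chunk as a map from every possible start state to its end state, then chain the maps), not Plex's backtrack-free automaton. The three-state automaton and all names (`step`, `summarize`, `chunk_start_states`) are hypothetical and chosen for illustration.

```python
# Hypothetical context automaton (an assumption, not taken from the paper):
# are we in plain code, inside a double-quoted string, or right after a
# backslash inside a string?
CODE, STRING, ESCAPE = 0, 1, 2
STATES = (CODE, STRING, ESCAPE)


def step(state, ch):
    """One transition of the illustrative automaton."""
    if state == CODE:
        return STRING if ch == '"' else CODE
    if state == STRING:
        if ch == "\\":
            return ESCAPE
        return CODE if ch == '"' else STRING
    return STRING  # ESCAPE: the escaped character is consumed


def run(state, text):
    """Run the automaton over text, starting from the given state."""
    for ch in text:
        state = step(state, ch)
    return state


def summarize(chunk):
    """Map every possible start state to the state reached after the chunk.

    This needs no knowledge of earlier input, so each chunk can be
    summarized independently (for example, one chunk per thread).
    """
    return tuple(run(s, chunk) for s in STATES)


def chunk_start_states(text, n_chunks, initial=CODE):
    """Split text into chunks and compute the true start state of each."""
    size = max(1, -(-len(text) // n_chunks))  # ceiling division
    chunks = [text[i:i + size] for i in range(0, len(text), size)]
    summaries = [summarize(c) for c in chunks]  # the parallelizable part

    # Short sequential pass: chain the per-chunk summaries together.
    starts, state = [], initial
    for summary in summaries:
        starts.append(state)
        state = summary[state]
    return chunks, starts
```

Once each chunk's start state is known, the chunks can be scanned independently of one another. In this sketch the `summarize` calls carry no dependencies between chunks, while the final loop touches only one small tuple per chunk, which is why the sequential part stays cheap.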
