Multiple lexicalisation (a Java based study)

E. Scott, A. Johnstone
{"title":"Multiple lexicalisation (a Java based study)","authors":"E. Scott, A. Johnstone","doi":"10.1145/3357766.3359532","DOIUrl":null,"url":null,"abstract":"We consider the possibility of making the lexicalisation phase of compilation more powerful by avoiding the need for the lexer to return a single token string from the input character string. This has the potential to empower language design by softening the boundaries between lexical and phrase level specification. The large number of lexicalisations makes it impractical to parse each one individually, but it is possible to share the parsing of common subparts, reducing the number of tokens parsed from the product of the token numbers associated with the components to their sum. We report total numbers of lexicalisations of example Java strings, and the impact on these numbers of various lexical disambiguation strategies, and we introduce a new generalised parsing technique that can efficiently parse multiple lexicalisations of character string simultaneously. We then use this technique on Java, reporting on the number of lexicalisations that correspond to syntactically correct Java strings and the degree to which the standard Java lexer is safe in the sense that it does not remove all the syntactically correct lexicalisations of an input character string. Our multi-lexer parser is an alternative to scannerless parsing of a character level grammar, retaining the separation between grammar terminals and the corresponding lexical tokens. This has the advantages of allowing the parser to use terminal level lookahead and keeping lexical level disambiguation separate from the context free grammar.","PeriodicalId":354325,"journal":{"name":"Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering","volume":"110 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th ACM SIGPLAN International Conference on Software Language Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3357766.3359532","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

We consider the possibility of making the lexicalisation phase of compilation more powerful by avoiding the need for the lexer to return a single token string from the input character string. This has the potential to empower language design by softening the boundaries between lexical and phrase level specification. The large number of lexicalisations makes it impractical to parse each one individually, but it is possible to share the parsing of common subparts, reducing the number of tokens parsed from the product of the token numbers associated with the components to their sum. We report total numbers of lexicalisations of example Java strings, and the impact of various lexical disambiguation strategies on these numbers, and we introduce a new generalised parsing technique that can efficiently parse multiple lexicalisations of a character string simultaneously. We then use this technique on Java, reporting on the number of lexicalisations that correspond to syntactically correct Java strings and the degree to which the standard Java lexer is safe, in the sense that it does not remove all the syntactically correct lexicalisations of an input character string. Our multi-lexer parser is an alternative to scannerless parsing of a character level grammar, retaining the separation between grammar terminals and the corresponding lexical tokens. This has the advantages of allowing the parser to use terminal level lookahead and keeping lexical level disambiguation separate from the context free grammar.
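The paper itself supplies the counting results and the multi-lexer parsing algorithm; the sketch below is only a minimal illustration of what "multiple lexicalisations" means, not the authors' technique. The class name MultiLex, the method lexicalisations and the tiny token set are hypothetical, chosen so that matches overlap (the classic Java case of ">", ">>" and ">>>"). It naively enumerates every tokenisation of a character string, which is exactly the combinatorial blow-up the paper avoids by sharing common subparts, turning a product of per-component token counts into a sum.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: enumerate every way of splitting a character string
// into tokens when more than one token can match at a given position.
public class MultiLex {
    // A deliberately tiny, overlapping token set (assumed for this sketch),
    // echoing Java's ">", ">>" and ">>>" operators.
    static final String[] TOKENS = { ">>>", ">>", ">" };

    // Return every lexicalisation of `input` as a list of token sequences.
    static List<List<String>> lexicalisations(String input) {
        List<List<String>> results = new ArrayList<>();
        if (input.isEmpty()) {
            results.add(new ArrayList<>()); // one lexicalisation: the empty token string
            return results;
        }
        for (String tok : TOKENS) {
            if (input.startsWith(tok)) {
                // Prepend this token to each lexicalisation of the remainder.
                for (List<String> rest : lexicalisations(input.substring(tok.length()))) {
                    List<String> lex = new ArrayList<>();
                    lex.add(tok);
                    lex.addAll(rest);
                    results.add(lex);
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // ">>>" has four lexicalisations: [>>>], [>>, >], [>, >>], [>, >, >]
        for (List<String> lex : lexicalisations(">>>")) {
            System.out.println(lex);
        }
    }
}

A standard Java lexer would commit to a single one of these token strings; the paper's multi-lexer parser instead keeps the alternatives available to the parser while still working with grammar terminals rather than individual characters.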