Measuring the expressive power of practical regular expressions by classical stacking automata models

IF 1, Zone 4 Computer Science, Q3 COMPUTER SCIENCE, THEORY & METHODS

Information and Computation, Volume 305, Article 105303. Pub Date: 2025-04-29. DOI: 10.1016/j.ic.2025.105303

Taisei Nogami, Tachio Terauchi

Abstract

A rewb is a regular expression extended with a feature called backreference. Backreference is widely known as a practical extension of regular expressions and is supported by most modern regular expression engines, such as those in the standard libraries of Java, Python, and more. Meanwhile, indexed languages are the languages generated by indexed grammars, a formal grammar class proposed by A.V. Aho. We show that the expressive powers of these two models are related in the following way: every language described by a rewb is an indexed language. As the smallest formal grammar class previously known to contain rewbs is the class of context-sensitive languages, our result strictly improves the known upper bound. Moreover, we prove the following four claims: (1) there exists a rewb whose language does not belong to the class of stack languages, which is a proper subclass of indexed languages, (2) the language described by a rewb without a captured reference is in the class of nonerasing stack languages, which is a proper subclass of stack languages, (3) there exists a rewb that describes a stack language but not a nonerasing stack language, and (4) a rewb extended with another practical extension called lookaheads can describe a non-indexed language. Finally, we show that the hierarchy investigated in a prior study, which separates the expressive power of rewbs by the notion of nested levels, is within the class of nonerasing stack languages.
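As a small illustration of what a backreference adds, the sketch below uses Python's standard `re` module, one of the engines the abstract mentions. The pattern `([ab]*)\1` and the helper names `in_copy_language` and `in_copy_language_direct` are illustrative choices made here, not examples taken from the paper. The pattern describes the copy language { ww : w in {a,b}* }, a standard example of a language that is not context-free and therefore cannot be described by a regular expression without backreferences.

```python
import re

# Group 1 captures some word w over {a, b}; the backreference \1 must then
# match exactly the same string again, so a full match has the shape w w.
COPY = re.compile(r"([ab]*)\1")


def in_copy_language(s: str) -> bool:
    """Return True if the whole string s matches the rewb ([ab]*)\\1."""
    return COPY.fullmatch(s) is not None


def in_copy_language_direct(s: str) -> bool:
    """Reference check without regexes: s equals w w for some w over {a, b}."""
    half, remainder = divmod(len(s), 2)
    return remainder == 0 and set(s) <= {"a", "b"} and s[:half] == s[half:]


if __name__ == "__main__":
    for word in ["", "aa", "abab", "abba", "aba", "abcabc"]:
        print(repr(word), in_copy_language(word), in_copy_language_direct(word))
```

The two functions should agree on every input; the second is included only as an independent check of what the pattern is meant to accept.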

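Source journal

Information and Computation (Engineering & Technology, Computer Science: Theory & Methods)

CiteScore: 2.30

Self-citation rate: 0.00%

Articles published: 119

Review time: 140 days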

Journal description: Information and Computation welcomes original papers in all areas of theoretical computer science and computational applications of information theory. Survey articles of exceptional quality will also be considered. Particularly welcome are papers contributing new results in active theoretical areas such as:

- Biological computation and computational biology
- Computational complexity
- Computer theorem-proving
- Concurrency and distributed process theory
- Cryptographic theory
- Database theory
- Decision problems in logic
- Design and analysis of algorithms
- Discrete optimization and mathematical programming
- Inductive inference and learning theory
- Logic & constraint programming
- Program verification & model checking
- Probabilistic & Quantum computation
- Semantics of programming languages
- Symbolic computation, lambda calculus, and rewriting systems
- Types and typechecking