Co-lexicographically Ordering Automata and Regular Languages - Part I

IF 2.5 · Zone 2, Computer Science · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Journal of the ACM Pub Date : 2022-08-09 DOI:10.48550/arXiv.2208.04931

Nicola Cotumaccio, G. D’Agostino, A. Policriti, N. Prezza

Citations: 9

Abstract

Co-lexicographically Ordering Automata and Regular Languages - Part I

The states of a finite-state automaton 𝒩 can be identified with collections of words in the prefix closure of the regular language accepted by 𝒩. But words can be ordered, and among the many possible orders a very natural one is the co-lexicographic order. Such naturalness stems from the fact that it suggests a transfer of the order from words to the automaton’s states. This suggestion is, in fact, concrete, and in a number of articles automata admitting a total co-lexicographic (co-lex for brevity) ordering of states have been proposed and studied. This class of ordered automata, known as Wheeler automata, turned out to require just a constant number of bits per transition to be represented and to enable regular expression matching queries in constant time per matched character. Unfortunately, not all automata can be totally ordered as previously outlined. In the present work, we lay out a new theory showing that all automata can always be partially ordered, and an intrinsic measure of their complexity can be defined and effectively determined, namely, the minimum width p of one of their admissible co-lex partial orders, dubbed here the automaton’s co-lex width. We first show that this new measure captures at once the complexity of several seemingly unrelated hard problems on automata. Any NFA of co-lex width p: (i) has an equivalent powerset DFA whose size is exponential in p rather than (as a classic analysis shows) in the NFA’s size; (ii) can be encoded using just Θ(log p) bits per transition; (iii) admits a linear-space data structure solving regular expression matching queries in time proportional to p² per matched character. Some consequences of this new parameterization of automata are that PSPACE-hard problems such as NFA equivalence are FPT in p, and quadratic lower bounds for the regular expression matching problem do not hold for sufficiently small p.
Having established that the co-lex width of an automaton is a fundamental complexity measure, we proceed by (i) determining its computational complexity and (ii) extending this notion from automata to regular languages by studying their smallest-width accepting NFAs and DFAs. In this work we focus on the deterministic case and prove that a canonical minimum-width DFA accepting a language ℒ, dubbed the Hasse automaton ℋ of ℒ, can be exhibited. ℋ provides, in a precise sense, the best possible way to (partially) order the states of any DFA accepting ℒ, as long as we want to maintain an operational link with the (co-lexicographic) order of ℒ’s prefixes. Finally, we explore the relationship between two conflicting objectives: minimizing the width and minimizing the number of states of a DFA. In this context, we provide an analogue of the Myhill-Nerode Theorem for co-lexicographically ordered regular languages.
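
As a small illustration of the co-lexicographic order on words mentioned in the abstract, the following sketch sorts strings by comparing them from their last character backwards, which is the same as lexicographically comparing the reversed strings. It is not taken from the paper: the function names `colex_key` and `colex_sorted` and the sample words are hypothetical, chosen only for this example, and the alphabet order is assumed to be Python's default character order.

```python
from typing import Iterable, List


def colex_key(word: str) -> str:
    """Sort key for co-lex order: the reversed word (hypothetical helper)."""
    # Comparing reversed words lexicographically compares the originals
    # right to left, which is the co-lexicographic order.
    return word[::-1]


def colex_sorted(words: Iterable[str]) -> List[str]:
    """Return the words sorted co-lexicographically."""
    return sorted(words, key=colex_key)


if __name__ == "__main__":
    sample = ["ab", "b", "aa", "ba", "a", ""]
    print(colex_sorted(sample))  # ['', 'a', 'aa', 'ba', 'b', 'ab']
```

Words ending in the same suffix end up next to each other, which gives some intuition for why this order can be transferred to automaton states, each of which collects the words that reach it.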

Source journal

Journal of the ACM · Engineering & Technology - Computer Science: Theory & Methods

CiteScore

7.50

Self-citation rate

0.00%

Articles published

Review time

3 months

Journal description: The best indicator of the scope of the journal is provided by the areas covered by its Editorial Board. These areas change from time to time, as the field evolves. The following areas are currently covered by a member of the Editorial Board: Algorithms and Combinatorial Optimization; Algorithms and Data Structures; Algorithms, Combinatorial Optimization, and Games; Artificial Intelligence; Complexity Theory; Computational Biology; Computational Geometry; Computer Graphics and Computer Vision; Computer-Aided Verification; Cryptography and Security; Cyber-Physical, Embedded, and Real-Time Systems; Database Systems and Theory; Distributed Computing; Economics and Computation; Information Theory; Logic and Computation; Logic, Algorithms, and Complexity; Machine Learning and Computational Learning Theory; Networking; Parallel Computing and Architecture; Programming Languages; Quantum Computing; Randomized Algorithms and Probabilistic Analysis of Algorithms; Scientific Computing and High Performance Computing; Software Engineering; Web Algorithms and Data Mining