1 TODO

Summarize these highly relevant papers: [11], [1].

https://epsil.github.io/gll/

Combine [11] and [15] to principled context-sensitive stateful parsing-unparsing.

Why did Chomsky & Schützenberger 1959 [5] bother to present "grammars as generators of formal power series"? There are some citing papers. What is the significance?

Another algebra [10]?

Parsing by example [8]? What is the significance?

2 What is the problem with parsing?

Alternative title: Criteria for desirable solutions to the parsing problem.

2.1 Absence of suitable formalism

We need a simple and reliable way to program stateful, context-sensitive parsers; real-world parsing is seldom stateless or context-free.12345

It seems to me that most parsing methods are clever but unprincipled engineering tricks in which mathematical justifications are afterthoughts, if there are any at all. One of the few principled approaches to parsing is deductive parsing, Shieber et al. 1995 [16].

Operator precedence parsing does not fit into context-free grammar formalism?

The problem is "we don’t know how to specify language syntax"6. The problem is to design a language that describes languages. To design a language specification language. To design a meta-language, for describing grammars, with practical operational semantics. This meta-language should handle hard things such as Haskell off-side rule and C/C++ syntax.

2.2 Parsers are not composable

Are composable grammars and parsers possible?7

The Prolog definite-clause grammar (DCG) formalism is almost a dream. It is declarative. It handles context-sensitive grammars. Its only weakness is that it does not handle left recursion. But it seems that every left-recursive grammar can be algorithmically transformed into an equivalent non-left-recursive grammar, so what's the problem? It shouldn't be too hard to implement a left-recursion elimination algorithm for Prolog DCGs.

3 How do we parse?

Grune & Jacobs 2008 book [7] is a comprehensive collection of parsing methods.

3.1 How?

Zaytsev & Bagge 2014 [19] survey

[14]

[2]

Kourzanov 2014 [9] bidirectional parsing

[4]

somewhat unrelated [18]

[12]

Parsing is also called "syntax analysis" (analysis = breakdown, syntax = put together).

Parsing is the act of modifying the state of the parser. This is the operational view.

Parsing is converting a sequence to a tree. This is the data view.

What is the difference between syntax and grammar?

We lex (perform lexical analysis / tokenization) to clean up the grammar (no need to mention whitespaces in the grammar).

Lexing simplifies grammars.

With lexing:

exp ::= exp PLUS exp

Without lexing:

white ::= ...
exp ::= exp white "+" white exp

"Strictly speaking, tokenization may be handled by the parser. The reason why we tend to bother with tokenising in practice is that it makes the parser simpler, and decouples it from the character encoding used for the source code." (Wikibooks:Compiler construction)

Parsing is transforming a list into a tree.

Stand on the shoulders of giants. 2012 timeline of parsing. https://jeffreykegler.github.io/personal/timeline_v3

partial parsing; wrong formatting http://www.vinartus.net/spa/94j.pdf

Deep: "Partial evaluation can turn a general parser into a parser generator." "The Essence of LR Parsing" Sperber_Thiemann_The_essence_of_LR_parsing.pdf

See the forest, not only the trees.

Some parsing techniques:

3.2 Incremental/online parsing

How do IDEs not have to reparse the entire document when the user presses one keystroke?

Incremental parsing is parsing as input becomes available (without waiting for the whole input to become available).

An incremental parser is a relation \(step \subseteq C \times T \times T\).

The idea is to output to all possible continuations? \(incrementalize : (C^* \to T) \to (C^* \to T^*)\)?

3.3 How should we generate parsers and unparsers from grammars?

What we are interested in is how to specify a grammar, and how to derive a parser and an unparser from the grammar specification.

I expect the computer to infer a parser and a pretty-printer from the same grammar. Parser generators only give half of what I want.

I expect the computer to work with non-ambiguous left-recursive grammars.

How should parsing be done? From grammar description, the machine should generate both a parser and a pretty-printer.

Given grammar, generate both parser and unparser/pretty-printer.

3.4 What parsing techniques/formalisms are there?

There are many techniques/formalisms:

  • Prolog definite-clause grammar (DCG) rules
  • Haskell parser combinators
  • continuation-based parsing
  • parser generators

Prolog DCG is interesting because it is often reversible: the same code often gives us both a parser and an unparser.

Logically, a production (a syntax rule) is a predicate (relation) of arity 2. That is, the rule Exp ::= Num Op Num is logically the Horn-clause exp(A,D) :- num(A,B), op(B,C), num(C,D).
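In Prolog DCG notation, that correspondence might look like this (a sketch; num and op are assumed to be defined elsewhere):

exp --> num, op, num.
% The DCG translator expands this into (modulo variable names):
%   exp(A, D) :- num(A, B), op(B, C), num(C, D).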


We can use Scheme continuation for backtracking like Prolog.

3.4.1 Syntax objects?

The application of a rule to an input-list produces a syntax object and a remaining-list. A syntax object contains the name of the rule that produces it, the part of the input that matches it, the input position, and so on. We can make this with SWI-Prolog dicts.

3.4.2 Reversible programming? Bidirectional programming?

Example: If \(T\) is a terminal, then the rule \(N \to T\) is invertible. To parse, remove the prefix matching T from the input list. To unparse, prepend T to the input list.

If the rules \(A\) and \(B\) are invertible, then the concatenation rule \(N \to AB\) is invertible.

Thus we say the relation cons/3 is invertible: cons(H,T,[H|T]).
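A minimal Prolog sketch of that idea, where the "rule" for a terminal T is just append/3 (for a single symbol, cons) relating the input list and the rest:

% terminal(T, S0, S): S0 is T followed by S.
terminal(T, S0, S) :- append(T, S, S0).

% Parsing removes the prefix:   ?- terminal(`if`, `if x`, Rest).   % Rest = ` x`
% Unparsing prepends it:        ?- terminal(`if`, S0, ` x`).       % S0 = `if x`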

We want something similar to Rendel & Ostermann 2010 [15], but in Prolog instead of Haskell.

Given view : D -> V and modv : V -> V, the interpreter should be able to infer modd : D -> D.

modd = through view modv

Boomerang language?

Benjamin C. Pierce 2006 "The Weird World of Bi-Directional Programming"8

Wikipedia9

Janus1011

3.5 How do we relate CST and AST without clutter?

Big problems in parsing: lossless clutterless relation between CST and AST.

3.6 <2018-11-02> Direct left-recursive parsers in Prolog

The key: unify terminals before recursing into nonterminals.

% S is a list of character codes.
binary_operator([0'+]).
binary_operator([0'*]).

digit(C) :- code_type(C, digit).

% Named number_/1 to avoid clashing with the built-in number/1.
number_([C]) :- digit(C).
number_([C|Cs]) :- digit(C), number_(Cs).

expression(S) :-
    number_(S)
;   binary_operator(B),
    append([A, B, C], S),
    expression(A),
    expression(C).
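For example, this query succeeds (a sketch, assuming SWI-Prolog's string_codes/2):

?- string_codes("1+2*3", S), expression(S).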

3.7 Relational parsing; parsing with Prolog

Parsing is turning a list into a tree.

3.7.1 Approaches

3.7.2 Determining the groundness of the length of the lists involved in append/3 and append/2

  1. Why do we care?

    Because we want to write naive parsers that terminate.

  2. What?

    From the source code of SWI-Prolog, with some modifications:

    "Ground" here is an adjective, not a noun. A term is ground iff it has no variables. A term is non-ground otherwise.

    We say that a list is length-ground iff its length is ground, and length-unground otherwise. The elements don't have to be ground.

    • The empty list is length-ground.
    • A list [_|T] is length-ground iff T is length-ground.
    • If a variable gets unified with a length-ground list, then the variable is length-ground.

    To analyze length-groundedness, we "reverse" the program.

    % append(T, L, R)
    append([], L, L).
    append(T, L, R) => append([H|T], L, [H|R]).
    

    (Length-ground = proper list?)

    Now we can infer these about append(T, L, R):

    • If T = [], then L and R have the same length-groundness.
    • The recursive case:
      • Iff T is length-ground, then [H|T] is length-ground.
      • Iff R is length-ground, then [H|R] is length-ground.
    • If we want L to be length-ground, then R has to be length-ground.
    • Thus we can infer that L and R have the same length-groundness regardless of the length-groundness of T.

    If append(A, B, C) succeeds, then:

    • If A = [], then B and C have the same length-groundness.
    • If two of A, B, C are length-ground, then the other one is length-ground?
    • If two of A, B, C are length-unground, then the other one is length-unground?

    What?

  3. How do we generate a long list in Prolog, for testing?

    1. How do we say "A is a list of 100 equal elements" in Prolog?
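    One way (a sketch): build the list with length/2 and fill it with maplist/2.

    ?- length(A, 100), maplist(=(x), A).
    % length/2 makes A a list of 100 fresh variables;
    % maplist(=(x), A) unifies each of them with x.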

3.7.3 Naive approach with recognizer / membership predicate

A recognizer is a unary predicate that takes a list of character codes.

Other possible names for a recognizer are acceptor, determiner, decider, and membership predicate.

Example: The following digit predicate recognizes ASCII decimal digits.
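A possible definition (a sketch, using SWI-Prolog's code_type/2):

digit([C]) :- code_type(C, digit).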

We can build recognizers on other recognizers. For example, here we use digit to define number_:
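One possible definition (a sketch):

number_(S) :- digit(S).
number_([C|Cs]) :- digit([C]), number_(Cs).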

That Prolog knowledge base corresponds to this context-free grammar:

digit ::= <a digit character as defined by Unicode>
number ::= digit | digit number

Exercise:

  • Here you will compare depth-first search and iterative deepening search, and understand search completeness.
  • Try the query number_(S).
  • Try the query length(S,_), number_(S).
  • If you keep pressing semicolon in the first query, will you ever encounter S = [48,49]?
  1. A cool thing: recognizers are generators.

    The predicate number_ can be used not only to recognize strings, but also to generate all such strings.

    To understand how that works, we have to understand Prolog backtracking.

  2. Left recursion thwarts the naive approach.

    Problem: The following expression doesn't terminate.

    The corresponding context-free grammar is left-recursive:

    expression ::= number | expression operator expression
    

    We don't want to sacrifice the elegance of the description.

  3. Can memoization (tabling) help speed up the naive approach?

    No.

  4. Another naive approach that works.

    This one works.

    The key is:

    • Put grounding goals first. A grounding goal is a goal that grounds its variables.
    • Be careful with the pattern g, u, where g generates unground terms and u fails; it may cause an infinite loop when Prolog backtracks, because Prolog keeps generating fresh variables. For example, the goal shown after this list does not terminate.

      • If p may generate infinite choice points, then p, fail doesn't terminate.
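      A minimal instance of that p, fail pattern (a sketch):

      ?- length(L, _), fail.
      % length/2 with both arguments unbound generates L = [], [_], [_,_], ...
      % fail then forces backtracking forever, so the query never terminates.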

3.7.4 Prefix remover / difference-list recognizer / list partitioner

We can turn the naive recognizer digit/1 into a difference-list (list-partitioning) recognizer digit/3.

  • The first parameter is the input string, say Input.
  • The second parameter is the recognized prefix of Input.
  • The third parameter is the unrecognized suffix of Input.

In the following, P stands for Parsed, and U stands for Unparsed.

We can turn the recognizer into:
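One possible definition (a sketch; digit/3 follows the same three-argument convention):

digit([C|U], [C], U) :- code_type(C, digit).

number_(U0, P0, U1) :-
    digit(U0, P0, U1).
number_(U0, P0, U2) :-
    digit(U0, P1, U1),
    number_(U1, P2, U2),
    append(P1, P2, P0).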

The meaning of number_(U0, P0, U1) is:

  • P0 is a number.
  • P0 is a prefix of U0.
  • U0 is the concatenation of P0 and U1.

Observe how we "thread" the state. The calls in the body follow the pattern something(U<n>, P<n>, U<n+1>).

We can translate a recognizer into a difference-list recognizer.

The cool thing is that each parameter works both ways.

  • The query string_codes("123", A), number_(A, A, []) asks Prolog to find out whether "123" parses as a number.
  • The query length(A, _), number_(A, A, []) asks Prolog to find a string that parses as a number. You can keep pressing ; to generate more strings.

3.7.5 Definite clause grammars

  • The DCG clause left --> right desugars/expands/translates into the definite clause left(U0, U1) :- ... where:
    • U0 is the input.
    • U1 is the suffix of U0 that is not recognized by the DCG clause.
    • The string recognized by the clause is the difference between U0 and U1. That string is the P such that U0 = P + U1 where + denotes list concatenation.
  • "Interesting Things about Prolog" https://gist.github.com/CMCDragonkai/89a6c502ca7272e5e7464c0fc8667f4d
    • "Definite clause grammars (DCG) make the difference list pattern into a first class primitive with the --> operator."
  1. Why does this naive DCG fail?

3.7.6 Context-sensitive grammars?

We can add context by adding a parameter: an extra argument on the non-terminal. See the sketch below.
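For example, here is a DCG sketch for the classic non-context-free language \(a^n b^n c^n\); the extra argument N (a Peano numeral, so that the code also runs backwards) is the added context:

s --> count(N, 0'a), count(N, 0'b), count(N, 0'c).

count(0, _)    --> [].
count(s(N), C) --> [C], count(N, C).

% ?- phrase(s, `aabbcc`).   % succeeds
% ?- phrase(s, `aabbc`).    % fails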

3.7.7 Libraries?

3.7.8 Left recursion

Mathematics handles left recursion just fine. Computers should too. We shouldn't chicken out. We shouldn't compromise by contorting our grammar descriptions to work around the limitation.

3.7.9 Precedence parsing?

3.8 Meta-interpreter for left-recursive parsing?

"Parsing with left-recursive grammars" https://www.metalevel.at/acomip/

3.9 What is left-recursion, and how should we handle it?

3.9.1 Should we blame left-recursion on naive operational semantics?

Mathematics has no problem with left-recursion. Why should computers have a problem with it?

3.9.2 Handling left-recursion

Laurent and Mens 2016 [11] (some emphasis ours): "When a parser invokes itself (either directly or indirectly through intermediate parsers) without intervening state changes, the result is an infinite loop of parser invocations. This is a well-known problem of top-down recursive parsers, called left-recursion. Fortunately, it can be mitigated as follows: start by running the left-recursive parser while failing all left-recursive invocations, then re-run it, using the result of the initial parse as the result of all left-recursive invocations."

Avoiding left-recursion means always consuming something before recursing.

3.9.3 Left-recursive parsing

2009 Direct Left-Recursive Parsing Expression Grammars https://www.semanticscholar.org/paper/Direct-Left-Recursive-Parsing-Expressing-Grammars-Tratt/b1e8309db5537fb15f51071fcdc39e139659ed15

2008 Packrat Parsers Can Support Left Recursion

Naive recognizer + memoization

list_not_empty

Consume before recursing?

We can't piggyback Prolog's unification for lambda calculus substitution, because Prolog unifies same-named variables while lambda-calculus shadows same-named variables.

If the recursive call has smaller arguments than the parent call does, then the predicate should terminate.

3.10 Inconclusive

1997 inconclusive discussion "Prolog Parser in Prolog" https://dtai.cs.kuleuven.be/projects/ALP/newsletter/archive_93_96/net/grammars/parser.html

3.11 Parsing

"Parsing in Prolog" http://www.cs.sfu.ca/~cameron/Teaching/383/DCG.html

"Jacc's LR-Parsing with Dynamic Operators" "This part of the Jacc documentation explains the modifications we can make to a basic table-driven LR parser generator à la yacc to accommodate support for Prolog's dynamic operators." http://www.hassan-ait-kaci.net/hlt/doc/hlt/jaccdoc/dynamicLR.html

3.12 Parsing with Brzozowski quotients

3.12.1 What Brzozowski quotients?

Parsing with Brzozowski quotients1213 [13]. Does the general parser community understand that? How do we implement it in Prolog?

Some things I find interesting from [13]:

  • Kleene fixed-point theorem has a practical application.

Equational theories? Now that's a principled parsing.

The result of left-dividing a language \(L\) by a string \(c\) is \( c \backslash L = \SetBuilder{w}{cw \in L} \). [3] [13]

Atoms14

Differentiating Parsers15

Might et al.'s 2011 pearl [13] is not in Grune & Jacobs's 2008 book [7], but the book does mention Brzozowski. Brzozowski's idea goes back to his 1964 paper [3]. But who would have thought of adding laziness and memoization on top of it, and generalizing it to context-free grammars? (Formal definition?)

3.12.2 What?

3.12.3 Digression: How is the Brzozowski derivative a derivative?

Why does Brzozowski 1964 [3] call it a derivative if it is actually a quotient? The article contains no explanation, so here goes our guess.

The Brzozowski derivative of language \(R\) with respect to string \(s\) is written \(D_s R\) and is \(\SetBuilder{t}{st \in R}\) [3]. The \(D_a\) in equation 3.7 in that article would indeed be a differential-algebraic derivation16 (it would satisfy the generalized product rule) if the factor \(\delta(P)\) were \(P\) instead: \[ D_a(PQ) = (D_a P) Q + \delta(P) D_a Q \] But it is a derivative (a "derivation" in spirit), and there are other things called derivatives that do not exactly satisfy the product rule either. It makes sense for regular expressions to have derivatives, because regular expressions can be studied from an abstract-algebra perspective: regular expressions form an algebra as defined in Brzozowski 1964 section 2 (although he does not explicitly mention that the structure is indeed an algebra).

A corollary: what is the integral? Smith & Yau 1972 [17] defines an integral counterpart to Brzozowski derivatives. But is that integral really an anti-derivative?

Another corollary: Under what conditions do quotients form derivations/derivatives?

3.12.4 What again?

The multiplication of two strings \(x\) and \(y\) is the concatenation \(x \cdot y = x y\).

Multiplication is associative: \((xy)z = x(yz)\).

The inverse of a string \(x\) is written \(x^{-1}\). It's hypothetical. It's pure symbolic manipulation. Don't imagine what it looks like. Do care about its properties:

  • We define \(x^{-1} x = \epsilon\).
  • We define \(x x^{-1} = \epsilon\).
  • We define \((x y)^{-1} = y^{-1} x^{-1}\) (note the reversal; concatenation is not commutative).

The left division of a string \(x\) by divisor \(y\) is \(y^{-1} x\).

The right division of a string \(x\) by divisor \(y\) is \(x y^{-1}\).

How do we define quotient and remainder?

The Brzozowski derivative is a quotient17, because it is the result of dividing a language (a set of strings) by a string.

3.12.5 Regular expressions can be extended to context-free expressions by adding a fixed-point expression involving a binder

\( \mu a . b \).
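For example, the context-free expression \( \mu a . \epsilon + x a y \) (where \(x\) and \(y\) are terminals) denotes the least solution of \( A = \Set{\epsilon} \cup \Set{x} A \Set{y} \), namely the non-regular language \( \SetBuilder{x^n y^n}{n \in \Nat} \).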

3.13 Relational parsing

3.13.1 What?

Recall that a relation is a triple that consists of domain, codomain, and pairing.

A grammar \(G\) can be thought of as a relation between the set \(F\) of forms and the set \(M\) of meanings: \(G \subseteq F \times M\).

In computer-language parsing, usually the form set \(F = C^*\) is the set of character strings, and the meaning set \(M\) is the set of syntax tree nodes.

Viewing grammar as relation leads to writing parsers as logic programs, which are almost synonymous with relational programs.

Shieber, Schabes, & Pereira 1995 [16] see parsing as deduction, from a proof-theoretic point of view. The paper presents a proof-theoretic framework that unifies several parsing algorithms (CYK, Earley, etc.). It implies that we can use a theorem prover for parsing. But should we?

The correspondence: one Chomsky production rule corresponds to one Horn clause with two parameters (input and rest/unparsed). P(A,B) means that the rule P matches the prefix of A that B lacks.

A DCG predicate can be thought of as a relation between two strings. \( P \subseteq C^* \times C^* \).

A grammar relation is a relation \(G \subseteq C^* \times T\). The set \(C\) is the alphabet. The set \(C^*\) is the Kleene closure of \(C\). The set \(T\) is the set of syntax trees.

Let \(G\) be a grammar.

We say that a string \(S\) is grammatical with respect to \(G\) iff there exists a tree \(T\) such that \(G(S,T)\). We may omit "with respect to \(G\)" if it is clear from context that there is only one grammar.

We say that the grammar is unambiguous iff the grammar relation is a function.

3.13.2 History of DCG?

DCG evolved from Colmerauer's "metamorphosis grammar"?

3.14 String predicates are not enough

We want the parse tree, not just a yes-no answer.

At the most basic level we have string predicates. Similar concepts include functions from String to Bool, recognizers (in automata theory), acceptors, membership functions, deciders.

It is well known that there is a correspondence between a predicate \( \Phi \) and the set \( \SetBuilder{x}{\Phi(x)} \).

There is a simple algebra of string predicates, because there is the Boolean algebra for propositional logic. If each of \( p \) and \( q \) is a string predicate, then \( \neg p \) is also a string predicate, and \( p \vee q \) is also a string predicate.

3.15 <2019-04-20> The most elegant parsing method so far

So far I think Prolog DCG (definite-clause grammar) is the most elegant, although it is still unsatisfactory because it cannot handle left recursion directly; such left recursion can, however, be transformed manually into right recursion.

Brzozowski derivatives are interesting but I have yet to find an understandable explanation for context-free grammars.

GLL parsing may be promising; it handles left recursion.

4 On the relationship between grammar and parsing

Grammars are the what.

Parsing is the how.

A grammar is a mathematical description. "Formal grammars are mathematical models of syntax."18

A parser is a computer program that implements a grammar.

The arrow in \( A \to BC \) is read "produces" or "rewrites to"; read in the opposite direction, as in bottom-up parsing, \(BC\) "reduces to" \(A\).

4.1 How do we specify formal grammars in English?

We do it like this:

An expression is any of these:

  • a number; or
  • an expression, followed by an operator, followed by an expression.

An example of context sensitivity is the C type/identifier problem: whether "T * x;" declares x as a pointer to T or multiplies T by x depends on whether T was earlier declared (by typedef) as a type name.

How do we translate such a description into a mathematical formalism and then into a computer program?

Examples of formalisms: BNF (Backus–Naur Form), PEG (Parsing Expression Grammar).

4.2 What is parsing a formal language?

A formal language can be thought of as a set of strings.

Around 1550, "to parse" meant "to state the parts of speech in a sentence".19

Parsing is converting string to tree.

A string is a homogeneous sequence.

A tree may be represented by a list of lists.

Parsing is relating strings and trees. Parsing is creating a tree from a string.

What is an alphabet? It may be the set of Unicode character code points. It may be the set of the tokens that a lexical analyzer may produce.

A parser implements a grammar, as a machine implements an algorithm.

A lexer is a degenerate20 parser whose codomain is a list (which is a degenerate tree).

The parser is parallelizable if there exists a relatively fast function \(combine\) such that for all \(x,y \in C^*\): \[ P(xy) = combine(P(x), P(y)) \]

4.3 The other side: Inferring grammar from a parser?

Given a parser, can we derive/infer the grammar it implements?

Is it easier than turning a hamburger back into a cow?

4.4 Can we derive a parser-unparser pair from a grammar?

Bison and Yacc are examples of deriving parsers from restricted grammars.

5 On the relationship between grammar and algebras

Regular expressions form a nice algebra.

Do context-free grammars form an algebra? Can we do it without binders?

The Algebraic Theory of Context-Free Languages http://www-igm.univ-mlv.fr/~berstel/Mps/Travaux/A/1963-7ChomskyAlgebraic.pdf

Algebra and language theory http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.99.2415&rep=rep1&type=pdf

General context-free recognition in less than cubic time21, from22

5.1 Can we extend Brzozowski derivatives to context-sensitive expressions?

A context-free expression is a regular expression plus fixed points.

A context-sensitive rule has a left-hand side that may contain more than one non-terminal. An example of such a rule is \(AB \to AC\).

6 On the possible mathematical foundations for parsing

6.1 Parsing with logic

See also: Shieber 1995 deductive parsing [16].

A parser is a relation between strings and trees.

Such a parser is unambiguous iff it is a function.

An unparser is a relation between trees and strings.

A string can be thought of as a relation between natural numbers and characters.

A tree can be thought of as a "child-of" endorelation of nodes. Thus "\(T(x,y)\)" means "the node \(x\) is a child of the node \(y\)".

A node is a tuple \((t,x,y)\) where \(t\) is the node type, \(x\) the string begin index, and \(y\) the string end index.

A non-terminal corresponds to a binary predicate.

6.2 Parsing with mostly algebra and some calculus

A Typed, Algebraic Approach to Parsing https://www.cl.cam.ac.uk/~nk480/parsing.pdf

Chomsky-style formalism?

Regular.

A character \(c\) means the character itself.

\(A \cdot B\) means sequencing.

\(A + B\) means choice.

\(\mu A. B\), fixed-point?

6.3 Functional foundations?

The Kleene star constructs an infinite set of strings from a finite, non-empty alphabet.

\( A^* = \bigcup_{n \in \Nat} A^n \)

Trees also have constructors, algebraic data type constructors.

The set of parse trees is \( T = \Nat \times A^* \).

An unparser is a function \( u : T \to A^* \).

An unambiguous parser is a function \( p : A^* \to T \).

Stateful parser \( S \times A^* \to S \times T \).

The natural number selects the data constructor.

What is the standard term for a function from String to Bool? Recognizers/acceptors/deciders/membership-functions/string-predicates?

"Is this string in the language?"

7 On left recursion and its solution

7.1 What is the problem of left recursion?

To appreciate the severity of this problem, first be impressed by the elegance of declarative parsing methods such as Prolog DCGs or Haskell parser combinators, and then try to implement a simple arithmetic expression parser, and then be disappointed by infinite recursion.

7.2 What causes left recursion to be a problem for various declarative parsing methods?

Most declarative parsing methods (Prolog DCGs, Haskell monadic parser combinators) cannot handle left recursion because we pick a simple but unsuitable operational semantics for those declarative languages.

The grammar itself prescribes no inherent sequence of parsing. The defect is introduced when the grammar is translated into a parser.

All programming is eventually procedural.

Parsers that associate parsing a sequence of nonterminals with calling a sequence of subroutines will fail for left-recursion or right-recursion or both.

Why can't Prolog DCG handle left recursion? Can we fix it by prescribing a different operational semantics?

7.3 What is the meaning of a left-recursive non-terminal anyway?

Consider the rule \( A \to A a | b \). Its meaning is clear enough: the least language satisfying \( A = A\Set{a} \cup \Set{b} \), namely \( b a^* = \Set{b, ba, baa, \ldots} \). Only the naive operational reading has trouble with it.

7.4 How do we solve it?

There are several choices:

  • Give up, and manually eliminate left recursion.
  • Improve existing formalisms.
  • Find another formalism (or perhaps an operational semantics) that may be totally different but absolutely handles all recursion (left recursion, right recursion, mutual recursion, etc.).

7.5 Alternative operational semantics: thread-spawning/asynchrony instead of subroutine-calling

CPS transformation, then inversion of control / callback.

The semantics of a non-terminal is a stream of matches.

This can also be used as an alternative operational semantics to Prolog so that Prolog can handle left-recursive predicates while sacrificing execution order predictability.

This doesn't work; this would just spawn infinite threads instead of making infinite calls, and computers are finite. This is just exacerbating the resource exhaustion problem by parallelization.

It seems that the core problem is the operational semantics of fixed points.

7.6 Unsatisfactory manual left-recursion elimination

It is possible to manually eliminate left recursion by rewriting all rules of the form \(A \to AB | C\) to \(A \to C B^*\), where \(B\) and \(C\) are expressions that do not begin with \(A\). I have an example in parse_manual.pro.

Can we do better? Not about automating this manual left-recursion elimination, but about finding a formalism that can inherently handle left recursion.

Are we OK with manual transformations? There are not many left-recursive rules in practice.

For the computer, our manual transformation is a perfectly fine solution.

The grammar must not have a nullable left-recursive rule like \( A \to A \) or \( A \to \epsilon^* \). Otherwise a computer running a naive top-down left-to-right parsing algorithm is doomed to an infinite loop. But we can argue that the least \(A\) satisfying \(A \to A\) is the empty language, and that \( \epsilon^* = \epsilon \).

Two problems arise:

  • What about the parse tree? We want a parser, not a matcher.
  • Can it be automated?

Why do we care about left recursion? Grune & Jacobs 2008 sums it up: "Basically almost all parsing is done by top-down search with left-recursion protection"[7].

We are interested in eliminating left recursion from Prolog definite-clause grammars (DCGs).

to-do: summarize:

I got this idea for left-recursion elimination on <2019-02-20>, but this may be well-known.

7.6.1 What is left recursion?

This is a grammar with three left-recursive non-terminals.

\[\begin{align*} A &\to B | C \\ B &\to Ab | b \\ C &\to Bc | c \end{align*} \]

We say that \(A\) left-calls \(B\) iff there is a production of the form \(A \to B\gamma\), that is, iff \(B\) appears leftmost in some right-hand side for \(A\).

A non-terminal \(A\) is left-recursive iff it may derive something beginning with itself. For example, the non-terminal \(A\) below is left-recursive (indirectly, through \(B\)).

\[\begin{align*} A &\to B \\ B &\to \epsilon | AC \end{align*} \]

The left-call graph. Each vertex represents a non-terminal. An edge \((A,B)\) represents that \(A\) left-calls \(B\).

If the left-call graph is cyclic, then a top-down parser may not work.

Left-recursion elimination is about breaking cycles in the left-call graph.

How do we delete the minimum number of edges from a graph to make it acyclic? This is the minimum feedback arc set problem, which is NP-hard.

7.6.2 Semiring of languages

We care about algebra because it guides us to correct algorithms.

A semiring is, roughly, a commutative monoid under addition, a monoid under multiplication, and distributive laws that connect the two.

The alphabet is \(A\). It is a finite set.

The semiring's underlying set is the set of languages over \(A\), that is, the powerset of \(A^*\).

The languages over the same alphabet form a semiring.

0 is the empty set.

1 is \(\Set{\epsilon}\), the language that consists of the empty string only.

Addition is set union.

Multiplication is language concatenation: \(XY = \SetBuilder{xy}{x \in X, y \in Y}\).

7.6.3 Production rule, language endofunction, and least fixed point

We can think of a production rule as a language endofunction. For example, we can think of the rule \(A \to \epsilon | a A\) as the function \(A \mapsto 1 + \Set{a} A\). Then, we can think of the language described by the rule as the least fixed point of the corresponding function, that is, the smallest set such that \(A = 1 + \Set{a} A\); here that least fixed point is \(\Set{a}^*\).

If a rule is non-recursive, then the corresponding language endofunction is a constant function that does not depend on the parameter.

7.6.4 Factoring finite left-recursion

Conjecture: Every finite left-recursive rule can be factored into the form \(A \to AB | C\) such that the rule \(A \to C\) would not be left-recursive.

Example of infinite left recursion: \(A \to Aa\), with no non-recursive alternative. It could only match an infinite string of \(a\)'s; over finite strings, its least fixed point is the empty language.

7.6.5 Left-recursive language

Because every rule can be factored as above, it suffices to consider the least fixed point of the function \( A \mapsto AB + C \).

We obtain the least fixed point by inferring the pattern formed by repeatedly replacing \(A = AB+C\) and manipulating the equation.

\[\begin{align*} A &= AB+C \\ A &= (AB+C)B + C \\ A &= ABB + CB + C \\ A &= (AB+C)BB + CB + C \\ A &= ABBB + CBB + CB + C \\ A &= \ldots + CB^3 + CB^2 + CB^1 + CB^0 \\ A &= \sum_{k\in\Nat} CB^k \\ A &= C \sum_{k\in\Nat} B^k \\ A &= C B^* \end{align*} \]

It turns out that \( lfp(A \mapsto AB + C) = C B^* \).

Because we are not using extended context-free grammar (which would have regular expressions and the Kleene star), we have to introduce an auxiliary non-terminal \(A'\) for representing \(B^*\):

\[\begin{align*} A &= C A' \\ A' &= 1 + BA' \end{align*} \]

Observe that \(A' = B^*\).

\[\begin{align*} A' &= 1 + BA' \\ A' &= 1 + B(1 + BA') \\ A' &= 1 + B(1 + B(1 + BA')) \\ A' &= \sum_{k\in\Nat} B^k \end{align*} \]

7.6.6 Left-recursion elimination algorithm

The algebra leads us to this left-recursion elimination algorithm:

  1. Remove the original rule for the left-recursive non-terminal \(A\) from the grammar.
  2. Factor that original rule into the form \(A \to AB | C\) such that \(A \to C\) would not be left-recursive and would not be empty. If this is impossible, tell the user about the infinite left recursion. Do not add \(A \to AB | C\) to the grammar; this rule is only an intermediate product.
  3. Add these two rules to the grammar: \(A \to C A'\) and \(A' \to \epsilon | B A'\).

We have just eliminated left-recursion in a principled way, in a provably language-preserving way, guided by algebra. Now we understand why it works. If we forget the algorithm, we can always derive it from the algebra.

Example:

Original left-recursive rule:
exp --> num ; "(", exp, ")" ; exp, "*", exp ; exp, "+", exp.

After factoring (A --> ...) into (A --> A, B ; C):
exp --> exp, ("*", exp ; "+", exp) ; (num ; "(", exp, ")").

After replacement:
exp  --> (num ; "(", exp, ")"), exp0.
exp0 --> [] ; ("*", exp ; "+", exp), exp0.

7.6.7 Inlining the auxiliary rule's parse tree

Two grammars describing the same language may produce different parse trees.

Unfortunately left-recursion elimination changes the syntax tree. How do we unchange it?

7.6.8 TODO Prolog implementation

Write a Prolog program to eliminate left recursion from definite-clause grammars.
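A sketch of the direct-recursion case only, under an assumed representation rule(Name, Alternatives) in which each alternative is a list of grammar symbols; eliminate_left_recursion/2 and the _tail suffix are invented for this sketch:

:- use_module(library(apply)).  % partition/4 (autoloaded in SWI-Prolog)

% Factor A --> A,B ; C into A --> C,A1 and A1 --> [] ; B,A1.
eliminate_left_recursion(rule(A, Alts), [rule(A, NewAlts), rule(A1, TailAlts)]) :-
    partition(left_recursive(A), Alts, Rec, NonRec),
    Rec \== [],                      % only rewrite left-recursive rules
    atom_concat(A, '_tail', A1),     % auxiliary non-terminal
    findall(T, (member([A|B], Rec), append(B, [A1], T)), RecTails),
    TailAlts = [[] | RecTails],      % A1 --> epsilon ; B, A1
    findall(N, (member(C, NonRec), append(C, [A1], N)), NewAlts).

left_recursive(A, [A|_]).

% ?- eliminate_left_recursion(
%        rule(exp, [[exp, '*', exp], [exp, '+', exp], [num]]), Rules).
% Rules = [rule(exp, [[num, exp_tail]]),
%          rule(exp_tail, [[], ['*', exp, exp_tail], ['+', exp, exp_tail]])].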

The logical meaning of the Prolog DCG rule \(A(x) \to B_1(x), \ldots, B_n(x)\) is the predicate \(A\) where \(A(x,s_1,s_{n+1}) \leftarrow ( B_1(x,s_1,s_2) \wedge \ldots \wedge B_n(x,s_n,s_{n+1}) )\).

7.6.9 Reverse parsing

parse((A,B),C) iff parse(r((A,B)),r(C)).

where r((A,B)) = r(B),r(A).

Reversing the parser makes it a right-to-left top-down parser. It can now handle left recursion, but it can no longer handle right recursion.

7.7 GLL parsers?

"To handle left-recursive grammars, we need to reimagine the parser combinators."23

8 On the almost-duality of parsing and unparsing

Partial isomorphisms [15] and the biased choice operator [12].

8.1 What is the inverse of parsing?

The inverse of parsing is unparsing (tree linearization).

A reverse of parsing is grammar inference, that is to find a grammar that produces a given set of sentences [7].

Parsing is the treeization (delinearization, deserialization) of a line. Unparsing is the linearization (serialization) of a tree.

Parsing is String -> Maybe Tree. Unparsing is Tree -> String.

Can we make parsing truly one-to-one? String -> Tree. CST = AST. Very rigid syntax. Forbid whitespace freedom.

Another possibility: Inverse of parsing is anti-parsing (generation)? From grammar, generate all possible strings and their syntax trees.

Inverse of analytical grammar is generative grammar?

Parser is syntax analyzer. Analysis is the opposite of synthesis? What is syntax synthesizer?

Inverse of parsing is pretty-printing?

If matching is analogous to subtraction, then what is analogous to multiplication? Generation?

9 The semantics of grammar expressions

Consider the expression X,Y. Declaratively it means X followed by Y. Operationally it means match X then match Y.

10 On the set-theoretic fixed-point semantics of context-free non-terminals

<2019-09-05>

Notation: Let \(X \cdot Y = \SetBuilder{xy}{x \in X, ~ y \in Y}\). We call this the "concatenating product".

Suppose that \(f:Set \to Set\) is a set endofunction defined using only set literals, concatenating products, unions, and recursive references to its argument.

Conjecture: Such \(f\) has a least fixed point, and, whenever that least fixed point is a finite set, this algorithm computes it:

procedure fixed_point(f) {
    var x := empty_set
    loop {
        let y = f(x)
        if x = y { return x }
        else { x := y }
    }
}

Hunch: We can use that to make a parser, because a non-terminal rule can be thought of as a set endofunction like that.

For example, consider the rule \( R : A \to A b | c \). The rule can be thought of as the function \( \hat{R}(\hat{A}) = (\hat{A} \cdot \Set{b}) \cup \Set{c} \). The non-terminal \(A\) can be thought of as representing the set \( \hat{A} \) satisfying the fixed-point equation \( \hat{A} = (\hat{A} \cdot \Set{b}) \cup \Set{c} \).

The non-terminal \(A\) can be thought of as representing the fixed point of the function \(\hat{R}\).

11 What ideas?

11.1 Possible-prefix incremental parsing?

Given a string S, find all rules that may match a string that begins with S.

11.2 Techniques?

In recursive descent parsing24, the program procedures mirror the grammar rules. Backtracking unreads the input (pushes it back onto the front of the input, or simply resets the input position).

"How should I specify a grammar for a parser?"25

Naive parser with memoization.

TODO sample Leiss's book "Language equations"

11.3 <2019-07-31> Begin with an unparser, and invert it?

Idea: Can we write an unparser, and derive a parser? Because unparsing (Tree -> String) is easier than parsing (String -> Tree).

Interesting article with very relevant introduction: [6] mentions [12]:

Matsuda and Wang (2013) attack the problem in a somewhat different way. They show how, starting from an extended pretty-printer, one can use program inversion techniques to automatically derive a parser that satisfies the round-tripping property (assuming that the underlying grammar is unambiguous). A typical pretty-printer does not contain enough information to construct a parser, so they introduce a biased choice operator […]

String = Char*

Node = (Type, String, Node*)

-- How do we invert this homomorphism?
f (Append a b) = mappend (f a) (f b)
-- ???
g Plus = "+"
g Times = "*"

-- How do we invert this function?
f (Exp op a b) = g op <> " " <> f a <> " " <> f b

Inverting a binary function produces a ternary relation.
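The same idea is almost free in Prolog: a DCG denotes a relation between trees and strings, so one description can often be run in both directions. A small sketch (the tree shapes exp/3 and num/1, and the restriction of left operands to single digits to avoid left recursion, are just for this sketch):

expr(exp(Op, num(A), B)) --> digit(A), binop(Op), expr(B).
expr(num(A))             --> digit(A).

binop(plus)  --> [0'+].
binop(times) --> [0'*].

digit(D) --> [C], { digit_code(D, C) }.
digit_code(D, C) :- between(0, 9, D), C is D + 0'0.

% Parse:   ?- phrase(expr(T), `1+2*3`).
%          T = exp(plus, num(1), exp(times, num(2), num(3))).
% Unparse: ?- phrase(expr(exp(plus, num(1), num(2))), S), string_codes(Str, S).
%          Str = "1+2".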

11.4 Idea: incremental update of the set of all currently possible parse trees?

11.5 Procedural parsing?

Parsing machine primitives:

  • Read the next character from the input port.
  • Set the input position.
  • Compare characters.

How do those primitives compose?

11.6 Language-oriented approach

The language-oriented approach to parsing is to make a language for expressing a relation between strings and trees.

The structure of the concrete syntax tree reflects the structure of the grammar production rules.

Example: a regular expression is a DSL for string matching / pattern matching / parsing.

12 Conferences

ACM SIGPLAN SLE http://www.sleconf.org/blog/11-20-2013-parsing-at-sle-2013

13 Politics of parsing

This patent (US patent 6449589, "Elimination of left recursion from context-free grammars")2627 should not exist?

14 Declarative Programming Research Group

14.1 Transitive closure

There are several ways of thinking about transitivity: relation, logic, graph, fixed-point, and limit.

A relation \(R\) is transitive iff \(\forall x \forall y \forall z [(R(x,y) \wedge R(y,z)) \to R(x,z)]\).

The transitive closure of a relation \(R\) is the smallest transitive superrelation of \(R\). Such closure is obtained by adding the fewest number of edges to make \(R\) transitive.

The transitive closure of an arity-2 predicate \(P\) is \(T(P)\) where \(T(P,x,y) = P(x,y) \vee \exists i (P(x,i) \wedge T(P,i,y)) \). The transitive closure of a first-order logic predicate is a first-order logic predicate. But the transitive closure operator \(T\) is a second-order logic predicate. Fagin 1974 proves that transitive closure makes first-order logic more expressive.28

The transitive closure of an arity-2 relation \(R\) is \(R \cup R_2 \cup R_3 \cup \ldots = \bigcup_{k \ge 1} R_k\) where \(R_k = \underbrace{R \circ \ldots \circ R}_k\). If the relation is finite and its domain has \(n\) elements, then \(k\) does not need to go higher than \(n-1\), because the shortest path between two connected vertices in that graph has at most \(n-1\) edges. Thus \(T(R)\) is the smallest set that satisfies the equation \(T(R) = R \cup (T(R) \circ R)\); that is, \(T(R)\) is the least fixed point of the function \(A \mapsto R \cup (A \circ R)\).

We can also think of the transitive closure of \(R\) as the limit \(\lim_{n\to\infty} S_n\), where \(S_1 = R\), \(S_{n+1} = (S_n \circ R) \cup R\), and \((B,C,S) \circ (A,B,R) = (A,C,\SetBuilder{(x,z)}{\exists y (R(x,y) \wedge S(y,z))})\). The limit is just the union of the increasing chain \(S_1 \subseteq S_2 \subseteq \ldots\). We can think of \(S_n\) as the set of paths whose length does not exceed \(n\).

14.2 Transitive closure in Prolog

This naïve Prolog predicate t/2 may not terminate if the graph represented by edge is cyclic. Direct translation of the logical formula does not work.

t(A,B) :- edge(A,B).
t(A,C) :- edge(A,B), t(B,C).

How do we make Prolog smarter so that the above predicate t/2 terminates?
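One well-known answer is tabling (SLG resolution), available in SWI-Prolog, XSB, and YAP. A sketch:

:- table t/2.

t(A, B) :- edge(A, B).
t(A, C) :- edge(A, B), t(B, C).

% With the table/1 directive, t/2 terminates even on cyclic graphs,
% e.g. with the facts edge(a,b). edge(b,a). edge(b,c).
% ?- t(a, X).   % yields X = a, X = b, and X = c (in some order), then stops.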

15 Bibliography

[1] Afroozeh, A. and Izmaylova, A. 2015. One parser to rule them all. 2015 acm international symposium on new ideas, new paradigms, and reflections on programming and software (onward!) (2015), 151–170. url: <https://2015.onward-conference.org/details/onward2015-papers/11/One-Parser-to-Rule-Them-All>.

[2] Alimarine, A. et al. 2005. There and back again: Arrows for invertible programming. Proceedings of the 2005 acm sigplan workshop on haskell (2005), 86–97. url: <https://www.cs.ru.nl/~marko/research/pubs/2005/bi-arrows.pdf>.

[3] Brzozowski, J.A. 1964. Derivatives of regular expressions. Journal of the ACM. 11, (1964), 481–494. url: <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.98.4378>.

[4] Caballero, R. and López-Fraguas, F.J. 1999. A functional-logic perspective of parsing. International symposium on functional and logic programming (1999), 85–99. url: <https://pdfs.semanticscholar.org/a87f/56ad2113060650afb230aa9182cbe845041e.pdf>.

[5] Chomsky, N. and Schützenberger, M.P. 1959. The algebraic theory of context-free languages. Studies in logic and the foundations of mathematics. Elsevier. 118–161. url: <http://www-igm.univ-mlv.fr/~berstel/Mps/Travaux/A/1963-7ChomskyAlgebraic.pdf>.

[6] Danielsson, N.A. 2013. Correct-by-construction pretty-printing. Proceedings of the 2013 acm sigplan workshop on dependently-typed programming (2013), 1–12. url: <http://www.cse.chalmers.se/~nad/publications/danielsson-correct-pretty.pdf>.

[7] Grune, D. and Jacobs, C.J. 2008. Parsing techniques: A practical guide. (2008).

[8] Kobel, M. et al. 2005. Parsing by example. Institut fur Informatik und angewandte Mathematik. (2005). url: <https://www.inf.usi.ch/faculty/lanza/Downloads/Kobe05a.pdf>.

[9] Kourzanov, P. 2014. Bidirectional parsing a functional / logic perspective. (2014). url: <https://ifl2014.github.io/submissions/ifl2014_submission_18.pdf>.

[10] Krishnaswami, N.R. and Yallop, J. 2019. A typed, algebraic approach to parsing. Proceedings of the 40th acm sigplan conference on programming language design and implementation (2019), 379–393. url: <https://www.cl.cam.ac.uk/~nk480/parsing.pdf>.

[11] Laurent, N. and Mens, K. 2016. Taming context-sensitive languages with principled stateful parsing. Proceedings of the 2016 acm sigplan international conference on software language engineering (2016), 15–27. url: <http://arxiv.org/abs/1609.05365>.

[12] Matsuda, K. and Wang, M. 2013. FliPpr: A prettier invertible printing system. ESOP (2013). url: <http://www2.sf.ecei.tohoku.ac.jp/~kztk/papers/kztk_esop2013.pdf>.

[13] Might, M. et al. 2011. Parsing with derivatives: A functional pearl. ACM sigplan notices (2011), 189–195. url: <http://matt.might.net/papers/might2011derivatives.pdf>.

[14] Mu, S.-C. et al. 2004. An injective language for reversible computation. MPC (2004). url: <http://takeichi.ipl-lab.org/~scm/pub/reversible.pdf>.

[15] Rendel, T. and Ostermann, K. 2010. Invertible syntax descriptions: Unifying parsing and pretty printing. ACM sigplan notices (2010), 1–12. url: <http://www.informatik.uni-marburg.de/~rendel/unparse/rendel10invertible.pdf>.

[16] Shieber, S.M. et al. 1995. Principles and implementation of deductive parsing. The Journal of logic programming. 24, 1-2 (1995), 3–36. url: <https://www.eecs.harvard.edu/shieber/Biblio/Papers/infer.pdf>.

[17] Smith, L.W. and Yau, S.S. 1972. Generation of regular expressions for automata by the integral of regular expressions. The Computer Journal. 15, 3 (Aug. 1972), 222–228. url: <https://academic.oup.com/comjnl/article/15/3/222/480588>.

[18] Tan, G. and Morrisett, J.G. 2016. Bidirectional grammars for machine-code decoding and encoding. Journal of Automated Reasoning. 60, (2016), 257–277. url: <http://www.cse.psu.edu/~gxt29/papers/bigrammar_jar.pdf>.

[19] Zaytsev, V. and Bagge, A.H. 2014. Parsing in a broad sense. International conference on model driven engineering languages and systems (2014), 50–67. url: <https://bora.uib.no/bitstream/handle/1956/8938/zaytsev-bagge-models14-parsing.pdf>.


  1. http://trevorjim.com/parsing-not-solved/

  2. https://tratt.net/laurie/blog/entries/parsing_the_solved_problem_that_isnt.html

  3. https://news.ycombinator.com/item?id=8505382

  4. http://lambda-the-ultimate.org/node/4489

  5. https://news.ycombinator.com/item?id=2327313

  6. http://trevorjim.com/parsing-not-solved/

  7. https://tratt.net/laurie/blog/entries/parsing_the_solved_problem_that_isnt.html

  8. https://www.cis.upenn.edu/~bcpierce/papers/lenses-etapsslides.pdf

  9. https://en.wikipedia.org/wiki/Bidirectional_transformation

  10. https://topps.diku.dk/pirc/?id=janus

  11. https://en.wikipedia.org/wiki/Janus_(time-reversible_computing_programming_language)

  12. http://matt.might.net/articles/parsing-with-derivatives/

  13. https://github.com/webyrd/relational-parsing-with-derivatives/blob/master/README.md

  14. https://blog.github.com/2018-10-31-atoms-new-parsing-system/

  15. http://lambda-the-ultimate.org/node/3704

  16. https://en.wikipedia.org/wiki/Derivation_(differential_algebra)

  17. https://en.wikipedia.org/wiki/Quotient_of_a_formal_language

  18. <2019-08-18> http://users.math-cs.spbu.ru/~okhotin/

  19. https://www.etymonline.com/word/parse

  20. https://en.wikipedia.org/wiki/Degeneracy_(mathematics)

  21. https://www.sciencedirect.com/science/article/pii/S0022000075800468

  22. https://math.stackexchange.com/questions/871151/what-type-of-algebra-are-valiants-algebras-for-context-free-grammars

  23. https://epsil.github.io/gll/

  24. https://en.wikipedia.org/wiki/Recursive_descent_parser

  25. https://softwareengineering.stackexchange.com/questions/107266/how-should-i-specify-a-grammar-for-a-parser

  26. http://www.freepatentsonline.com/6449589.html

  27. https://patents.google.com/patent/US6449589B1/

  28. https://en.wikipedia.org/wiki/Transitive_closure#In_logic_and_computational_complexity