Recursive Descent Parsing

 First Posted October 17, 2012

               Recursive Descent Parsing is a relatively easy to implement parsing strategy for making parsers of simple to moderate complexity.  It’s main limitation is the fact that generally these parsers are hand crafted.  There are parser generators that make recursive descent parsers, most notably ANTLR (PCCTS), and if you are really serious about getting a parser done, this is the way to go.  However, if you have masochistic tendencies, hand crafting one may be for you. 

                I am no parsing genius.  I have pretty much learned everything I know  by doing a lot of reading, and when that doesn’t make sense, trial and error is used.  So, what I am going to do here is show you what I have learned along the way that will hopefully let you write your own parsers.

                Now, one of the big hurdles is trying to learn BNF, which is the language that parsers are specified in.  There are all sorts of dialects of this notation, but they are all basically the same.  Let’s look at an example.

 

                Parse -> list eof

                list -> expr ; list

                        | empty

                expr -> expr + term

                        | expr term

                        | term

                Term -> Term * factor

                        | Term / factor

                …..| factor

                Factor -> ( expr )

                             | NUM

                ………..| ID

                Now, looking at this script, you most likely can follow what is going on and how this would be able to parse a series of expressions.  What you probably don’t know is that this grammar is left recursive.  That is bad for recursive descent parsers.  How is it possible to tell left recursion from a hole in the ground?

A -> A B  | A

In the above rule, A occurs not only of the left side of the rule, but on the right hand side, it also occurs at the left of each rule.

A -> B A | empty

In the above rule, A is not only on the left side, but in the right side, A only occurs to the right in the rules.  This is right recursion, and it is a good thing for recursive descent parsers.

                So, the big problem is, how do we go from left to right recursion.  That is not an easy question.  But, at least I know some rules to make it so.

A -> A B | C

is equivalent to

A-> C A’

A’ -> B A’ | empty

                The two new rules are not left recursive, however, it is not very intuitive that this is a true transformation.

                So, how does this work.  Getting rid of left recursion in the above grammar will turn it into this:

 

Parse -> list eof

List -> expr ; List

Expr-> Terms MoreExpr

MoreExpr -> + Terms MoreExpr

                    | - Terms MoreExpr

                    | empty

Terms -> Factors MoreTerms

MoreTerms -> * Factors MoreTerms

                         | / Factors MoreTerms

                         | empty

Factors -> ( expr )

               | NUM

              | ID

 

How exactly did we get here?

 

Let’s look just at the expr production from way aboe.

 

Expr -> Expr + Term

         | Expr – Term

         | Term

 

So A = Expr

    B = + Term (and also – Term)

   C = Term

  A’ will be MoreExpr

 

So, putting that altogether we get the following.:

 

Expr -> Term MoreExpr

 

MoreExpr -> + Term MoreExpr

                   | - Term MoreExpr

                  | empty

 

                I Hope that is clear as mud.

 

                This is something now that we can actually write some code for.  Example 1 is code for a simple algebraic to reverse polish translator.  You should be able to see the correspondence between the grammar and the C code that implements it.

    Running the program we get the following.  The first line is the input that we are going to parse.

2+4*4/2+7;
2
4
4
2
4 / 2 = 2
4 * 2 = 8
7
8 + 7 = 15
2 + 15 = 17      

    One interesting thing should be noted here, however.  If you were to generate a parser with an LALR(1) parser generator (such as yacc, bison, Anagram or LRstar) you would note that the associativity would be a lot different.  That parse would display left  associativity, however thanks to the transformation, we now have right associativity.  Isn't that a bummer.   

     Now, one of the problems with this code is that it does recurse, a lot!  In fact, every time it hits a semicolon, it will pretty much start a new set of recursions, until it hits an end of file marker.  This of course, does run the risk of running out of stack, eventually.  This will usually make things crash in the most gruesome way.

 

                If we were to look in the Dragon Book (Compilers Principles, Techniques, and Tools by Aho, Sethi and Ullman) around page 75, you will see an example that is similar to this one but is coded differently.

                Hmmm…..

                Now that we know how to translate a grammar with left recursion to one that has right recursion, and then translate that result into C code, let’s look at the problem again.  Let’s just look at the grammar for the Expr production.

 

Expr -> Expr + Term

         | Expr – Term

        | Term

 

Which translates to:

 

Expr -> Term MoreExpr

 

MoreExpr -> + Term MoreExpr

                     | - Term MoreExpr

                     | empty

 

And the C code:

 

void Expr()

{

                Terms();

                MoreExpr();

}

 

void MoreExpr()

{

                switch(lookahead)

                {

                                case '+':

                                                match('+');

                                                Terms();

                                                MoreExpr();

                                                printf("+\n");

                                                break;

                                case '-':

                                                match('-');

                                                Terms();

                                                MoreExpr();

                                                printf("-\n");

                                                break;

                }

}

 

Looking at these two C functions, those with a sharp eye can see that maybe some optimizing may be in order.  We can see that perhaps these two functions might be combined into one.

 

void Expr()

{

                Terms();

                while(1)

                {

                                switch(lookahead)

                                {

                                                case '+':

                                                                match('+');

                                                                Terms();

                                                                printf("+\n");

                                                                break;

                                                case '-':

                                                                match('-');

                                                                Terms();

                                                                printf("-\n");

                                                                break;

                                                default:                //empty rule

                                                                return;

                                                                break;

                                }

                }

}

 

Looking at this version of Expr, you will notice that it looks a lot like the one in the Dragon Book.  So, just to let you know, I am not at all talking about anything new under the sun.

    However, you will notice something that is different.  What happened to MoreExpr()?  That is because it is not really a mearger.  This function is actually based on this production:

    Expr -> Terms ( '+' Terms)*


    Uh....What exactly does this mean?  The parenthesies group the token + and the production Terms together.  The '*' means that we can have zero to whatever number of repetitions of this group.  But, how exactly did we get from here:

    Expr -> Expr + Terms
            -> Terms

to there?

    First off, this type of recursion is called tail recursion.  This is why the last MoreExpr() call disapeared.  It was the tail recursion call, and by it's nature, it can be replaced by a simple loop.  So, we lost the MoreExpr and gained a while.

                We can also see that there is a relation between this function and the LEFT recursive version of it.  It seems that perhaps it is more of a matter to look for rules to convert left recursion to C code than it is to convert left recursion to right recursion.

                It should also be noted that the right recursion is being handled by a loop rather than a recursive function call.  This has the advantage of not using up stack space.  This will lower the risk of a stack overflow.  See Example 2.