Part 02: Tutorial on lex/yacc.

by: Jonathan Engelsma

Download this transcript

Transcript:

[0.079]
hello I'm Jonathan Engel smih this is part two in our tutorial on learning Lex and yak today we're going to look at yak which is also known as bison we're going to talk about what exactly yak is we're going to talk about how it works and then we're going to go ahead and look at some sample code and see how we can use it to build a very simple language processor so in the previous tutorial we looked at the language processing stack and we said we looked at lexical analysis previously in this tutorial we're going to be looking at parsing or syntax analysis and the way we're using Lac here yak here we're also going to look at the semantic analysis and/or code generation so we're actually not going to generate code we're going to do we're going to interpret our input so these two blocks are kind of what we're going to be looking at in the sample code in our example program in a few minutes so yak and Lex are used together so you'll recall Lex was what we used for semantic analysis we could define some regular expressions and write some actions associated with those regular expressions and Lex would take an input source file and break it into a stream of tokens yak which by the way stands for yet another compiler compiler parses the input based on the extremest tokens that it's receiving from Lex and builds a parse tree and lets us do semantic processing in other words lets us execute code which gets at the semantics of whatever the language is that we're trying to implement now bison is simply the Ganu parser parser and it happens to be upwardly compatible with yak but it does a lot more than that in this tutorial we're just going to look at standard yak grammars so to get a better idea of how this works in this figure you can see at the very top we've got the expression a equals B plus C times D and if we through a lexical analyzer that we've generated with Lex assuming we have the right regular expression patterns it's going to give us a stream of tokens back that say something like an identifier then there'll be an equal token another identifier followed by a plus token another identifier followed by the multiplication followed by another identifier it's that stream of tokens and of course the values the exact values associated which of those token types that get fed into the syntax analysis stage and we're going to do syntax analysis using code that is generated by yak so the input that we're going to give yak is a context-free grammar our analyzer then that's generated for us is going to build this parse tree and as it builds it we're going to have an opportunity to hook up some actions that either generate code like we see in this example or as we'll see in the example we're going to demonstrate with simply interprets the meaning of that expression to get a little bit closer to the implementation what we're going to do is write a yak file and we're usually going to give these a dot Y extension so I might have a file called my Lang why that's going to define the grammar of my language and I'm going to feed that through yak and if I use the minus D option yak will generate a header file for me why dot tab dot H the ACK will always generate a C file or C module that contains my parser that recognizes the grammar I gave it and that's going to be in Y tab C now one of the things I can do in my my Lang dot y so in my yak input file is I can define the tokens the identity of my tokens as well as the types of my tokens and all of that information will get stored and wide apt to have that H if I generate it so when I write my Lex file which I place in my Langdell Lex is actually going to the my line that L is going to include the wide tabbed eh and so when Lex processes that that's how it's going to know about the different tokens that the grammar expects and as we've seen in the previous tutorial Lex generates the finite state machine that recognizes our tokens in a file called Lex dot YY dot C now we can take those two C source files and feed them through the new compiler and produce a executable program called my Lang and that is now a language processor that recognizes tokens defined in my lap my Langdell and recognizes sentences or expressions that are defined by the context-free grammar that I've specified in my Lang Y so in other words I can take source code of my language run it through this language processor and produce either the compiled code if I'm doing code generation or perhaps interpreter output if I'm implementing an interpreter so what does the Yak file look like well it looks very very similar to the Lex files that were already we've already looked at the first part of the file is separated by the middle part with a percent % and in the middle part of the file we're going to put our productions or grammar rules and with every production we define we can have an action written in C which can be a single statement scepter manator with a semicolon or if we've got more than one C statement we can put those in a code block and then we have a third part as well that's separated from the middle section with a percent % as usual the first part of our yak file contains the C declarations and just like in Lex we're going to enclose these with a percent left curly and terminate the section with a percent right curly and we can put two things in there our C declarations so different things that the actions are going to need and we can also put in this first part of the file some special yak definitions and some examples of these would be percent start so one of the rules or one of the production in the Yaak file is going to be the root or the first one and we designate that with a percent start and we also can define the different token identities or the token types that are coming back from the lexical analysis we'll use percent token to do that and in yak we could have tokens of different types and so in order to get the lexical analyzer to return tokens of different type we're going to use this percent Union definition and finally to represent the types that tokens can be we'll use percent type and we'll tie these all together and actually show you the syntax and how these are used in our code example in a moment the production's themselves in the middle section represent the context-free grammar of our language the left-hand side of a production of production also known as a non-terminal is going to be separated by a colon and followed by a right-hand side if we have multiple right hand sides following a non-terminal we can actually add those by using the vertical pipe and then finally for every production that we specify we can have actions associated with it so here's this very simple example of some productions and associated actions i have statements so that's my non-terminal colon statement okay statement itself is a non-terminal it's a production that's defined further down here and then the action follows that production so I'm going to simply print out the word statement if I recognize a statement but I'm not done there I have another production so statements also produces and then you'll see on the third line here my vertical pipe statement space statements and yes that's a recursive reference to the actual statements non-terminal that we're defining here if I see that then I have the action statements printed print out the word statements then I define the statement non-terminal as an identifier followed by a plus followed by another identifier and if I see something with that pattern I'm going to print out the word plus and I have another production statement call an identifier minus identifier and in that case I'm going to print F - and of course I could have used the vertical pipe there and save the typing of statement in the last rule now when we're writing the actions the actions in the previous example weren't very useful they just printed out some silly things not very useful at all but yak actually lets us access the values that are associated with the symbols in our rules and it does this with the notation dollar one dollar two and so forth up to dollar n where each of these refers to the token or the nut or the none of the terminal or non-terminal reference on the right-hand side of the rule and we also have dollar dollar which refers to the value of the left-hand or the non terminal side of the rule every one of these symbols has a value associated with it and that includes both tokens as well as the non-terminal references themselves and if we don't do anything with these in our actions the default is that the value on the left is going to be assigned the value of the first token in the production on the right-hand side so here's an example of how I might use these to basically reflect or represent the semantics of a production so I have my production statement : identifier plus identifier and what I mean there is that statement on the left should be assigned the value of the first token added to the third token so I have dollar dollar equals dollar one plus dollar three so I can just use these as variables if you in my action and the rest of the statements basically are valid C code so we simply use those as substitutes in yeah chronic processes processes this and generates code will turn these into valid C for us in the second production here I have statement identify our minus identifier so there I mark down the semantics as dollar dollar equals the first value so percent one minus the third percent three now the third part of my yack input file contains valid C code that supports the language processing often we'll find in here a symbol table implementation something some type of data structure that we use to keep track of the different identifiers that we encounter in the source code I also might have the implementation of functions here that are called by actions associated with the production's in the second part so in the first part I might have the actual headers or the references to my functions the prototypes of my functions if you will and in that third part I might actually put the implementation so let's go ahead and look at a hands-on demo and see how we can use yak along with Lex to create a non-trivial language processing system so I've already written a example program and what we're going to do now is walk through the source code both the Lex and the yak source code but before we look at the source code let's go ahead and run the program so my I've already compiled it and built it so I want to run it and demonstrate it so you understand what it does and then we'll look at how we implemented it so this is a very simple program that is I call it calc and it's a very primitive calculator and it's going to interpret different arithmetic expressions that I give it and so the syntax is as follows I can use variable names and variable names are going to be one character long and they can be either a lowercase or an uppercase letter so I can say something like this I can say a is equal to 1 plus 100 semicolon and that basically has assigned the value of 101 to a and I can confirm that by using the print statement so if I say print a semicolon it prints out the value of a which is 101 and I can use a now that it's bound to a value I can use it in other expressions so I could say B equals a minus 10 semicolon and now if I were to print out B I get 91 just as expected I can also print out expressions directly so I could say something like this I could say print a plus B and I get the value 192 and then finally so I can do addition and subtraction and I can print those values out I can store them in any of my 52 variables and I also have a command called exit if I type exit semicolon the interpreter actually exits so that's the very simple language that I've built and let's start by looking at the YAC input so this is called calc dot y and up here in my see declarations I've got a few things that I'm going to need so first of all I've got this void YY air so this is something that's designed defined elsewhere that my parser is going to need to call whenever there's a syntactic elaire in other words something that doesn't suit the grammar I'm going to use some standard i/o so I've got my standard i/o dot H and I also reference some symbols from standard Lib that H so I have that now this next thing here my array symbols is actually a really primitive symbol table implementation so I have up to 52 different variables lowercase a through Z and uppercase A through Z you'll recall a moment ago I said that variable names are restricted to one care after each and they're going to be alphabetical characters lower or upper so I'm going to represent these symbols I'm going to have basically a table of 52 different integers and that's where I'm going to store the respective values then I wrote two functions the first one is called symbol Val and if you give that a character like a lowercase a or an uppercase a it's going to basically go and look up the value of that variable in the symbol table I've got a second function called update symbol Val and if you give this one a symbol so an A or a B or some character and a value it will basically make sure that the respective entry in the symbol table for that symbol gets the value that I'm passing it so the first function reads a value returns it to me the second one will update the symbol table with that value then I've got some yak definitions the first thing you'll notice here is percent Union and what percent Union lets me do and yak is it lets me specify the different types that my lexical analyzer can return and in this case the different types that my analyzer can return at least the values I care about are going to be integers which I'm going to put in the member num or characters which I'm going to put in the element ID and you'll remember in C a Union actually lets me treat the same storage area with different types and ultimately when we run this through yak it's going to represent this with a C Union so if you never thought you'd see the day where you use the Union and see you've just met that day so the very next rule here is a percent start and that's going to indicate which of the productions that follow in the middle section is going to be my starting rule or my starting production then I've got percent token print and that's just saying I have a token that I'm expecting from my lexical analyzer and I'm going to refer to that token type as print similarly I have one called exit command percent token exit underscore command and when we run this through yak it's going to generate a header file that defines or passes this information as pound defines in this header so our lexical analyzer can actually reference these values as we'll see in a moment now the third token here is called number but you'll notice this kind of funky num in front of it in angle a greater than less than symbols here and that's basically telling us that this token number will get stored in the member num in the Union type similarly I have a token called identify err and whenever I see an identify err it's going to get returned by the lexical analysis in the member variable ID in my Union these last two statements these percent types are basically assigning types to the non terminals that are going to appear on the left hand side of my grammar so a line an expression in a term are all going to map to the type num which we know is an integer and the type the type of an assignment is always going to map to an identifier an ID so that's the first section let's move on and look at the second section of our yak file so these are the actual grammars grammar productions as well as the associated actions so my start rule you'll recall was a line and aligned maps to one two three four five five different six different productions here so a line can be an assignment and that's a non-terminal that we've defined below you can actually see that so assignment is actually an identifier equals expression so a line could be just a single assignment with a semicolon and the action here if I have that is basically nothing I'm not going to do anything now you'll notice down here that when I receive or when I recognize an assignment I actually call the function update symbol Val and I pass it dollar one which is the very first token on the right hand side of the production which is identifier the second value that I pass is dollar three which is going to be the value of the expression which is another num terminal that's defined below so what we're going to be doing is assigning to this identifier the value of this expression that's what this rule does here on an exit command I'm simply going to call the C function exit and then pass exit underscore success and exit is a call in the standard library that's why I had the pone standard live up there and exit success just says everything's working right you know the air program is not returning any kind of error indicator so when I see an exit command I'm going to simply exit the program when I see a print followed by an expression I'm going to print out the value of that expression I know it's an integer so my control string has a percent D and then I reference dollar two which is the value assigned to this non-terminal now I also have some recursive productions here so a line also maps to line and then an assignment okay when I see this once again I'm going to just have an empty statement because I've already done the assignment down here I could have a line print expression and in that case I'm going to print out the third value which is the expression and I could have a line exit command so these three productions here are what let me have more than one statement in a single input file so those are the one these recursive rules are the ones that let us repetitively add statements to our program so as we work our way down the productions we've already looked at assignment let's look at exp or expression expression maps to a term and terms are either going to be numbers so literal integers or identifiers x' variables in this case a single alphabetical character if I recognize a term I'm simply going to assign to the left-hand side so the exp is going to get the value of that term and so the value of that term if it's a number is going to literally be its its face value and remember it was being returned as a number if it's a identifier I'm going to return a the actual value which i look up with my function symbol val through dollar one so I passed dollar one to symbol Vale which is the identifier and it looks up the value in the simple table and assigns it now expressions also map to expressions with a plus term so here's how I do my addition and the semantics aren't on that are the expression gets dollar one which is exp plus dollar three which is the value of this term similarly I have a rule exp - term and this one maps to dollar one - dollar three so those are the simple rules in my language and in the last section I've got some helper functions that I need the implementation of the update symbol Val as well as a symbol Val both of these functions use this utility function that I wrote compute symbol index so if you give me a single alphabetical character this function here is basically going to figure out which of the 52 elements in the symbol array that particular alphabetical character maps to so we basically check if it's lowercase we compute the token or the value of the index is the value of the character - the the base which is the lowercase + 26 and the uppercase ones we just subtract the base and so lowercase a through Z are going to be 0 through 25 and the uppercase or the lower upper case will be 0 through 25 in the lower case are going to map from 26 to 51 and here's my function symbol Val remember this is going to look up the value of the symbol and return it so in this case I can compute my symbol index which is the array the offset in my symbol table array for that particular variable and once I have computed it I simply return the value of it this function here update symbol Val is going to do the same thing compute the index but then it's going to basically assign the val to that entry and here is my main program i'm going to initialize my symbol table to all zeros so i'm going to make the assumption before I see any expressions that all my variables are initialized to zero and then I'm simply going to call YY parse which is the function generated by Y ax so once I call y Y parse it's going to iteratively apply the grammar rules to the input until it either runs out of input or it actually finds a syntax error so that is all I need and I also define my y Y air down there because the parser is going to call that if it gets syntax errors and we're simply going to print out the air in that case but that is the Yak file and if we quickly take a look at the Lexx file that is called kal-el and we've seen these in a previous tutorial it's very simple I include my Y dat tab dot H and this file is what's generated my my pound defines for the different token types so each of those % tokens is defined inside here so if I see the the string print I'm going to return the touken print if i see the term exit i'm going to return the token exit command so print an exit command here we're both defined as percent tokens over in my yak file and that generated code here is going to define those for me now here's where the Union comes into play this is my regular expression for representing my variable names which are going to be a single character a through Z lowercase or A through Z uppercase so identifiers we're going to store in that ID element so now we can say instead of assigning something to Yui Vale we can simply say Yui Vale that ID so it's why my Val now is a union type and I'm going to take the very first character of the token or YY text sub-zero and then the type that I'm going to return so this is this first statement here is storing the value of the token and then the second statement is returning the type and remember it's an identifier when I recognize a digit which is 0 or more characters or digits I'm sorry one or more digits now I'm going to assign that to YY val dot num which was my integer value and I'm going to store two that the textual value converted into an integer and the type I return is a number I ignore my whitespace and if I see a minus a plus a equal or a semicolon I simply return its face value and you'll remember in the grammars we actually reference those as characters in single ticks anything else that I see I'm going to print out an error message and we also have in the third part the wire wrap which we've talked about in the previous tutorial and we won't discuss further so a very simple Lex file that simply leverages the different symbols that I've defined inside the Yaak so now we need to build this so the first thing we want to do is run yak so I'm going to say yak and I'm going to use the minus D option to get it to generate the white th if you don't do this you're going to have problems when you try to compile your final program so the first step is to run yak on our yak input file and this generates two files it generates for us a white tab that H which is our header and a white tee AB dot C which is the actual generated parser the next step now that I have my white @ th is to do is to run Lex on my Lex file and this is going to generate a Lex dot YY dot C which you can see right here and now my last step is to compile both of these together so I'm going to say GCC Lex stop by YC + y dot ABC and I'll put my output in calc and now I have my executable interpreter and if I type an error it basically detects that and so there you can see how we can use Lex and yak to create a language processor without writing a great deal of code and you can imagine with more sophisticated grammars we can write some fairly interesting languages using this



Description:
More from this creator:
This is part 2 in our tutorial on lex / yacc. In this tutorial, we take a look at yacc, and see how it can be used together with lex to create a simple language processing system. You can download the source code for this tutorial at: https://github.com/jengelsma/yacc-tutorial

Disclaimer:
TranscriptionTube is a participant in the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com
Contact:
You may contact the administrative operations team of TranscriptionTube with any inquiries here: Contact
Policy:
You may read and review our privacy policy and terms of conditions here: Policy