2. Use of Lex in syntaxical analysis

The purpose of the lexical analysis is to transform a series of symbols into tokens (a token may be a number, a "+" sign, a reserved word for the language, etc.). Once this transformation is done, the syntaxical analyser will be able to do its job (see below). So, the aim of the lexical analyser is to consum symbols and to pass them back to the syntaxical analyser.

A Lex description file can be divided into three parts, using the following plan:

additionnal code

in which no part is required. However, the first %% is required, in order to mark the separation between the declarations and the productions.

2.1. First part of a Lex file: declarations

This part of a Lex file may contain:



#include "calc.h"

#include <stdio.h>
#include <stdlib.h>


/* Regular expressions */
/* ------------------- */

white		[\t\n ]+
letter		[A-Za-z]
digit10         [0-9]			/* base 10 */
digit16 	[0-9A-Fa-f]		/* base 16 */

identifier	{letter}(_|{letter}|{digit10})*
int10   	{digit10}+

The example by itself is, I hope, easy to understand, but let's have a deeper look into regular expressions.

2.2. Regular expressions

Symbol Meaning
x the "x" character
. any character except \n
[xyz] either x, y or z
[^bz] any character, except b and z
[a-z] any character between a and z
[^a-z] any character, except those between a and z
R* zero R or more; R can be any regular expression
R+ one R or more
R? one or zero R (that is an optional R)
R{2,5} 2 to 5 R
R{2,} 2 R or more
R{2} exactly 2 R
"[xyz\"foo" the string "[xyz"foo"
{NOTION} expansion of NOTION, that as been defined above in the file
\X if X is a "a", "b", "f", "n", "r", "t", or "v", this represents the ANSI-C interpretation of \X
\0 ASCII 0 character
\123 the caracter which ASCII code is 123, in octal
\x2A the caracter which ASCII code is 2A, in hexadecimal
RS R followed by S
R|S R or S
R/S R, only if followed by S
^R R, only at the beginning of a line
R$ R, only at the end of a line
<<EOF>> end of file

So the definition:

identifier  {letter}(_|{letter}|{digit10})*

will match as identifiers the words "integer", "a_variable", "a1", but not "_ident" nor "1variable". Easy, isn't it?

As a last example, this is the definition of a real number:

digit		[0-9]
integer		{digit}+
exponant	[eE][+-]?{integer}
real		{integer}("."{integer})?{exponant}?

2.3. Second part of a Lex file: productions

This part is aimed to instruct Lex what to do in he generated analyser when it will encounter one notion or another one. It may contain:

You should also note that comment such as /* ... */ can be present in the second part of a Lex file only if enclosed between braces, in the action part of the statements. Otherwise, Lex would consider them as regular expressions or actions, which would give errors, or, at least, a weird behaviour.

Finally, the yytext variable used in the actions contains the characters accepted by the regular expression. This is a char table, of length yyleng (ie, char yytext[yyleng]).


[ \t]+$         ;
[ \t]+          printf(" ");

This little Lex file will generate a program that will suppress the space characters that are not useful. You can also notice with that little program that Lex is not reserved to interpreters or compilers, and can be used, for example, for searches and replaces, etc.

2.4. Third part: additional code

You can put in this optional part all the code you want. If you don't put anything here, Lex will consider that it is just:

main() {

2.5. Conclusion about Lex

As you can see it, Lex is quite easy to use (but it can be more complicated if you use start conditions). We have not seen all the possibilities of Lex, and I suggest that you read its man page for a more detailed information.