lex(1)                          User Commands                          lex(1)

NAME
     lex - lexical analysis program generator

SYNOPSIS
     lex [ -cntv ] [ -e | -w ] [ -V -Q [ y | n ] ] [ filename ] ...

DESCRIPTION
     lex generates programs to be used in simple lexical analysis of
     text.

     Each filename (the standard input by default) contains regular
     expressions to search for, and actions written in C to be executed
     when expressions are found.

     A C source program, lex.yy.c, is generated, to be compiled as
     follows:

          cc lex.yy.c -ll

     This program, when run, copies unrecognized portions of the input
     to the output, and executes the associated C action for each
     regular expression that is recognized. The actual string matched
     is left in yytext, an external character array (a wchar_t array
     when the -w option is given).

     Matching is done in order of the strings in the file. The strings
     may contain square brackets to indicate character classes, as in
     [abx-z] to indicate a, b, x, y, and z; and the operators *, + and
     ?, which mean, respectively, any non-negative number, any positive
     number, or either zero or one occurrences of the previous
     character or character class. The dot character (`.') is the class
     of all characters except NEWLINE. Parentheses for grouping and
     vertical bar for alternation are also supported. The notation
     r{d,e} in a rule indicates between d and e instances of regular
     expression r. It has a higher precedence than |, but lower than
     that of *, ?, +, or concatenation. The ^ (caret character) at the
     beginning of an expression permits a successful match only
     immediately after a NEWLINE, and the $ character at the end of an
     expression requires a trailing NEWLINE. The / character in an
     expression indicates trailing context; only the part of the
     expression up to the slash is returned in yytext, although the
     remainder of the expression must follow in the input stream. An
     operator character may be used as an ordinary symbol if it is
     within " symbols or is preceded by `\'.
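     The character-class, repetition, grouping, and alternation
     operators described above can be tried out without running lex.
     The following is a minimal sketch that uses the POSIX regcomp(3C)
     and regexec(3C) interface, not lex itself; lex patterns and POSIX
     extended regular expressions are different notations that happen
     to agree on these simple operators. The helper name
     matches_whole() is hypothetical, and the repeated group is
     parenthesized explicitly because the precedence of {d,e} stated
     above need not be the same in POSIX.

```c
#include <regex.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of lex): return 1 if the whole of `text`
 * matches the POSIX extended regular expression `pattern`, 0 if it does
 * not, and -1 if the pattern is too long or fails to compile.
 */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int rc;

    /* Room for "^(", the pattern, ")$", and the terminating NUL. */
    if (strlen(pattern) + 5 > sizeof anchored)
        return -1;

    /* Anchor the pattern so that only a full-string match succeeds. */
    snprintf(anchored, sizeof anchored, "^(%s)$", pattern);

    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

     For instance, matches_whole("[abx-z]+", "abxyz") succeeds because
     every character is in the class, while matches_whole("ab+c", "ac")
     fails because + requires at least one b.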

     Three subroutines defined as macros are expected: input() to read
     a character; unput(c) to replace a character read; and output(c)
     to place an output character. They are defined in terms of the
     standard streams, but you can override them. For C++ code, input()
     is renamed lex_input(), and output() is renamed lex_output() to
     avoid name conflicts with iostreams. The program generated is
     named yylex(), and the lex library libl.a contains a main() which
     calls it. The action REJECT on the right side of the rule rejects
     this match and executes the next suitable match; the function
     yymore() accumulates additional characters into the same yytext;
     and the function yyless(p) pushes back the portion of the string
     matched beginning at p, which should be between yytext and
     yytext+yyleng. The macros input and output use files yyin and
     yyout to read from and write to, defaulted to stdin and stdout,
     respectively.

     In a lex program, any line beginning with a blank is assumed to
     contain only C text and is copied; if it precedes %% it is copied
     into the external definition area of the lex.yy.c file. All rules
     should follow a %%, as in YACC. Lines preceding %% which begin
     with a nonblank character define the string on the left to be the
     remainder of the line; it can be used later by surrounding it with
     {}. Note: curly brackets do not imply parentheses; only string
     substitution is done.

     The external names generated by lex all begin with the prefix yy
     or YY.

     Certain table sizes for the resulting finite-state machine can be
     set in the definitions section:

          %p n    number of positions is n (default 2000)
          %n n    number of states is n (default 500)
          %e n    number of parse tree nodes is n (default 1000)
          %a n    number of transitions is n (default 3000)

     The use of one or more of the above automatically implies the -v
     option, unless the -n option is used.
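
     The input(), unput(c), and output(c) macros described earlier in
     this section amount to a character reader with a pushback stack,
     plus a character writer. The following is a minimal sketch of that
     idea as plain C functions, written for illustration only; it is
     not the code that lex generates. All lx_ names and the pushback
     limit are hypothetical, lx_in and lx_out stand in for yyin and
     yyout, and this sketch returns EOF at end of input, which is a
     choice made here and not a statement about the lex macros.

```c
#include <stdio.h>

#define LX_PUSHBACK_MAX 64      /* arbitrary limit chosen for this sketch */

static FILE *lx_in;             /* stands in for yyin  */
static FILE *lx_out;            /* stands in for yyout */
static int lx_stack[LX_PUSHBACK_MAX];
static int lx_depth;            /* number of characters currently pushed back */

/* Select the input and output streams and discard any pushed-back input. */
void lx_init(FILE *in, FILE *out)
{
    lx_in = in;
    lx_out = out;
    lx_depth = 0;
}

/* Read one character, preferring the most recently pushed-back one. */
int lx_input(void)
{
    if (lx_depth > 0)
        return lx_stack[--lx_depth];
    return getc(lx_in);
}

/* Push a character back; returns 0 on success, -1 if the stack is full. */
int lx_unput(int c)
{
    if (lx_depth >= LX_PUSHBACK_MAX)
        return -1;
    lx_stack[lx_depth++] = c;
    return 0;
}

/* Write one character to the output stream. */
int lx_output(int c)
{
    return putc(c, lx_out);
}
```

     Because the pushback store is a stack, characters come back in the
     reverse of the order in which they were pushed; a caller that
     wants to restore a string therefore pushes its last character
     first.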

     Programs generated by lex(1) need either the -e or -w option to
     handle input that contains EUC characters from supplementary
     codesets. If neither of these options is specified, yytext is of
     the type char[], and the generated program can handle only ASCII
     characters.

     When the -e option is used, yytext is of the type unsigned char[]
     and yyleng gives the total number of bytes in the matched string.
     With this option, the macros input(), unput(c), and output(c)
     should do a byte-based I/O in the same way as with the regular
     ASCII lex(1). Two more variables are available with the -e option,
     yywtext and yywleng, which behave the same as yytext and yyleng
     would under the -w option.

     When the -w option is used, yytext is of the type wchar_t[] and
     yyleng gives the total number of characters in the matched string.
     If you supply your own input(), unput(c), or output(c) macros with
     this option, they must return or accept EUC characters in the form
     of wide character (wchar_t). This allows a different interface
     between your program and the lex internals, to expedite some
     programs.

     When either the -e or -w option is used, the generated C program
     must be linked with the wide character library libw.a using the
     -lw linker flag.

  Pattern Matching
     When either the -e or -w option is used, patterns used in rules
     can include characters from both primary and supplementary
     codesets. The generated program performs pattern matching
     correctly on an input stream containing EUC characters from
     supplementary codesets. You may use any valid EUC characters in a
     character range [A-Z] as long as A and Z belong to the same
     codeset. "." matches any character from any codeset (except
     NEWLINE).

  International Caveats
     Start condition names must consist solely of ASCII characters.

     The "%T" directive cannot be used when either the -w or -e option
     is used.

     The default main() found in the lex library (libl.a) does not have
     a setlocale(3C) call. Thus, the resulting program would not
     recognize non-ASCII characters correctly. You have to supply your
     own main() in order to have your program handle EUC characters
     correctly. The simplest main() would be:

          #include <locale.h>
          main(){
                  setlocale(LC_ALL, "");
                  yylex();
          }

OPTIONS
     -c        Indicates C actions and is the default.

     -e        Generate a program that can handle EUC characters
               (cannot be used with the -w option). yytext[] is of type
               unsigned char[].

     -n        Opposite of -v; -n is the default.

     -t        Place the result on the standard output instead of in
               file lex.yy.c.

     -v        Print a one-line summary of statistics of the generated
               analyzer.

     -w        Generate a program that can handle EUC characters
               (cannot be used with the -e option). Unlike the -e
               option, yytext[] is of type wchar_t[].

     -V        Print out version information on standard error.

     -Q[y|n]   Print out version information to output file lex.yy.c
               by using -Qy. The -Qn option does not print out version
               information and is the default.

EXAMPLES
     The command line,

          lex lexcommands

     draws lex instructions from the file lexcommands, and places the
     output in lex.yy.c.

     The following example lex program converts uppercase to lowercase,
     removes blanks at the end of lines, and replaces multiple blanks
     by single blanks.

          %%
          [A-Z]   putchar (yytext[0]+'a'-'A');
          [ ]+$   ;
          [ ]+    putchar(' ');

INTERNATIONAL EXAMPLES
     The following is a similar program for the "japanese" locale
     environment:

          %%
          [\x30001221-\x30001273] putwchar (yytext[0]+0x0080);
          [ \x300010a1]+$         ;
          [ \x300010a1]+          putchar(' ');
          %%
          #include <locale.h>
          main(){
                  setlocale(LC_ALL, "");
                  yylex();
          }

     This program converts every hiragana character (whose EUC wide
     character value is between 0x30001221 and 0x30001273) to the
     corresponding katakana character.
     It also recognizes the double-space character (0x300010a1). 0x0080
     is the offset between corresponding hiragana and katakana
     characters when represented in wide characters. Note that the
     hexadecimal escape sequences in this example are not strictly
     needed; the corresponding EUC characters could have been used
     instead.

     This program must be linked with the wide character library libw.a
     by passing the -lw option to the compiler. Compilation and
     execution must be done in an environment where either the LANG or
     LC_CTYPE environment variable is set to japanese. The command line
     for compiling this program would be:

          % lex -w sample.l
          % cc -o sample lex.yy.c -ll -lw

FILES
     lex.yy.c              default output file when -t is not specified
     /usr/ccs/lib/libl.a   lex library
     ncform
     nceucform             C-program prototypes

SEE ALSO
     sed(1), yacc(1), setlocale(3C)

     lex in SunOS 5.3 Programming Utilities

NOTES
     Ratfor is no longer supported as a host language.

     The way to use hexadecimal escape sequences for multibyte
     characters differs from that of the versions of lex in previous
     releases of the SunOS Asian Language Environment, namely JLE, KLE,
     CLE, and HLE. In these versions, a multibyte character was written
     as a sequence of hexadecimal escape sequences, one per byte,
     rather than as one hexadecimal escape sequence representing the
     character's wide character value.

Sun Microsystems          Last change: 5 Mar 1992