Lexical Analysis Program Generator

lex(1)                    User Commands                    lex(1)



NAME
     lex - lexical analysis program generator

SYNOPSIS
     lex [ -cntv ] [ -e | -w ] [ -V -Q [ y | n ] ] [ filename ] ...

DESCRIPTION
     lex generates programs to be used in simple lexical analysis
     of text.  Each filename (the standard input by default) con-
     tains regular expressions to search for, and actions written
     in C to be executed when expressions are found.

     lex generates a C source program, lex.yy.c, which is compiled
     as follows:

          cc lex.yy.c -ll

     This program, when run, copies unrecognized portions of  the
     input  to  the  output, and executes the associated C action
     for each regular expression that is recognized.  The  actual
     string  matched  is  left  in  yytext, an external character
     array (a wchar_t array when the -w option is given).

     Matching is done in order of the strings in the  file.   The
     strings  may  contain  square brackets to indicate character
     classes, as in [abx-z] to indicate a, b, x, y,  and  z;  and
     the  operators  *, +, and ?, which mean, respectively, zero
     or more, one or more, and zero or one occurrences of the
     previous character or character class.  The dot character
     (`.') is the class of all characters except NEWLINE.

     Parentheses for grouping and vertical  bar  for  alternation
     are also supported.  The notation r{d,e} in a rule indicates
     between d and e instances of regular expression r.  It has a
     higher precedence than |, but lower than that of *, ?, +, or
     concatenation.  The ^ (caret character) at the beginning  of
     an  expression  permits  a successful match only immediately
     after a NEWLINE, and the  $  character  at  the  end  of  an
     expression requires a trailing NEWLINE.
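
     The operators described above can be tried out without lex.
     The following is a minimal sketch that assumes a system with
     POSIX regcomp(3), whose extended regular expressions share
     the bracket, *, +, ?, |, grouping, and {d,e} notation.  It
     is not how lex itself matches, and `.' and `$' behave
     differently there.  The function name full_match() is
     hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper: return 1 if the whole of text matches the
   POSIX extended regular expression pattern, 0 if it does not,
   and -1 if the pattern is too long or fails to compile. */
static int full_match(const char *pattern, const char *text)
{
    regex_t re;
    char anchored[256];
    int n, rc;

    assert(pattern != NULL && text != NULL);

    /* Wrap the pattern so that it must cover the entire string. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

     For instance, full_match("[abx-z]+", "abxyz") succeeds,
     while full_match("[abx-z]+", "abc") does not, because c is
     outside the class.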

     The / character in an expression indicates trailing context;
     only  the part of the expression up to the slash is returned
     in yytext, although the remainder  of  the  expression  must
     follow in the input stream.

     An operator character may be used as an ordinary  symbol  if
     it is within double quotes (") or is preceded by `\'.

     Three subroutines defined as macros are  expected:   input()
     to  read  a character; unput(c) to replace a character read;
     and output(c)  to  place  an  output  character.   They  are
     defined  in terms of the standard streams, but you can over-
     ride them.  For C++ code, input()  is  renamed  lex_input(),
     and output() is renamed lex_output() to avoid name conflicts
     with iostreams.  The program generated is named yylex(), and
     the  lex  library  libl.a  contains a main() which calls it.
     The action REJECT on the right side of the rule rejects this
     match  and  executes  the  next suitable match; the function
     yymore() accumulates additional  characters  into  the  same
     yytext;  and  the function yyless(p) pushes back the portion
     of the string  matched  beginning  at  p,  which  should  be
     between yytext and yytext+yyleng.  The macros input() and
     output() use the files yyin and yyout to read from and write
     to; these default to stdin and stdout, respectively.
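
     As an illustration of what a replacement for the character
     input routines might look like, the following sketch reads
     from a string in memory and keeps a small pushback stack.
     The names scan_from(), str_input(), and str_unput() are
     hypothetical, and returning 0 at end of input is this
     sketch's own convention.  How such functions are wired in
     place of input() and unput(c) depends on the lex version in
     use.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 16

static const char *scan_src = "";       /* text still to be read */
static char scan_pushed[PUSHBACK_MAX];  /* characters given back */
static size_t scan_npushed = 0;         /* depth of pushback stack */

/* Hypothetical helper: start reading from the string s. */
static void scan_from(const char *s)
{
    assert(s != NULL);
    scan_src = s;
    scan_npushed = 0;
}

/* Read one character: pushed-back characters first, then the
   string.  Returns 0 at end of input. */
static int str_input(void)
{
    if (scan_npushed > 0)
        return (unsigned char)scan_pushed[--scan_npushed];
    if (*scan_src == '\0')
        return 0;
    return (unsigned char)*scan_src++;
}

/* Give one character back so the next str_input() returns it. */
static void str_unput(int c)
{
    assert(scan_npushed < PUSHBACK_MAX);
    scan_pushed[scan_npushed++] = (char)c;
}
```

     After scan_from("ab"), a call to str_input() yields 'a';
     calling str_unput('a') makes the next str_input() yield 'a'
     again before 'b' is read.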

     In a lex program, any line beginning with a blank is assumed
     to  contain  only C text and is copied; if it precedes %% it
     is copied into the external definition area of the  lex.yy.c
     file.   All  rules  should  follow  a  %%, as in YACC. Lines
     preceding %% which begin with a  nonblank  character  define
     the  string  on the left to be the remainder of the line; it
     can be used later by surrounding it with  {}.   Note:  curly
     brackets  do not imply parentheses; only string substitution
     is done.
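
     The note about curly brackets can be made concrete with a
     sketch of purely textual substitution.  The function
     expand_definition() is hypothetical and is not part of lex.
     It replaces the first {name} in a rule with the definition
     body, adding no parentheses, as described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy rule into out (capacity cap),
   replacing the first occurrence of {name} with body exactly as
   written.  Returns 0 on success, -1 if a buffer is too small. */
static int expand_definition(const char *rule, const char *name,
                             const char *body, char *out, size_t cap)
{
    char ref[64];
    const char *hit;
    int n;

    assert(rule != NULL && name != NULL && body != NULL && out != NULL);

    n = snprintf(ref, sizeof ref, "{%s}", name);
    if (n < 0 || (size_t)n >= sizeof ref)
        return -1;

    hit = strstr(rule, ref);
    if (hit == NULL)
        n = snprintf(out, cap, "%s", rule);      /* nothing to expand */
    else
        n = snprintf(out, cap, "%.*s%s%s",
                     (int)(hit - rule), rule,    /* text before {name} */
                     body,                       /* the definition body */
                     hit + strlen(ref));         /* text after {name} */
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

     With a definition body of ab|cd, the rule {ALT}* expands to
     ab|cd*, in which * applies only to d.  Writing the
     definition as (ab|cd) avoids the surprise.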

     The external names generated by lex all begin with the  pre-
     fix yy or YY.

     Certain table sizes for the resulting  finite-state  machine
     can be set in the definitions section:

          %p n    number of positions is n (default 2000)

          %n n    number of states is n (default 500)

          %e n    number of parse tree nodes is n (default 1000)

          %a n    number of transitions is n (default 3000)

     The use of one or more of the  above  automatically  implies
     the -v option, unless the -n option is used.

     Programs generated by lex(1) need either the -e or -w option
     to handle input that contains EUC characters from supplemen-
     tary codesets.  If neither of these  options  is  specified,
     yytext  is of the type char[], and the generated program can
     handle only ASCII characters.

     When the -e option is used, yytext is of the  type  unsigned
     char[]  and  yyleng  gives  the total number of bytes in the
     matched string.   With  this  option,  the  macros  input(),
     unput(c), and output(c) should perform byte-based I/O in the
     same way as with the regular ASCII lex(1).  Two  more  vari-
     ables are available with the -e option, yywtext and yywleng,
     which behave the same as yytext and yyleng would  under  the
     -w option.

     When the -w option is used, yytext is of the type  wchar_t[]
     and  yyleng  gives  the  total  number  of characters in the
     matched string.  If you supply your own  input(),  unput(c),
     or  output(c)  macros  with this option, they must return or
     accept  EUC  characters  in  the  form  of  wide  characters
     (wchar_t).   This  allows a different interface between your
     program and the lex internals, to expedite some programs.
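
     The difference between a byte count and a character count
     can be seen in the following sketch.  It assumes the
     Japanese EUC layout, in which codeset 1 characters occupy
     two bytes, the single shift 0x8E introduces a two-byte
     sequence, and 0x8F introduces a three-byte sequence; other
     EUC locales may use different widths.  The function name
     eucjp_char_count() is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters in nbytes bytes of
   Japanese EUC text.  A truncated final sequence is counted as
   one character. */
static size_t eucjp_char_count(const unsigned char *s, size_t nbytes)
{
    size_t i = 0, count = 0;

    assert(s != NULL || nbytes == 0);
    while (i < nbytes) {
        size_t width;

        if (s[i] < 0x80)
            width = 1;          /* codeset 0: ASCII */
        else if (s[i] == 0x8E)
            width = 2;          /* codeset 2: SS2 plus one byte */
        else if (s[i] == 0x8F)
            width = 3;          /* codeset 3: SS3 plus two bytes */
        else
            width = 2;          /* codeset 1: two bytes */
        i += width;
        count++;
    }
    return count;
}
```

     For the eight bytes 'a', 0xA4 0xA2, 0x8E 0xB1, 0x8F 0xB0
     0xA1, the function returns 4.  If such a string were matched
     by one rule, the byte count is what yyleng would report
     under -e and the character count is what it would report
     under -w.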

     When either the -e or -w option is  used,  the  generated  C
     program  must  be  linked  with  the  wide character library
     libw.a using the -lw linker flag.

  Pattern Matching
     When either the -e or -w option is used,  patterns  used  in
     rules  can  include characters from both primary and supple-
     mentary codesets.  The generated  program  performs  pattern
     matching correctly on an input stream containing EUC charac-
     ters from supplementary codesets.

     You may use any valid EUC characters in a character range
     [A-Z] as long as A and Z belong to the same codeset.

     "." matches any character from any codeset (except NEWLINE).

  International Caveats
     Start condition names must consist solely of  ASCII  charac-
     ters.

     The "%T" directive cannot be used when either the -w or -e
     option is used.

     The default main() found in the lex  library  (libl.a)  does
     not  have a setlocale(3C) call.  Thus, the resulting program
     would not recognize  non-ASCII  characters  correctly.   You
     have to supply your own main() in order to have your program
     handle EUC characters correctly.  The simplest main()  would
     be:

                  #include <locale.h>
                  int yylex(void);        /* generated by lex */
                  int main(){
                          setlocale(LC_ALL, "");
                          yylex();
                          return 0;
                  }

OPTIONS
     -c    Indicates C actions and is the default.

     -e    Generate a program  that  can  handle  EUC  characters
          (cannot be used with the -w option).
          yytext[] is of type unsigned char[].

     -n    Suppress the statistics summary (the opposite of  -v).
          This is the default.

     -t    Place the result on the standard output instead of  in
          file lex.yy.c.

     -v    Print a one-line summary of  statistics  of  the  gen-
          erated analyzer.

     -w    Generate a program  that  can  handle  EUC  characters
          (cannot be used with the -e option).
          Unlike the -e option, yytext[] is of type wchar_t[].

     -V    Print out version information on standard error.

     -Q[y|n]
          With -Qy, write version information to the output  file
          lex.yy.c.   With  -Qn, which is the default, no version
          information is written.

EXAMPLES
     The command line,
          lex lexcommands
     draws lex instructions from the file lexcommands, and places
     the output in lex.yy.c.

     The following example lex program converts uppercase letters
     to  lowercase,  removes  blanks  at  the  end  of lines, and
     replaces multiple blanks by single blanks.


          %%
          [A-Z]   putchar (yytext[0]+'a'-'A');
          [ ]+$   ;
          [ ]+    putchar(' ');
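
     To make the effect of the three rules easier to follow, the
     following is a hand-written C sketch that applies the same
     transformation to a string.  It is not the code lex
     generates, and the function name squeeze_blanks() is
     hypothetical.  As with `$' in the rule above, a run of
     blanks is dropped only when a NEWLINE follows it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: rewrite src into dst (capacity cap,
   including the terminating NUL).  Output that does not fit is
   silently dropped.  Returns the number of characters stored. */
static size_t squeeze_blanks(const char *src, char *dst, size_t cap)
{
    size_t n = 0;

    assert(src != NULL && dst != NULL && cap > 0);
    while (*src != '\0') {
        char c;

        if (*src == ' ') {
            src += strspn(src, " ");    /* [ ]+  : take the whole run */
            if (*src == '\n')
                continue;               /* [ ]+$ : drop trailing blanks */
            c = ' ';                    /* [ ]+  : one blank per run */
        } else if (*src >= 'A' && *src <= 'Z') {
            c = (char)(*src + 'a' - 'A');   /* [A-Z] : to lowercase */
            src++;
        } else {
            c = *src++;                 /* unmatched input is copied */
        }
        if (n + 1 < cap)
            dst[n++] = c;
    }
    dst[n] = '\0';
    return n;
}
```

     Given the input "Hello   World  " followed by a NEWLINE, the
     sketch stores "hello world" followed by the NEWLINE.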

  International Examples
     The following is a similar program for the "japanese" locale
     environment:


          %%
          [\x30001221-\x30001273] putwchar (yytext[0]+0x0080);
          [ \x300010a1]+$         ;
          [ \x300010a1]+          putchar(' ');
          %%
          #include <locale.h>
          int main(){
                  setlocale(LC_ALL, "");
                  yylex();
                  return 0;
          }

     This program converts every hiragana character (those  whose
     EUC  wide  character  values  are  between  0x30001221   and
     0x30001273) to the  corresponding  katakana  character.   It
     also recognizes the double-space character (0x300010a1).
     0x0080 is the offset between  corresponding  hiragana  and
     katakana characters when represented as  wide  characters.
     Note that the hexadecimal escape sequences in this example
     are not strictly needed; the corresponding  EUC  characters
     could have been used instead.
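
     The 0x0080 offset can be checked with a short sketch.  It
     rests on an assumption that this page does not state: that
     the wide character value of a two-byte EUC character is the
     tag 0x30000000 combined with the low seven bits of each
     byte, the first shifted left by seven.  That layout
     reproduces the values quoted above for Japanese EUC, in
     which hiragana begin at 0xA4 0xA1 and katakana at 0xA5 0xA1.
     Both function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: pack a two-byte EUC codeset 1 character
   into a wide character value, under the layout assumed above. */
static unsigned long euc2_to_wide(unsigned char b1, unsigned char b2)
{
    assert(b1 >= 0xA1 && b2 >= 0xA1);
    return 0x30000000UL
         | ((unsigned long)(b1 & 0x7F) << 7)
         | (unsigned long)(b2 & 0x7F);
}

/* Map a hiragana wide character to the corresponding katakana,
   as the first rule of the example does; other values are
   returned unchanged. */
static unsigned long hiragana_to_katakana(unsigned long wc)
{
    if (wc >= 0x30001221UL && wc <= 0x30001273UL)
        return wc + 0x0080UL;
    return wc;
}
```

     Under this assumption euc2_to_wide(0xA4, 0xA1) is 0x30001221
     and euc2_to_wide(0xA1, 0xA1) is 0x300010a1, matching the
     values used in the rules.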

     This program must be processed by lex with the -w option and
     linked  with  the  wide  character  library  libw.a   (-lw).
     Compilation and execution must be done  in  an  environment
     where either the LANG or LC_CTYPE environment  variable  is
     set  to  japanese.   The  command  lines  for building this
     program would be:

          % lex -w sample.l
          % cc -o sample lex.yy.c -ll -lw

FILES
     lex.yy.c            default output file when -t is not
                         specified
     /usr/ccs/lib/libl.a lex library
     ncform
     nceucform           C-program prototypes

SEE ALSO
     sed(1), yacc(1), setlocale(3C)

     lex in SunOS 5.3 Programming Utilities

NOTES
     Ratfor is no longer supported as a host language.

     The way to use hexadecimal escape  sequences  for  multibyte
     characters differs from that of  the  versions  of  lex  in
     previous releases of the SunOS Asian Language Environments,
     namely JLE, KLE, CLE, and  HLE.   In  those  versions,  a
     multibyte character was written as a sequence of hexadecimal
     escape  sequences,  one  per  byte,  rather  than   as   one
     hexadecimal  escape  sequence representing  the  character's
     wide character value.

Sun Microsystems     Last change: 5 Mar 1992                    5