ABSTRACT
Kees Bot
Edwin Scheffer
Vrije Universiteit
Amsterdam, The Netherlands
This document describes the implementation of an Occam to EM compiler. The lexical analysis is done using Lex. For the semantic analysis the extended LL(1) parser generator LLgen is used. To handle the Occam-specific features as channels and parallelism some library routines are required.
Occam [1] is a programming language which is based on the concepts of concurrency and communication. These concepts enable today’s applications of microprocessors and computers to be implemented more effectively.
An Occam program consists of a (dynamically determined) number of processes communicating through channels. To communicate with the outside world some predefined channels are needed. A channel has only one writer and one reader; it carries machine words and bytes, at the reader/writer’s discretion. The process with its communication in Occam replaces the procedure with parameters in other languages (there are no procedures in Occam).
In addition to the normal assignment statement, Occam has two more information-transfer statements, the input and the output:
chan1 ? x -- reads a value from chan1 into x chan2 ! x -- writes the value of x onto chan2
Both the outputting and the inputting processes wait until the other is there. Channels are declared and given names.
2
Arrays of channels are possible.
Processes come in 5 varieties: sequential, parallel, alternative, conditional and repetitive. A process starts with a reserved word telling its nature, followed by an indented list of other processes. (Indentation is used to indicate block structure.) It may be preceded by declarations. The processes in a sequential/parallel process are executed sequentially/in parallel. The processes in an alternative process have guards based on the availability of input; the first to be ready is executed (this is waiting for multiple input). The conditional and repetitive processes are normal IFs and WHILEs.
Producer-consumer example:
CHAN buffer: -- declares the channel buffer PAR WHILE TRUE -- the producer VAR x: -- a local variable SEQ produce(x) -- in some way buffer ! x -- and send it WHILE TRUE -- the consumer VAR x: SEQ buffer ? x -- get a value consume(x) -- in some way
3
Processes can be replicated from a given template; this combines with arrays of variables and/or channels.
Example: 20 window-sorters in series:
CHAN s[20]: -- 20 channels PAR i = [ 0 FOR 19 ] -- 19 processes WHILE TRUE VAR v1, v2: SEQ s[i] ? v1; v2 -- wait for 2 variables from s[i] IF v1 <= v2 -- ok s[i+1] ! v1; v2 v1 > v2 -- reorder s[i+1] ! v2; v1
A process may wait for a condition, which must include a comparison with NOW, the present clock value.
Processes may be distributed over several processors; all processes under a VAR declaration must run on the same processor. Concurrency can be improved by avoiding VAR declarations, and replacing them by CHAN declarations. Processes can be allocated explicitly on named processors and channels can be connected to physical ports.
The compiler is written in C using LLgen and Lex and compiles Occam programs to EM code, using the procedural interface as defined for EM. In the following sub-sections we describe the LLgen parser generator and the aspect of indentation.
LLgen accepts a Context Free syntax extended with the operators ‘*’, ‘?’ and ‘+’ that have effects similar to those in regular expressions. The ‘*’ is the closure set operator without an upperbound; ‘+’ is the positive closure operator without an upperbound; ‘?’ is the optional operator; ‘[’ and ‘]’ can be used for grouping. For example, a comma-separated list of expressions can be
4
described as:
expression_list: expression [ ’,’ expression ]* ;
Alternatives must be separated by ‘|’. C code (‘‘actions’’) can be inserted at all points between the colon and the semicolon. Variables global to the complete rule can be declared just in front of the colon enclosed in the brackets ‘{’ and ‘}’. All other declarations are local to their actions. Nonterminals can have parameters to pass information. A more mature version of the above example would be:
expression_list(expr *e;) { expr e1, e2; } : expression(&e1) [ ’,’ expression(&e2) { e1=append(e1, e2); } ]* { *e=e1; } ;
As LLgen generates a recursive-descent parser with no backtrack, it must at all times be able to determine what to do, based on the current input symbol. Unfortunately, this cannot be done for all grammars. Two kinds of conflicts are possible, viz. the alternation and repetition conflict. An alternation confict arises if two sides of an alternation can start with the same symbol. E.g.
5
plus: ’+’ | ’+’ ;
The parser doesn’t know which ‘+’ to choose (neither do we). Such a conflict can be resolved by putting an if-condition in front of the first conflicting production. It consists of a ‘‘%if’’ followed by a C-expression between parentheses. If a conflict occurs (and only if it does) the C-expression is evaluated and parsing continues along this path if non-zero. Example:
plus: %if (some_plusses_are_more_equal_than_others()) ’+’ | ’+’ ;
A repetition conflict arises when the parser cannot decide whether ‘‘productionrule’’ in e.g. ‘‘[ productionrule ]*’’ must be chosen once more, or that it should continue. This kind of conflicts can be resolved by putting a while-condition right after the opening parentheses. It consists of a ‘‘%while’’ followed by a C-expression between parentheses. As an example, we can look at the comma-expression in C. The comma may only be used for the comma-expression if the total expression is not part of another comma-separated list:
comma_expression: sub_expression [ %while (not_part_of_comma_separated_list()) ’,’ sub_expression ]*
6
; |
Again, the ‘‘%while’’ is only used in case of a conflict.
Error recovery is done almost completely automatically. All the LLgen-user has to do is write a routine called LLmessage to give the necessary error messages and supply information about terminals found missing.
The way conflicts can be resolved are of great use to Occam. The use of indentation, to group statements, leads to many conflicts because the spaces used for indentation are just token separators to the lexical analyzer, i.e. ‘‘white space’’. The lexical analyzer can be instructed to generate ‘BEGIN’ and ‘END’ tokens at each indentation change, but that leads to great difficulties as expressions may occupy several lines, thus leading to indentation changes at the strangest moments. So we decided to resolve the conflicts by looking at the indentation ourselves. The lexical analyzer puts the current indentation level in the global variable ind for use by the parser. The best example is the SEQ construct, which exists in two flavors, one with a replicator and one process:
seq i = [ 1 for str[byte 0] ] out ! str[byte i]
and one without a replicator and several processes:
seq in ? c out ! c
7
The LLgen skeleton grammar to handle these two is:
SEQ { line=yylineno; oind=ind; } [ %if (line==yylineno) replicator process | [ %while (ind>oind) process ]* ]
This shows clearly that, a replicator must be on the same line as the SEQ, and new processes are collected as long as the indentation level of each process is greater than the indentation level of SEQ (with appropriate checks on this identation).
Different indentation styles are accepted, as long as the same amount of spaces is used for each indentation shift. The ascii tab character sets the indentation level to an eight space boundary. The first indentation level found in a file is used to compare all other indentation levels to.
It is now time to describe the implementation of some of the occam-specific features such as channels and NOW. Also the way communication with UNIX† is performed must be described.
For a thorough description of the library routines to simulate parallelism, which are e.g. used by the channel routines and by the PAR construct in Appendix B, see [6].
There are currently two types of channels (see Figure 1.) indicated by the type field of a channel variable:
- |
An interprocess communication channel with two additional fields: |
- |
A synchronization field to hold the state of an interprocess communication channel. |
8
- |
An integer variable to hold the value to be send. |
- |
An outside world communication channel. This is a member of an array of channels connected to UNIX files. Its additional fields are: |
- |
A flags field holding a readahead flag and a flag that tells if this channel variable is currently connected to a file. |
- |
A preread character, if readahead is done. |
- |
An index field to find the corresponding UNIX file. |
Figure 1. Interprocess and outside world communication channels
The basic channel handling is done by chan_in and chan_out. All other routines are based on them. The routine chan_any only checks if there’s a value available on a given channel. (It does not read this value!) C_init initializes an array of interprocess communication channels.
9
The following table shows Occam statements paired with the routines used to execute them.