Note 2, Programming Language Concepts (sestoft@dina.kvl.dk) 2002-02-13
----------------------------------------------------------------------

This lecture introduces the distinction between interpreters and
compilers, and demonstrates some concepts of compilation, taking the
simple expression language as an example. Some concepts of
interpretation are also illustrated, using a stack machine as an
example.

An interpreter executes a program on some input, producing an output
or result. An interpreter is usually itself a program, but one might
also say that an Intel x86 processor or a Compaq Alpha processor is an
interpreter, implemented in silicon. For an interpreter program we
must distinguish the interpreted language L (the language of the
programs being executed, for instance our expression language expr)
from the implementation language I (the language in which the
interpreter is written, for instance SML). When a program in the
interpreted language L is a sequence of simple instructions, and thus
looks like machine code, the interpreter is often called an abstract
machine or virtual machine.

A compiler takes as input a source program and generates as output
another (equivalent) program, called a target program. We must
distinguish three languages: the source language S (e.g. expr) of the
input programs, the target language T (e.g. texpr) of the output
programs, and the implementation language I (e.g. SML) of the compiler
itself.

The compiler does not execute the program; after the target program
has been generated, it must be executed by a machine or interpreter
that can execute programs written in language T. Hence we can
distinguish between compile-time (at which time the source program is
compiled into a target program) and run-time (at which time the target
program is executed on actual inputs to produce a result). At
compile-time one usually also performs various so-called
well-formedness checks of the source program: are all variables bound?
do operands have the correct type in expressions? etc.


Variables: free and bound occurrences
-------------------------------------

In a language with variable bindings, such as the let-binding of the
expression language, one distinguishes bound and free occurrences of a
variable. A variable occurrence is bound if it occurs within the
scope of a binding for that variable. That is, x occurs bound in the
body of this let-binding:

    let x = 6 in x + 3 end

but x occurs free in this one:

    let y = 6 in x + 3 end

and in this one:

    let y = x in y + 3 end

and it occurs free (the first time) as well as bound (the second
time) in this expression:

    let x = x + 6 in x + 3 end

An expression is closed if no variable occurs free in the expression.
In many programming languages, programs must be closed: they cannot
have unbound (undeclared) names. We can define a function

    closed1 : expr -> bool

to test whether an arithmetic expression is closed.

We define a function

    freevars : expr -> string list

to find a list of all those variables occurring free in an expression.
Closedness can then also be defined this way:

    fun closed2 e = (freevars e = [])


Using integer addresses instead of symbolic names
-------------------------------------------------

For efficiency, symbolic variable names are replaced by variable
addresses (integers) in real machine code, and in most interpreters.
To show how this may be done, we define target expressions texpr which
use (integer) variable indexes instead of symbolic variable names. A
function

    tcomp : expr -> string list -> texpr

compiles an expr to a texpr within a given compile-time environment.
The compile-time environment maps the symbolic names to integer
variable indexes. In the interpreter teval for texpr, a run-time
environment maps integers (variable indexes) to variable values (which
happen to be integers too in this case).

In fact, the compile-time environment in tcomp is just a list of the
bound variables.
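The definitions described so far might be sketched in SML as follows.
This is only an illustration under stated assumptions: the expr
constructors follow the examples used elsewhere in this note, while the
texpr constructors (TCstI, TVar, TLet, TPrim) and the helpers remove,
getindex, lookup and applyPrim are hypothetical names chosen here, and
eval is assumed to use a list of (name, value) pairs as its
environment.

```sml
(* Source expressions; constructor names follow the note's examples. *)
datatype expr =
    CstI of int
  | Var of string
  | Let of string * expr * expr
  | Prim of string * expr list

(* Target expressions with integer indexes; names are assumptions. *)
datatype texpr =
    TCstI of int
  | TVar of int
  | TLet of texpr * texpr
  | TPrim of string * texpr list

(* Remove every occurrence of x from a list. *)
fun remove x xs = List.filter (fn y => y <> x) xs

(* Variables occurring free in an expression (may contain duplicates). *)
fun freevars (CstI _) = []
  | freevars (Var x) = [x]
  | freevars (Let (x, erhs, ebody)) =
      freevars erhs @ remove x (freevars ebody)
  | freevars (Prim (_, es)) = List.concat (map freevars es)

fun closed2 e = (freevars e = [])

(* Position of x in the compile-time environment, innermost binding first. *)
fun getindex [] x = raise Fail ("unbound variable: " ^ x)
  | getindex (y :: ys) x = if x = y then 0 else 1 + getindex ys x

fun tcomp (CstI i) _ = TCstI i
  | tcomp (Var x) cenv = TVar (getindex cenv x)
  | tcomp (Let (x, erhs, ebody)) cenv =
      TLet (tcomp erhs cenv, tcomp ebody (x :: cenv))
  | tcomp (Prim (oper, es)) cenv =
      TPrim (oper, map (fn e => tcomp e cenv) es)

(* Only binary +, - and * are handled in this sketch. *)
fun applyPrim ("+", [a, b]) : int = a + b
  | applyPrim ("-", [a, b]) = a - b
  | applyPrim ("*", [a, b]) = a * b
  | applyPrim (oper, _) = raise Fail ("unknown primitive: " ^ oper)

fun lookup [] x = raise Fail ("unbound variable: " ^ x)
  | lookup ((y, v) :: env) x = if x = y then v else lookup env x

(* Interpreter for source expressions, with symbolic names. *)
fun eval (CstI i) _ = i
  | eval (Var x) env = lookup env x
  | eval (Let (x, erhs, ebody)) env =
      eval ebody ((x, eval erhs env) :: env)
  | eval (Prim (oper, es)) env =
      applyPrim (oper, map (fn e => eval e env) es)

(* Interpreter for target expressions, with integer indexes. *)
fun teval (TCstI i) _ = i
  | teval (TVar n) renv = List.nth (renv, n)
  | teval (TLet (trhs, tbody)) renv =
      teval tbody (teval trhs renv :: renv)
  | teval (TPrim (oper, tes)) renv =
      applyPrim (oper, map (fn te => teval te renv) tes)

(* let x = 6 in let y = 3 in x - y end end *)
val e2 = Let ("x", CstI 6, Let ("y", CstI 3, Prim ("-", [Var "x", Var "y"])))
val te2 = tcomp e2 []
(* te2 = TLet (TCstI 6, TLet (TCstI 3, TPrim ("-", [TVar 1, TVar 0]))) *)
```

In te2 the inner use of x becomes TVar 1 because one other binding (y)
lies between the occurrence and the binding of x, whereas y becomes
TVar 0. Both eval e2 [] and teval te2 [] give 3.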
The position of a variable in the list is its binding depth
(the number of other let-bindings between the variable occurrence and
the binding of the variable). By making sure that the run-time
environment has the same structure as the compile-time environment, we
can use the binding depth to access the variable at run-time. The
integer giving the position is called an offset by compiler writers,
and a de Bruijn index by theoreticians (in the lambda calculus): the
number of binders between this occurrence of a variable and its
binding.

The correctness requirements on a compiler can be stated as
equivalences such as this one:

    eval e []  equals  teval (tcomp e []) []

which says that

 * if te = (tcomp e []) is the result of compiling the closed
   expression e in the empty compile-time environment [],

 * then evaluation of the target expression te using the teval
   interpreter and empty run-time environment [] should produce the
   same result as evaluation of the source expression e using the eval
   interpreter and an empty environment [].


Stack machines for expression evaluation
----------------------------------------

Expressions, and more generally, functional programs, are often
evaluated by a stack machine. We show a simple stack machine (an
interpreter which implements an abstract machine) for evaluation of
expressions in postfix (or reverse Polish) form. Reverse Polish form
is named after the Polish philosopher and mathematician Jan
Lukasiewicz (1878-1956).

Stack machine instructions for an example language without variables
(and hence without let-bindings) may be described using this SML type:

    datatype rinstr =
        RCstI of int
      | RAdd
      | RSub
      | RMul
      | RDup
      | RSwap

The state of the stack machine is a pair (C, S) of the control and the
stack. The control is the sequence of instructions yet to be
evaluated. The stack is a list of values (here integers), namely,
intermediate results.
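An interpreter for this instruction type might be sketched as below,
with the control represented as an instruction list and the stack as an
int list. The name reval and its type are those used in this note; the
error handling with Fail, and the choice to return the top of the stack
when the control is empty, are assumptions of this sketch.

```sml
datatype rinstr =
    RCstI of int
  | RAdd
  | RSub
  | RMul
  | RDup
  | RSwap

(* reval : rinstr list -> int list -> int
   Execute the instructions one by one; each clause consumes the
   current instruction and updates the stack. *)
fun reval [] (v :: _) = v
  | reval [] [] = raise Fail "reval: no result on stack"
  | reval (RCstI i :: r) s = reval r (i :: s)
  | reval (RAdd :: r) (i2 :: i1 :: s) = reval r ((i1 + i2) :: s)
  | reval (RSub :: r) (i2 :: i1 :: s) = reval r ((i1 - i2) :: s)
  | reval (RMul :: r) (i2 :: i1 :: s) = reval r ((i1 * i2) :: s)
  | reval (RDup :: r) (i1 :: s) = reval r (i1 :: i1 :: s)
  | reval (RSwap :: r) (i2 :: i1 :: s) = reval r (i1 :: i2 :: s)
  | reval _ _ = raise Fail "reval: too few operands on stack"

(* (4 + 5) * 8 in postfix form *)
val r1 = reval [RCstI 4, RCstI 5, RAdd, RCstI 8, RMul] []   (* 72 *)

(* 7 * 7 + 9, using RDup to duplicate the 7 *)
val r2 = reval [RCstI 7, RDup, RMul, RCstI 9, RAdd] []      (* 58 *)
```

Note that the second operand is on top of the stack, so RSub computes
i1 - i2 where i2 was pushed last: reval [RCstI 10, RCstI 3, RSub] []
gives 7.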
The stack machine can be understood as a transition system, described
by rules such as these, which say how the machine may go from one
state to another:

    (RCstI i :: r,         s) ===> (r, i::s)
    (RAdd    :: r, i2::i1::s) ===> (r, (i1+i2)::s)
    (RSub    :: r, i2::i1::s) ===> (r, (i1-i2)::s)
    (RMul    :: r, i2::i1::s) ===> (r, (i1*i2)::s)
    (RDup    :: r,     i1::s) ===> (r, i1::i1::s)
    (RSwap   :: r, i2::i1::s) ===> (r, i1::i2::s)

A rule represents some possible state transitions. For instance, the
second rule says that the machine may go from a state of the form

    (RAdd :: r, i2::i1::s)

to a state of the form

    (r, (i1+i2)::s)

In the former state, RAdd is the current instruction, r is the rest of
the instruction sequence, and the integers i2 and i1 are on top of the
evaluation stack. In the latter state, r is the list of instructions
to be executed, and the sum (i1+i2) is now on the stack top.

The rules of the abstract machine are quite easily translated into an
SML function

    reval : rinstr list -> int list -> int

The machine terminates when there are no more instructions to execute
(or it might have an explicit Stop instruction). The result of a
computation is the value on top of the stack.

The net effect principle for stack-based evaluation says: regardless
of what is on the stack already, the net effect of the execution of an
instruction sequence generated from an expression e is to push the
value of e onto the evaluation stack, leaving the given contents of
the stack unchanged.

Expressions in postfix or reverse Polish form are used by scientific
pocket calculators made by Hewlett-Packard, popular primarily with
engineers and scientists. A significant advantage is that one can
avoid the silly parentheses found on other calculators; the
disadvantage is that the user must `compile' expressions from their
usual algebraic notation to stack machine notation.

Stack-based (interpreted) languages are widely used.
The most notable among them is Postscript (ca. 1984), which is
implemented in almost all high-end laser printers. By contrast,
Portable Document Format (PDF), also from Adobe Systems, is not a
full-fledged programming language.

Forth (ca. 1968) is another stack-based language, and an ancestor of
Postscript. It is used in embedded systems to control scientific
equipment, satellites etc.

In Postscript one writes

    4 5 add 8 mul

to compute (4+5)*8, and

    /x 7 def
    x x mul 9 add =

to bind x to 7 and then compute x*x+9 and print the result. (The `='
function pops a value from the stack and prints it). A name, such as
x, that appears by itself causes its value to be pushed onto the
stack. When defining the name, it must be escaped with a slash as in
/x.

The following defines the factorial function under the name fac:

    /fac { dup 0 eq { pop 1 } { dup 1 sub fac mul } ifelse } def

This is equivalent to the SML function

    fun fac n = if n=0 then 1 else n * fac(n-1)

Note that the ifelse conditional expression is postfix also, and
expects to find three values on the stack: a boolean, a then-branch,
and an else-branch. The then- and else-branches are written as code
fragments, which in Postscript are enclosed in curly braces.

Similarly, a for-loop expects four values on the stack: a start value,
a step value, and an end value for the loop index, and a loop body.
It repeatedly pushes the loop index and executes the loop body. Thus
one can compute and print the factorial of 0, 1, ..., 12 this way:

    0 1 12 { fac = } for

One can use the gs (Ghostscript) interpreter to experiment with
Postscript programs. Under Linux, use

    gs -dNODISPLAY

and on the IT-C MS Windows NT student machines, use

    C:\Aladdin\gs6.01\bin\gswin32 -dNODISPLAY

For a more convenient setup, run Ghostscript inside an Emacs shell
(under Unix or MS Windows).
If prog.ps is a file containing Postscript definitions, gs will
execute them on start-up if invoked with

    gs -dNODISPLAY prog.ps

A function definition entered interactively in Ghostscript must fit on
one line; a function definition included from a file need not.

This example Postscript program prints some text in Times Roman and
draws a rectangle. If you send this program to a Postscript printer,
it will be executed by the printer's Postscript interpreter, and a
sheet of printed paper will be produced:

    /Times-Roman findfont 25 scalefont setfont
    100 500 moveto
    (Printed by Postscript) show
    newpath
    100 100 moveto
    300 100 lineto 300 250 lineto
    100 250 lineto 100 100 lineto stroke
    showpage

A much fancier Postscript example (due to Morten Larsen, KVL) is found
in file expr/sierpinski.eps; it defines a recursive function that
draws a Sierpinski curve.

The Postscript Language Reference can be downloaded from
http://partners.adobe.com/asn/developer/technotes/postscript.html


Compilation of expressions (with variables) for a unified-stack machine
-----------------------------------------------------------------------

The datatype sinstr is the type of instructions for a stack machine
with variables, where the variables are stored on the evaluation
stack:

    datatype sinstr =
        SCstI of int        (* push integer            *)
      | SVar of int         (* push variable from env  *)
      | SAdd                (* pop args, push sum      *)
      | SSub                (* pop args, push diff.    *)
      | SMul                (* pop args, push product  *)
      | SPop                (* pop value/unbind var    *)
      | SSwap               (* exchange top and next   *)

Since both stk in reval and env in teval behave as stacks, and because
of lexical scoping, they could be replaced by a single stack, holding
both variable bindings and intermediate results. The important
property is that the binding of a let-bound variable can be removed
once the entire let-expression has been evaluated.

Thus we define a stack machine seval that uses a unified stack both
for storing intermediate results and bound variables.
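A machine of this kind might be sketched as follows. The name seval
and the sinstr type come from this note; the exact clauses, the use of
List.nth to fetch a variable at a given offset from the stack top, and
the Fail exceptions are assumptions of this sketch rather than the
course's actual implementation.

```sml
datatype sinstr =
    SCstI of int        (* push integer            *)
  | SVar of int         (* push variable from env  *)
  | SAdd                (* pop args, push sum      *)
  | SSub                (* pop args, push diff.    *)
  | SMul                (* pop args, push product  *)
  | SPop                (* pop value/unbind var    *)
  | SSwap               (* exchange top and next   *)

(* seval : sinstr list -> int list -> int
   The single stack holds intermediate results and let-bound values. *)
fun seval [] (v :: _) = v
  | seval [] [] = raise Fail "seval: no result on stack"
  | seval (SCstI i :: r) s = seval r (i :: s)
    (* SVar n copies the value n positions below the stack top *)
  | seval (SVar n :: r) s = seval r (List.nth (s, n) :: s)
  | seval (SAdd :: r) (i2 :: i1 :: s) = seval r ((i1 + i2) :: s)
  | seval (SSub :: r) (i2 :: i1 :: s) = seval r ((i1 - i2) :: s)
  | seval (SMul :: r) (i2 :: i1 :: s) = seval r ((i1 * i2) :: s)
  | seval (SPop :: r) (_ :: s) = seval r s
  | seval (SSwap :: r) (i2 :: i1 :: s) = seval r (i1 :: i2 :: s)
  | seval _ _ = raise Fail "seval: too few operands on stack"

(* Code for: let z = 17 in z + z end *)
val s1 = seval [SCstI 17, SVar 0, SVar 1, SAdd, SSwap, SPop] []   (* 34 *)
```

The final SSwap, SPop pair is what removes a let-binding: the result of
the body is on top, the bound value just below it, so swapping and
popping discards the binding and leaves only the result.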
We write a new version scomp to compile every use of a variable into
an (integer) offset from the stack top. The offset depends not only
on the variable declarations, but also on the number of intermediate
results currently on the stack. Hence the same variable may be
referred to by different indexes at different occurrences. In the
expression

    Let("z", CstI 17, Prim("+", [Var "z", Var "z"]))

the two uses of z in the addition get compiled to two different
offsets, like this:

    [SCstI 17, SVar 0, SVar 1, SAdd, SSwap, SPop]

The expression (20 + let z = 17 in z + 2 end + 30) is compiled to

    [SCstI 20, SCstI 17, SVar 0, SCstI 2, SAdd, SSwap, SPop, SAdd,
     SCstI 30, SAdd]

Note that the let-binding z = 17 is on the stack above the
intermediate result 20, but once the evaluation of the let-expression
is over, only the intermediate results 20 and 19 are on the stack, and
can be added.

Correctness: for an expression e with no free variables,

    seval (scomp e []) []  equals  eval e []

More general functional languages may be compiled to stack machine
code with stack offsets for variables. For instance, Moscow ML is
implemented that way, with a single stack for temporary results,
function parameter bindings, and let-bindings.

SML: polymorphic types, type variables, equality type variables,
polymorphic datatypes (and possibly higher-order functions, map,
foldr).
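The scomp compilation scheme described above might be sketched like
this. The idea is to keep a compile-time environment with one entry per
run-time stack slot, so that an unnamed intermediate result also shifts
the offsets of the variables below it. The datatype stackvalue and the
helpers getindex and primInstr are hypothetical names introduced here;
the sketch handles only binary +, - and *.

```sml
datatype expr =
    CstI of int
  | Var of string
  | Let of string * expr * expr
  | Prim of string * expr list

datatype sinstr =
    SCstI of int | SVar of int | SAdd | SSub | SMul | SPop | SSwap

(* One compile-time entry per run-time stack slot:
   Value is an unnamed intermediate result, Bound x a let-bound variable. *)
datatype stackvalue = Value | Bound of string

(* Offset of variable x from the stack top. *)
fun getindex [] x = raise Fail ("unbound variable: " ^ x)
  | getindex (Bound y :: cenv) x = if x = y then 0 else 1 + getindex cenv x
  | getindex (Value :: cenv) x = 1 + getindex cenv x

fun primInstr "+" = SAdd
  | primInstr "-" = SSub
  | primInstr "*" = SMul
  | primInstr oper = raise Fail ("unknown primitive: " ^ oper)

(* scomp : expr -> stackvalue list -> sinstr list *)
fun scomp (CstI i) _ = [SCstI i]
  | scomp (Var x) cenv = [SVar (getindex cenv x)]
  | scomp (Let (x, erhs, ebody)) cenv =
      (* the value of erhs stays on the stack as the binding of x;
         SSwap, SPop removes it once the body's result is on top *)
      scomp erhs cenv @ scomp ebody (Bound x :: cenv) @ [SSwap, SPop]
  | scomp (Prim (oper, [e1, e2])) cenv =
      (* while e2 is evaluated, the value of e1 occupies one stack slot *)
      scomp e1 cenv @ scomp e2 (Value :: cenv) @ [primInstr oper]
  | scomp (Prim (oper, _)) _ = raise Fail ("unsupported arity for " ^ oper)

val c1 = scomp (Let ("z", CstI 17, Prim ("+", [Var "z", Var "z"]))) []
(* c1 = [SCstI 17, SVar 0, SVar 1, SAdd, SSwap, SPop] *)

(* 20 + let z = 17 in z + 2 end + 30 *)
val c2 = scomp (Prim ("+", [Prim ("+", [CstI 20,
                  Let ("z", CstI 17, Prim ("+", [Var "z", CstI 2]))]),
                CstI 30])) []
(* c2 = [SCstI 20, SCstI 17, SVar 0, SCstI 2, SAdd, SSwap, SPop, SAdd,
         SCstI 30, SAdd] *)
```

In c1 the second use of z gets offset 1 rather than 0 because the value
of the first operand has already been pushed on top of the binding of z
when the second operand is evaluated.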