Note 7, Programming Language Concepts (sestoft@dina.kvl.dk) 2002-03-19
----------------------------------------------------------------------

In seminar 2 we considered a very simple stack-based abstract machine
for the evaluation of expressions with variables and variable
bindings.  Here we continue that work and extend the abstract machine
so that it can execute programs compiled from an imperative language
(micro-C).  We also write a compiler from micro-C to this abstract
machine.

Thus the phases of compilation and execution are

  * lexing from characters to tokens
  * parsing from tokens to abstract syntax tree
  * (static checks) check types, check that variables are declared, ...
  * code generation from abstract syntax to symbolic instructions
  * code emission from symbolic instructions to numeric instructions
  * execution of the numeric instructions by an abstract machine


An abstract stack machine
-------------------------

We define a stack-based abstract machine for the execution of simple
imperative programs, specifically micro-C programs.

The state of the abstract machine has the following components:

  * a program p: an array of instructions.  Each instruction is
    represented by a number 0, 1, ... possibly followed by one or more
    operands in the next program locations.  The array is indexed by
    the numbers (code addresses) 0, 1, ... as usual.
  * a program counter pc indicating the next instruction to execute
  * a stack s of integers, indexed by the numbers 0, 1, ...
  * a stack pointer sp, pointing at the stack top; the next available
    stack position is s[sp+1]
  * a base pointer bp, pointing at the base of the current activation
    record (or stack frame); this is where the first parameter or
    variable of the current function is stored

The abstract machine is implemented in Java; see file Machine.java.
To run a compiled program file using this abstract machine, execute:

    java Machine ...

where the arguments ...
are parsed as integers and given to the program's initial function,
that is, the `main' function of the micro-C program.  To print a trace
of every instruction executed (very useful for debugging the abstract
machine or the compiler), execute it as follows:

    java Machinetrace ...

In both cases, the program file is a text file containing two code
sequences, that is, sequences of integers.  The code from address 1
onwards must set aside space for the global variables (those declared
outside functions in micro-C programs) and initialize them; typically
this code ends with STOP.  The code from the address indicated by p[0]
and onwards is the program proper.

Here is an example program (file imp/prog0) for the abstract machine.
The program proper starts at address p[0], that is, 2; the global
initialization prelude consists of the single instruction STOP.  The
program prints the infinite sequence of numbers n, n+1, ..., where n
is given on the command line:

    2 24 22 0 1 1 16 2

In symbolic machine code that is:

    2; STOP; PRINTI; 1; ADD; GOTO 2

Here is another program (in file imp/prog1) that loops 20 million
times:

    2 24 0 20000000 16 9 0 1 2 9 18 6 24

or, in symbolic machine code:

    2; STOP; 20000000; GOTO 9; 1; SUB; DUP; IFNZRO 6; STOP

Loading and interpreting this takes less than 2.5 seconds with Sun JDK
1.3 HotSpot on a 600 MHz AMD K7 running Linux.  The equivalent micro-C
program (file ex8.c) compiled by the compiler presented in this
lecture is 4 times slower than the above hand-written machine code.


The abstract machine instruction set
------------------------------------

The abstract machine has 25 instructions.  Most instructions are
single-word instructions consisting of the instruction code only, but
some instructions take one, two or three integer arguments,
representing constants (denoted by m, n) or program addresses (denoted
by a).

The execution of an instruction has an effect on the stack, on the
program counter, and on the console: the program may print something.
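
The machine state (p, pc, s, sp) and the two-phase execution of a
program file (prelude from address 1, then the program proper from
p[0]) can be illustrated with a small Java sketch.  This is not the
code of Machine.java: the class name MiniMachine, the step budget
maxSteps and the output buffer are assumptions made for illustration,
only the eight instructions used by prog0 and prog1 are handled, bp is
not modelled, and the opcode numbers are read off the numeric listings
above.

```java
class MiniMachine {
    // Opcode numbers as they appear in the numeric listings of prog0 and prog1.
    static final int CST = 0, ADD = 1, SUB = 2, DUP = 9,
                     GOTO = 16, IFNZRO = 18, PRINTI = 22, STOP = 24;

    final int[] p;                  // the program
    final int[] s = new int[1000];  // the stack
    int sp = -1;                    // stack top; next free position is s[sp+1]
    int steps = 0;                  // instructions executed so far (illustration only)
    final StringBuilder out = new StringBuilder();  // stands in for the console

    MiniMachine(int[] p) { this.p = p; }

    // Execute from address pc until STOP, or until maxSteps instructions have run.
    void exec(int pc, int maxSteps) {
        while (steps < maxSteps) {
            steps++;
            int op = p[pc++];
            switch (op) {
                case CST:    s[++sp] = p[pc++]; break;
                case ADD:    s[sp - 1] = s[sp - 1] + s[sp]; sp--; break;
                case SUB:    s[sp - 1] = s[sp - 1] - s[sp]; sp--; break;
                case DUP:    s[sp + 1] = s[sp]; sp++; break;
                case GOTO:   pc = p[pc]; break;
                case IFNZRO: pc = (s[sp--] != 0) ? p[pc] : pc + 1; break;
                case PRINTI: out.append(s[sp]).append(' '); break;
                case STOP:   return;
                default:     throw new IllegalStateException("unsupported opcode " + op);
            }
        }
    }

    // Run the prelude, push the arguments, then run the program proper.
    static String run(int[] p, int maxSteps, int... args) {
        MiniMachine m = new MiniMachine(p);
        m.exec(1, maxSteps);
        for (int a : args) m.s[++m.sp] = a;
        m.exec(p[0], maxSteps);
        return m.out.toString();
    }

    public static void main(String[] args) {
        int[] prog0 = {2, 24, 22, 0, 1, 1, 16, 2};
        // prog0 never stops, so the step budget cuts it off after three rounds.
        System.out.println(run(prog0, 13, 5));
    }
}
```

The step budget exists only because prog0 loops forever; a version of
prog1 with a smaller loop count runs to its final STOP and leaves the
single value 0 on the stack.
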

The stack effect of each instruction is shown below, as a transition
s1 ==> s2 from the stack s1 before instruction execution to the stack
s2 after the instruction execution.  In both stacks, the stack top is
on the left, and colon (:) is used to separate stack elements:

  CST i        s                ==>  i:s           Push integer i onto stack
  ADD          i2:i1:s          ==>  (i1+i2):s     Add
  SUB          i2:i1:s          ==>  (i1-i2):s     Subtract
  MUL          i2:i1:s          ==>  (i1*i2):s     Multiply
  DIV          i2:i1:s          ==>  (i1/i2):s     Divide (integer quotient)
  MOD          i2:i1:s          ==>  (i1%i2):s     Modulo (remainder)
  EQ           i2:i1:s          ==>  (i1=i2):s     Equality test
  LT           i2:i1:s          ==>  (i1<i2):s     Less-than test
  NOT          v:s              ==>  ~v:s          Logical negation
  DUP          v:s              ==>  v:v:s         Duplicate top element
  SWAP         v2:v1:s          ==>  v1:v2:s       Swap two topmost elements
  LDI          i:s              ==>  s[i]:s        Load indirect
  STI          v:i:s            ==>  v:s[v/i]      Store indirect
  GETBP        s                ==>  bp:s          Load base pointer
  GETSP        s                ==>  sp:s          Load stack pointer
  INCSP m      s                ==>  vm:...:v1:s   Increment stack pointer, where
                                                   the vm...v1 are arbitrary and
                                                   m >= 0
  INCSP m      v(-m):...:v1:s   ==>  s             Decrement stack pointer, where
                                                   m < 0
  GOTO a       s                ==>  s             Go to instruction at address a
  IFZERO a     v:s              ==>  s             If v is zero, then go to
                                                   instruction at address a
  IFNZRO a     v:s              ==>  s             If v is nonzero, then go to
                                                   instruction at address a
  CALL m a     vm:...:v1:s      ==>  vm:...:v1:bp:r:s
                 Call function with m arguments: push return address r
                 (the address after the CALL instruction), push the
                 current base pointer bp, set bp to address of v1, and
                 go to instruction at address a
  TCALL m n a  vm:..:v1:un:..:u1:b:r:s  ==>  vm:...:v1:b:r:s
                 Tail-call function with m arguments: after removing n
                 old variable values un:...:u1 from the stack, go to
                 instruction at address a
  RET m        v:vm:...:v1:b:r:s  ==>  v:s
                 Return from function: remove m old variables from the
                 stack, set the base pointer bp equal to b, and go to
                 the instruction at the return address r
  PRINTI       v:s              ==>  v:s           Print v as an integer (in
                                                   decimal format)
  PRINTC       v:s              ==>  v:s           Print the ASCII character with
                                                   character code v
  STOP         s                ==>  _             Halt the program execution

Notation:

  * ~i is logical negation on integers: 1 if i is 0,
    and 0 otherwise
  * s[i] is the contents of the i'th stack position, where 0 is the
    bottom (right-most) stack position.
  * s[v/i] is the stack in which position i has been set to value v
  * sp is the address of the stack top (that is, stack length minus 1)
  * bp is the base pointer; when executing a function, that is the
    address of the function's first variable or parameter

Some instruction sequences are equivalent to others; this fact will be
used to improve the compiler (in a later seminar).  Alternatively, one
could use the equivalences to reduce the instruction set of the
abstract machine, which would simplify the machine but slow down the
execution of programs.  For instance, instruction NOT could be
simulated by the sequence (0; EQ), and each of the instructions IFZERO
and IFNZRO can be simulated by NOT and the other one.

Here are some example equivalences, where (nothing) stands for the
empty instruction sequence:

    NOT; NOT            ===  (nothing)    when the operand is 0 or 1
    0; ADD              ===  (nothing)
    0; SUB              ===  (nothing)
    1; MUL              ===  (nothing)
    1; DIV              ===  (nothing)
    0; EQ               ===  NOT
    0; IFZERO a         ===  GOTO a
    0; IFNZRO a         ===  (nothing)
    NOT; IFZERO a       ===  IFNZRO a
    NOT; IFNZRO a       ===  IFZERO a
    INCSP 0             ===  (nothing)
    INCSP m1; INCSP m2  ===  INCSP (m1+m2)
    INCSP m1; RET m2    ===  RET (m2-m1)


The symbolic machine code
-------------------------

To simplify code generation in our compilers, we define a symbolic
machine code as an SML datatype (file Machine.sml), and also provide
SML functions to emit a list of symbolic machine instructions to a
file as numeric instruction codes.  In addition, we permit the use of
symbolic labels instead of absolute code addresses.  The code emitter
replaces the labels by absolute code addresses.
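
Replacing labels by absolute addresses is a two-pass job: first record
the address at which each label falls, then emit the numbers.  The
following Java sketch shows the idea.  It is not the SML emitter of
Machine.sml: the class name LabelEmitter, the representation of a
symbolic instruction as a String array, and the pseudo-instruction name
"LABEL" are assumptions made for illustration, and the opcode numbers
are those read off the numeric listing of imp/prog0 (only a subset is
covered).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LabelEmitter {
    // Opcode numbers taken from the numeric listings of prog0 and prog1.
    static int opcode(String name) {
        switch (name) {
            case "CST":    return 0;
            case "ADD":    return 1;
            case "SUB":    return 2;
            case "DUP":    return 9;
            case "GOTO":   return 16;
            case "IFNZRO": return 18;
            case "PRINTI": return 22;
            case "STOP":   return 24;
            default: throw new IllegalArgumentException("unknown instruction " + name);
        }
    }

    // Emit numeric code for instructions that will be placed at address start.
    static List<Integer> emit(String[][] code, int start) {
        // Pass 1: a label denotes the address of the next real instruction.
        Map<String, Integer> labels = new HashMap<>();
        int addr = start;
        for (String[] ins : code) {
            if (ins[0].equals("LABEL")) labels.put(ins[1], addr);
            else addr += ins.length;           // opcode plus its operands
        }
        // Pass 2: emit opcodes and operands; labels emit nothing.
        List<Integer> out = new ArrayList<>();
        for (String[] ins : code) {
            if (ins[0].equals("LABEL")) continue;
            out.add(opcode(ins[0]));
            if (ins[0].equals("CST")) {
                out.add(Integer.parseInt(ins[1]));
            } else if (ins.length == 2) {      // GOTO or IFNZRO with a label operand
                Integer target = labels.get(ins[1]);
                if (target == null)
                    throw new IllegalArgumentException("undefined label " + ins[1]);
                out.add(target);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[][] body = {
            {"LABEL", "L1"}, {"PRINTI"}, {"CST", "1"}, {"ADD"}, {"GOTO", "L1"}
        };
        // The program proper of prog0 starts at address 2.
        System.out.println(emit(body, 2));
    }
}
```

Two passes are needed because a jump may refer to a label that is
defined later in the instruction list.
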

Thus the above program prog0 could be written in symbolic form as
follows, as a pair of the number of globals (0) and a list of symbolic
instructions:

    (0, [Label (Lab "L1"),
         PRINTI,
         CSTI 1,
         ADD,
         GOTO (Lab "L1")])

Note that Label is a pseudo-instruction; it serves only to indicate a
position in the code and gives rise to no instruction in the numeric
code:

    0 21 0 1 1 15 1

Abstract machines, or virtual machines, are very widely used for
implementing or describing programming languages; examples include
Postscript, Forth, Visual Basic, the Java Virtual Machine, and
Microsoft IL.  More on this in a later seminar.


A compiler from micro-C to abstract machine code
------------------------------------------------

The compiler (in file imp/comp.sml) compiles micro-C programs into
sequences of instructions for this abstract machine.  The compiler
works in the following stages (function cProgram does stages 1 and 2,
and function compile2file does stage 3):

  * Stage 1: Find all global variables and generate code to initialize
    them.
  * Stage 2: Compile micro-C abstract syntax with symbolic variable and
    function names to symbolic abstract machine code with numeric
    addresses for variables, and symbolic labels for functions.  One
    list of symbolic instructions is created for each function.
  * Stage 3: Join the global initialization code and the per-function
    lists of symbolic instructions (which still contain symbolic
    labels), and emit the result to a text file as numeric machine
    instructions (using absolute code addresses instead of labels).

Expressions are compiled to reverse Polish notation as before, and are
evaluated on the stack.

Function arguments and local variables (integers, pointers and arrays)
are all allocated on the stack, and are accessed relative to the base
pointer of the current stack frame (using the base pointer register
bp).  Global variables are allocated at the bottom of the stack and
are accessed using absolute addresses into the stack.
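
Compiling an expression to reverse Polish notation means emitting the
code for the operands first, left operand before right operand, and
the operator last.  The Java sketch below shows this for constants and
binary operators only.  It is not the cExpr function of imp/comp.sml:
the names PostfixSketch, Expr, cst and bin are hypothetical, variables
and environments are left out, and instructions are represented simply
as strings.

```java
import java.util.ArrayList;
import java.util.List;

class PostfixSketch {
    // An expression knows how to append its own code to an instruction list.
    interface Expr { void compile(List<String> code); }

    // A constant compiles to a single push instruction.
    static Expr cst(int i) {
        return code -> code.add("CST " + i);
    }

    // A binary operator: code for e1, then code for e2, then the operator,
    // so that the stack holds i2:i1:s when the operator executes.
    static Expr bin(String op, Expr e1, Expr e2) {
        return code -> { e1.compile(code); e2.compile(code); code.add(op); };
    }

    static List<String> compile(Expr e) {
        List<String> code = new ArrayList<>();
        e.compile(code);
        return code;
    }

    public static void main(String[] args) {
        // (2 + 3) * 4
        Expr e = bin("MUL", bin("ADD", cst(2), cst(3)), cst(4));
        System.out.println(compile(e));   // prints [CST 2, CST 3, ADD, CST 4, MUL]
    }
}
```

Whatever the size of the expression, the generated code leaves exactly
one extra value on the stack: every operand pushes one value, and every
binary operator replaces two values by one.
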

The stack consists of

  * a block of global variables, including global arrays
  * a sequence of stack frames for active function calls

A stack frame (or activation record) has the following structure:

  * return address
  * the old base pointer (that is, the calling function's base pointer)
  * the function's parameters
  * local variables and intermediate results of expressions (temporary
    values)

The offset of a local variable relative to the base pointer is
constant, and is recorded in the compile-time environment.

The main compilation functions are:

cProgram
    Compile an entire micro-C program to two instruction sequences: one
    for initialization of global variables, and one for the program
    proper.  The latter will consist of a call to the `main' function,
    followed by code for all functions, including `main'.

cStmt
    Compile a micro-C statement into a sequence of instructions.  The
    compilation takes place in a compile-time environment which maps
    global variables to absolute addresses in the stack (at the bottom
    of the stack), and maps local variables to offsets from the base
    pointer of the current stack frame.  Also, a global function
    environment maps function names to symbolic labels.

cStmtOrDec
    Compile a statement or declaration (as found in a statement block)
    to a sequence of instructions, either for the statement or for
    allocation of the declared (int or array or pointer) variable.
    Return the resulting frame depth and a possibly extended
    environment.

cExpr
    Compile a micro-C expression into a sequence of instructions.  The
    compilation takes place in a compile-time environment.  Net effect
    principle: If the compilation (cExpr e env) of expression e returns
    the instruction sequence instrs, then the execution of instrs will
    leave the value of expression e on the stack top (and thus extend
    the current stack frame with one element).

cAccess
    Compile an access (variable, pointer dereferencing, or array
    indexing) into a sequence of instructions, again relative to a
    compile-time environment.

cExprs
    Compile a list of expressions into a sequence of instructions.

allocate
    Compile a variable and declared type to a sequence of instructions
    for allocating the variable.  Return the resulting frame depth and
    a possibly extended environment.

[There's much more to say, but not now].


Compilation of the return statement
-----------------------------------

The compilation of a return statement with an argument expression:

    return e;

is straightforward: we generate code to evaluate e and then RET m,
where m is the number of temporaries on the stack.  If the
corresponding function call is part of an expression in which the
value is used, then the value will be on the stack top as expected.
If the call is part of an expression statement f(...); then the value
is discarded by an INCSP ~1 instruction or similar (~1 is the SML
notation for -1).

In a void function, a return statement has no argument expression

    return;

Also, the function may return simply by reaching the end of the
function body.  This kind of return can be compiled to RET (m-1),
where m is the number of temporary values on the stack.  This has the
effect of leaving a junk value, the topmost temporary, on the stack
top:

    RET (m-1)    vm:...:v1:b:r:s  ==>  vm:s

Note that in the extreme case where m=0, the junk value will be the
old base pointer b, which at first seems completely wrong:

    RET -1       b:r:s  ==>  b:s

However, a void function f may be called only by an expression
statement f(...);, so this junk value is ignored and cannot be used by
the calling function.
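
The stack effect of RET, including the RET -1 case, can be checked
with a small Java simulation of the transition
v:vm:...:v1:b:r:s ==> v:s.  The class RetSketch and its push helper
are hypothetical names introduced for illustration; this is a sketch
of the transition rule only, not the code of Machine.java.

```java
class RetSketch {
    final int[] s = new int[100];  // the stack
    int sp = -1;                   // stack top
    int bp = 0;                    // base pointer
    int pc = 0;                    // program counter

    void push(int v) { s[++sp] = v; }

    // RET m:  v:vm:...:v1:b:r:s ==> v:s, then bp = b and pc = r.
    void ret(int m) {
        int v = s[sp];     // the return value (or junk, in a void function)
        sp = sp - m - 1;   // drop v and the m variables; s[sp] is now b
        bp = s[sp--];      // restore the caller's base pointer
        pc = s[sp--];      // continue at the return address
        s[++sp] = v;       // leave v on top of the caller's stack
    }

    public static void main(String[] args) {
        RetSketch k = new RetSketch();
        k.push(100);                 // s: something belonging to the caller
        k.push(42);                  // r: return address
        k.push(7);                   // b: old base pointer
        k.ret(-1);                   // void function with no variables (m = 0)
        // The junk value left on top is the old base pointer.
        System.out.println(k.s[k.sp] + " " + k.bp + " " + k.pc);   // prints 7 7 42
    }
}
```

With m = -1 the first adjustment of sp cancels out, so the value that
is "returned" is the very cell that holds b, which matches the
transition RET -1 shown above.
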