Note 6, Programming Language Concepts (sestoft@dina.kvl.dk) 2002-03-11
----------------------------------------------------------------------

We show how to evaluate a C-style imperative language with an interpreter, and present the concepts of expression, variable declaration, assignment, loop, output, variable scope, environment and store, lvalue and rvalue, pointer, array, and pointer arithmetic.


A simpler imperative language
-----------------------------

We start by considering a simple imperative language (file imp/imp.sml) with variables and expressions:

    datatype expr =
        CstI of int
      | Var of string
      | Prim of string * expr list

and assignment, conditional statements, statement sequences, for-loops, while-loops and a print statement:

    datatype stmt =
        Asgn of string * expr
      | If of expr * stmt * stmt
      | Seq of stmt list
      | For of string * expr * expr * stmt
      | While of expr * stmt
      | Print of expr

Variables are introduced as needed, as in Basic; there are no declarations. Unlike C/C++/Java/C#, the language has no blocks to delimit variable scope, only statement sequences.

For-loops are as in Pascal or Basic, not C/C++/Java/C#: a for-loop has the form

    for i = startval to endval do stmt

where start and end values are given for the controlling variable i, and the controlling variable cannot be changed inside the loop.

The store is naive: it simply maps variable names to values. This is similar to a functional language, but completely unrealistic for imperative languages.


The structure of the store in imperative languages
--------------------------------------------------

Real imperative languages such as C, Pascal and Ada, and imperative object-oriented languages such as C++, Java, C# and Ada95, have a more complex state (or store) model than functional languages:

* An environment maps variable names (x) to locations (0x34B2).

* An updatable store maps locations (0x34B2) to values (117).

It is useful to distinguish two kinds of values in such languages. When (e.g.)
a variable x or array element a[i] occurs as the target of an assignment statement:

    x = e

or as the operand of an increment operator (in C/C++/Java/C#):

    x++

or as the operand of an address operator (in C/C++/C#; see below):

    &x

then we use the lvalue (`left hand side value') of the variable or array element. The lvalue is the location (or address) of the variable or array element in the store.

Otherwise, when the variable x or array element a[i] occurs in an expression, we use its rvalue (`right hand side value'). The rvalue is the value stored at the variable's location in the store:

    x + 7

Only expressions that have a location in the store can have an lvalue. Thus in C/C++/Java/C# this expression makes no sense:

    (8 + 2)++

because the expression (8 + 2) does not have an lvalue.

In other words, the environment maps names to lvalues; the store maps lvalues to rvalues.

When we later study the compilation of imperative programs to machine code, we shall see that the environment exists only at compile-time, when the code is generated, and the store exists only at run-time, when the code is executed.

In all imperative languages the store is single-threaded: at most one copy of the store needs to exist at a time. That is because we never need to look back (for instance, by discarding all changes made to the store since a given point in time).


Parameter passing mechanisms
----------------------------

In a declaration of a procedure (or function or method)

    void p(int x, double y) { ... }

the x and y are called formal parameters or just parameters. In a call to a procedure (or function or method)

    p(e1, e2)

the expressions e1 and e2 are called actual parameters, or argument expressions.

When executing a procedure call p(e1, e2) in an imperative language, the values of the argument expressions must be bound to the formal parameters x and y somehow.
This so-called parameter passing can be done in several different ways:

* Call-by-value: a copy of the argument expression's value (rvalue) is made in a new location, and the new location is passed to the procedure. Thus updates to the corresponding formal parameter do not affect the actual parameter (argument expression).

* Call-by-reference: the location (lvalue) of the argument expression is passed to the procedure. Thus updates to the corresponding formal parameter will affect the actual parameter. Note that the actual parameter must have an lvalue. Usually this means that it must be a variable or an array element (or a field of an object or structure).

  Call-by-reference is useful for returning multiple results from a procedure. It is also useful for writing recursive functions that modify trees, so some binary tree algorithms are more elegant in Pascal, C, C++ or C# than in Java.

* Call-by-value-return: a copy of the argument expression's value (rvalue) is made in a new location, and the new location is passed to the procedure. When the procedure returns, the current value in that location is copied back to the argument expression (if it has an lvalue).

Pascal, C++, C#, and Ada permit both call-by-value and call-by-reference. Fortran (at least some versions) uses call-by-value-return.

C, Java and SML permit only call-by-value, but in C one can pass variable x by reference just by passing the address &x of x and making the corresponding formal parameter xp a pointer. Note that Java does not copy objects and arrays when passing them as parameters, because it passes (and copies) only references to objects and arrays (see Java Precisely, for instance). When passing an object by value in C++, the object gets copied. This is often not what is intended; for instance, if the object being passed is a file descriptor, the result is unpredictable.

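The three mechanisms can be imitated in plain C, which itself has only call-by-value. The following is a minimal sketch, not taken from the course files; the function names increment_copy, increment_in_place and increment_value_return are hypothetical and chosen only for illustration. It shows that passing the address &x and declaring the formal parameter as a pointer gives the effect of call-by-reference, and that copying in and copying back gives the effect of call-by-value-return.

```c
#include <assert.h>
#include <stddef.h>

/* Call-by-value: x is a new location holding a copy of the argument's rvalue. */
void increment_copy(int x) {
    x = x + 1;              /* changes only the local copy */
    (void)x;                /* silence "set but not used" warnings */
}

/* Simulated call-by-reference: xp holds the lvalue (address) of the argument. */
void increment_in_place(int *xp) {
    assert(xp != NULL);
    *xp = *xp + 1;          /* updates the caller's variable through the pointer */
}

/* Simulated call-by-value-return: copy in, work on the copy, copy back. */
void increment_value_return(int *xp) {
    assert(xp != NULL);
    int local = *xp;        /* copy the rvalue into a new location */
    local = local + 1;      /* the body works only on the copy */
    *xp = local;            /* copy the result back on return */
}
```

With a variable a holding 11, the call increment_copy(a) leaves a unchanged, whereas increment_in_place(&a) changes a to 12.
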
Here are a few examples (in C#, see file imp/Parameters.cs) to illustrate the difference between call-by-value and call-by-reference parameter passing.

The method swapV uses call-by-value:

    static void swapV(int x, int y) {
      int tmp = x; x = y; y = tmp;
    }

Putting a = 11 and b = 22, and calling swapV(a, b) has no effect at all on the values of a and b. In the call, the value 11 is copied to x, and 22 is copied to y, and they are swapped so that x is 22 and y is 11, but that does not affect a and b.

The method swapR uses call-by-reference:

    static void swapR(ref int x, ref int y) {
      int tmp = x; x = y; y = tmp;
    }

Putting a = 11 and b = 22, and calling swapR(ref a, ref b) will swap the values of a and b. In the call, parameter x is made to refer to the same location as a, and y to the same location as b. Then the contents of the locations referred to by x and y are swapped, which swaps the values of a and b also.

The method square uses call-by-value for its i parameter and call-by-reference for its r parameter. It computes i*i and assigns the result to r and hence to the actual argument passed for r:

    static void square(int i, ref int r) {
      r = i * i;
    }

After the call square(11, ref z), variable z has the value 121. Compare with the C example in file imp/ex5.c: it passes an int pointer r by value instead of passing an integer variable by reference.


The C programming language
--------------------------

The C programming language (designed by Dennis Ritchie at Bell Labs, USA, early 1970s, and described in the book by Kernighan and Ritchie) is widely used, and has influenced the syntax of core C++, Java, and C#. The C programming language descends from B (designed by Ken Thompson at Bell Labs, around 1970), which descends from BCPL (designed by Martin Richards at Cambridge and MIT, 1967), which descends from CPL, a research language designed by Christopher Strachey and others (at Cambridge, early 1960s?). The ideas behind CPL also influenced other languages.

The primary aspects of C modelled here are functions (procedures), parameter passing, arrays, pointers, and pointer arithmetic. The language presented here has no type checker (so far) and therefore is quite close to B, which was untyped.


Integers, pointers and arrays in C
----------------------------------

A variable i of type integer may be declared as follows:

    int i;

This reserves storage for an integer, and introduces the name i for that storage location. The integer is not initialized to any particular value.

A pointer p to an integer may be declared as follows:

    int *p;

This reserves storage for a pointer, and introduces the name p for that storage location. It does not reserve storage for an integer. The pointer is not initialized to any particular value. A pointer is a store address, essentially. The integer pointed to by p (if any) may be obtained by dereferencing the pointer:

    *p

An attempt to dereference an uninitialized pointer is likely to cause a Segmentation fault (or Bus error, or General protection fault), but it may instead just return an arbitrary value, which can give nasty surprises.

A dereferenced pointer may be used as an ordinary value (an rvalue) as well as the destination of an assignment (an lvalue):

    i = *p + 2;
    *p = 117;

A pointer to an integer variable i may be obtained by using the address operator (&):

    p = &i;

This assignment makes *p an alias for the variable i. The dereferencing operator * and the address operator & are inverses, so *&i is the same as i, and &*p is the same as p.

An array ia of 10 integers can be declared as follows:

    int ia[10];

This reserves a block of storage with room for 10 integers, and introduces the name ia for the storage location of the first of these integers. Thus ia acts as a pointer to an integer (in most expressions, C converts an array name to a pointer to the array's first element). The elements of the array may be accessed by the subscript operator ia[...], so ia[0] refers to the location of the first integer; thus ia[0] is the same as *ia.

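The relationships between the operators & and *, and between an array name and its first element, can be checked with a small C sketch. It is only an illustration under stated assumptions: the helper names store_through and first_element are hypothetical and do not come from the course files.

```c
#include <assert.h>
#include <stddef.h>

/* Stores v in the location p points to: *p is used as an lvalue. */
void store_through(int *p, int v) {
    assert(p != NULL);
    *p = v;
}

/* Reads the first element of an array by plain dereferencing:
   a[0] is the same as *a, so *a is used here as an rvalue. */
int first_element(const int *a) {
    assert(a != NULL);
    return *a;
}
```

For instance, after the declarations int i; and int *p = &i; the call store_through(p, 117) assigns 117 to i, because *p is an alias for i. Given int ia[10] = {7}; the call first_element(ia) returns 7, the same value as ia[0].
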
In general, since ia is a pointer, the subscript operator is just an abbreviation for dereferencing in combination with so-called pointer arithmetic. Thus

    ia[k]

is the same as

    *(ia+k)

where (ia+k) is simply a pointer to the k'th element of the array (counting from 0), obtained by adding k to the location of the first element, and clearly *(ia+k) is the contents of that location.


Type declarations in C
----------------------

In C, type declarations for pointer and array types have a tricky syntax, where the type of a variable x surrounds the variable name:

    int x            x is an integer
    int *x           x is a pointer to an integer
    int x[10]        x is an array of 10 integers
    int x[10][3]     x is an array of 10 arrays of 3 integers
    int *x[10]       x is an array of 10 pointers to integers
    int *(x[10])     x is an array of 10 pointers to integers
    int (*x)[10]     x is a pointer to an array of 10 integers
    int **x          x is a pointer to a pointer to an integer

The C type syntax is so obscure that there is a standard Unix/Linux program `cdecl' to help explain it. For instance,

    cdecl explain "int *x[10]"

prints

    declare x as array 10 of pointer to int

By contrast,

    cdecl explain "int (*x)[10]"

prints

    declare x as pointer to array 10 of int

The expression syntax for pointer dereferencing and array access is consistent with the declaration syntax, so if ipa is declared as

    int *ipa[10]

then *ipa[2] means the (integer) contents of the location pointed to by element 2 of array ipa, that is, *(ipa[2]), or in pure pointer notation, *(*(ipa+2)).

Similarly, if iap is declared as

    int (*iap)[10]

then (*iap)[2] is the (integer) contents of element 2 of the array pointed to by iap, or in pure pointer notation, *((*iap)+2).

Beware that the C compiler will not complain about the expression

    *iap[2]

which means something quite different, and most likely not what one intends. It means *(*(iap+2)), that is, add 2 to the address iap, take the contents of that location, use those contents as a location, and get its contents.
This may cause a Segmentation fault, or return arbitrary garbage.

This is one of the great risks of C: neither the type system (at compile-time) nor the runtime system provides much protection for the programmer.

File imp/c.sml presents an interpreter for uC, a subset of C with pointers, one-dimensional arrays, address arithmetic, and functions (procedures) without return.

Files imp/Absyn.sml, imp/grammar.txt, imp/Cpar.grm, imp/Clex.lex, and imp/parse.sml present the abstract syntax, and the lexer and parser specifications, for uC.

We do not model the return statement in functions because it represents a way to abruptly terminate the execution of a sequence of statements. This is easily implemented by translation to a stack machine, or by using a continuation-based interpreter, but it is rather cumbersome to encode in a direct-style interpreter.

One could implement return and local (block) scope as follows. Represent the remainder of the current method (that is, the statements yet to be executed) by a local continuation, which is a list of pairs of a list of statements and an environment. Each pair (stmts, env) represents the remainder of a given enclosing scope. Thus one can exit from an inner scope (block statement) to an outer scope by discarding the first element of the local continuation, and one can implement function return by discarding the current local continuation.

The global continuation, then, is a list of local continuations, each corresponding to a procedure invocation. Return discards the current local continuation; Abort discards the global continuation.

As a further refinement, the local continuation may be structured to allow implementation of (labelled) Break and Continue. In that case the local continuation should be a stack of lists of statements, distinguishing switch, for, while, do-while (with labels).


A micro-C interpreter
---------------------

File imp/c.sml presents an interpretive implementation of a tiny subset of the C programming language.
The interpreter's state is split into environment and store. Variables must be explicitly declared (as in C), but there is no type checking (as in B). The scope of a variable extends to the end of the innermost block enclosing its declaration. In the interpreter, the environment is used to keep track of variable scope and the next available store location, and the store keeps track of the locations' current values.

Later we shall compile the same language to bytecode for a stack machine, and even bytecode for the Java virtual machine.


Notes on Strachey's `Fundamental concepts ...'
----------------------------------------------

Christopher Strachey's lecture notes on `Fundamental Concepts in Programming Languages' from a Copenhagen Summer School on Programming in 1967 were circulated in manuscript and highly influential, although they were not formally published until 25 years after Strachey's death. They are especially noteworthy for introducing concepts such as lvalue, rvalue, ad hoc polymorphism, and parametric polymorphism, which shape our ideas of programming languages even today. Moreover, a number of the language constructs discussed in the notes made their way into CPL, and hence into BCPL, B, C, C++, Java, and C#.

* The CPL assignment:

      i := (a > b -> j,k)

  naturally corresponds to this in C, C++, Java, C#:

      i = (a > b ? j : k)

  Also the CPL assignment:

      (a > b -> j, k) := i

  can be expressed in GNU C:

      (a > b ? j : k) = i

  and it can be encoded using pointers and the dereferencing and address operators in all versions of C:

      *(a > b ? &j : &k) = i

  In Java and standard (ISO) C, conditional expressions cannot be used as lvalues. In fact the GNU C compiler (gcc -c -pedantic assign.c) says:

      warning: ISO C forbids use of conditional expressions as lvalues

* The CPL definition in section 2.3:

      let q equals-tilde p

  which defines the lvalue of q to be the lvalue of p, has no counterpart in other languages as far as I know.
  But call-by-reference parameter passing

      void m(ref int q) { ... }
      ... m(ref p) ...

  does exactly that, when q is a formal parameter and p is the corresponding argument expression: the lvalue of q is defined to be the lvalue of p.

* The semantic functions L and R in section 3.3 are applied only to an expression epsilon and a store sigma, but should in fact be applied also to an environment, as in our imp/c.sml, if the details are to work out properly.

* Note that the CPL block delimiters `paragraph' and `strike-through-paragraph' in section 3.4.3 are the grandparents (via BCPL and B) of C's block delimiters { and }. The latter are used also in C++, Java, and C#.

* The discussion in section 3.4.3 (of the binding mechanism for the free variables of a function) can appear rather academic until one realizes that in SML a function closure always stores the rvalue of free variables, whereas in Java an object stores essentially the lvalue of fields that appear in a method. In Java a non-static method m can have as `free variables' the fields of the enclosing object, and the method refers to those fields via the object reference this. As a consequence, subsequent assignments to the fields affect the (r)value seen by the field references in m.

  Moreover, when a method mInner is declared (in Java) in a local inner class CInner inside a method mOuter, then mInner can refer to the variables and parameters of method mOuter only if those variables and parameters are declared final (not updatable):

      class COuter {
        void mOuter(final int p) {
          final int q = 20;
          class CInner {
            void mInner() { ... p ... q ... }
          }
        }
      }

  In reality the rvalue of the variables and parameters is passed, but when the free variables are non-updatable, there is no observable difference between passing the lvalue and the rvalue. Thus the purpose of the `final' restriction in Java is to make free variables from the enclosing method appear to behave the same as free fields from the enclosing object.
  Since C# does not have local classes at all, this question does not arise.

* The type declaration in section 3.7.2 is quite cryptic, but roughly corresponds to this declaration in SML:

      datatype LispList = LAtom of Atom
                        | LCons of Cons
      and Atom = Atom of { PrintName : string, PropertyList : Cons }
      and Cons = CNil
               | Cons of { Car : LispList, Cdr : Cons }

  or these declarations in Java (where the Nil pointer case is implicit):

      abstract class LispList {}
      class Cons extends LispList { LispList Car; Cons Cdr; }
      class Atom extends LispList { String PrintName; Cons PropertyList; }

  In addition, constructors and field selectors should be defined.

* Note that section 3.7.6 describes C and C++ pointers: Follow[p] is just *p, and Pointer[x] is &x.

* The `load-update-pairs' mentioned in section 4.1 are called properties nowadays in the Common Lisp Object System, Visual Basic, and C#: get-methods and set-methods.


Literature
----------

* Brian W Kernighan and Dennis M Ritchie: The C Programming Language, Second edition, Prentice-Hall 1988. Read pages 93-107 from Chapter 5.

* Christopher Strachey: Fundamental Concepts in Programming Languages, 1967. Reprinted in Higher-Order and Symbolic Computation 13 (2000) 11-49.

* Dennis M Ritchie: The Development of the C Language. Second History of Programming Languages Conference, Cambridge, Massachusetts, April 1993.

* Various materials on the history of B (including a wonderfully short User Manual from 1972) and C may be found from Dennis Ritchie's home page: http://www.cs.bell-labs.com/who/dmr/

* A modern portable implementation of BCPL (which must otherwise be characterized as a dead language) is available from Martin Richards's homepage: http://www.cl.cam.ac.uk/~mr/