Paul Haahr / Essays

Notes on Programming Langugage Design (1994)

Paul Haahr
22 January 1994 (written)
7 November 2001 (released on the web)


I wrote this essay as private email to a friend -- the ``you'' in the text -- who was embarking on a programming language design project, long since abandoned. I've reformatted in HTML, fixed a couple of typos, but otherwise left the content exactly as I found it.

I'm a little embarrassed about the simplicity and superficiality of these recommendations, but I like how directly it caught my thinking at the time: before I done any serious work on Dylan, before I'd heard of Java, and before I'd written a realistic compiler.

These are some notes I've put together as the user of many programming languages, and the reader of descriptions of many more. I've never designed a language well myself, so take everything with a grain of salt.

  1. To garbage collect, or not to garbage collect? That is the most important question. We've talked a lot about this, but which ever way you decide, you'll have to live with the consequences of your decision: either you will have some measurable loss in efficiency or the language will be unsafe.

    [By the way, I said earlier that being able to take the address of a variable did not necessarily make the language unsafe. I was wrong. If you take the address of a stack variable but dereference it after the procedure which owns the variable has exited, you are unsafe. That's why pascal had reference parameters but no address-of operation. This is probably the right thing in a garbage collected language.]

  2. Minimize the number of ``syntactic domains.'' These are the little languages within the language for doing different types of things. In C, I'd identify at least three: the type and declaration syntax, the expression syntax, and the statement syntax. (The preprocessor may be a fourth, but nobody really counts that as part of the language, at least when they're trying to be fair.) Scheme, on the other hand, has one such domain: s-exprs.

    It's almost certain that you need a syntactic domain for expressions. Recently, I've become more and more that the statement/expression distinction is unnecessary and gets in the way of doing the things you want to do -- declare a local variable inside an expression or some such. On the other hand, some questions are raised by not having statements. For example, what value does a loop return? (IMHO, void, but it's not that important.)

    Another example is goto: how do you express it in an expression-only language. Dylan (w/infix syntax) has a nice approach, the block statement, which covers some cases of goto, where most of the others are nominally covered by recursion. The block statement is ([] indicate optional things)

       block ([exit-function])
          e1;
          e2;
          e3[;]
       end [block]
    
    An arbitrary number of expressions are allowed, of course. The block statement returns the value of the last expression under normal circumstances. If the exit-function is named (the parens are not optional), then a variable of that name is bound inside the block to a function which, if called, returns from the block, returning its argument as the value of the block. The function can be passed down the call stack, but passing it up is an error. (It's effectively a restricted form of call/cc.)

    Types and declarations may deserve to be a separate syntactic domain, especially in something like C where types are not first class and form a separate semantic domain from values. However, one could imagine types being declared with a syntax reminiscent of c exprssions. For example

       main: func(int, tuple(int, pointer(pointer(char))));
    
    instead of
       int main(int, char **);
    

  3. Minimize the number of ``semantic domains.'' These are the different kinds of ``things'' in the language. (I would say value or object instead of thing, but those words carry too much meaning.) In Scheme, I would identify one semantic domain: values. In C, there are at least two (types and values), and possibly more (goto labels, maybe include files?). There's good reason to have a stratified type system, where all computations involving types are guaranteed to complete at compile time (static typing, to first order), but all sorts of things (exceptions, module systems, processes) are often segregated from values, and I can't understand why.

  4. Make as many things as possible first-class objects. I can't decide this is the same issue as the previous one. Basically, the question I would ask myself is, whenever I introduce a new thing in the language that is named, can it be assigned to variables, passed to and returned from functions, etc. If not, why not?

    For example, in cedar, modula, and oberon, modules are external to programs: you declare what modules you import, what your module name is, and so on, but is impossible to bind a module to a variable or ask what is the type of this module. (You can ask about its interface, but you can't declare a variable -- as opposed to a module -- which has that interface.) At the same time, modules are not much different from structs.

    By the way, this applies to types as well as values. Are there things which are best described as types which are not types. For example, eiffel equates classes and modules. All types can be declared inside a file except the most important kind of type in the language, classes, which must be files unto themselves. That is, every file is a class and every class is a file. (It's actually a more workable concept than I'm saying here, but Eiffel too much decides for you how it wants you to work.)

    In some ways, this is the ``objects all the way down'' question in object oriented languages. In some O-O languages (Smalltalk, Self, Eiffel v2 and later, Dylan), every value is an object; in others (C++, CLOS, early Eiffel), things like ints and floats aren't objects, they're a separate form of type. You can do some things with types, other things with classes, and they intersect in some places.

    A last example: C's goto labels are segregated from every other value in the language. Why can't I pass one to a function or store it in a variable? (c.f. Dylan's block, above.) GCC extends C by letting you use labels as values.

  5. Don't be afraid to do with libraries what other languages build in. For example, even when designing a language around concurrency, it's ok to use function calls to implement multiple processes rather than special syntax and funky types. If your language is polymorphic, something like a chan(t) is just a polymorphic type and doesn't need any special handling. (Of course, the functions for actually doing concurrency operations have to be special, but that doesn't mean they can't appear to be like other functions.)

  6. Think about whether you want to have explicit pointers at all, and whether aggregate structures (structs, unions, objects, etc.) are immediate or always pointers. Some things (inheritance, polymorphism) are much easier if you encode all structs, etc, as pointers rather than direct structures. I don't know of any hard data, but my intuition tells me that explicit pointers and structs not as references (ala C) is more efficient, but there are strong linguistic arguments in favor of the ``everything is a pointer'' approach.

  7. Completeness is not as important as I think it is. All modern programming languages are Turing complete, so, in practice, there is no computation they can not do. Given a simple procedural interface to operating system services -- e.g., section 2 of the Unix manual -- programs in a language have the same ability as those of any other language to control interaction with the outside world.

    What's left is a fuzzy notion of completeness, that some things are very convenient to do in one language and are extremely verbose in another. That's a fact of life. There will always be some operation which your language takes more characters and more work on behalf of the programmer to do that C, or ML, or Eiffel, or Prolog, or APL, will do naturally and as a built in operation.


Now, here's a list of nice features that are not essential to design, but are things I like. Some are completely gratuitous, none require much work to do. (except perhaps (a), which is the most important.)

  1. Proper tail calling guaranteed, as in Scheme.

  2. Names with punctuation in them, especially -, !, and ?. Of course, to do this you have to separate identifiers from operators with whitespace, which I know you don't like as a dictate.

  3. Tuples in the language, especially so you can do

       (x, y) := (y, x)
    
    and
       (quo, rem) := divmod(a, b)
    
    Why do functions take n values but return 1? I think it's mainly syntax.

  4. A with (Pascal), open (ML), or expose (Haahr) statement, to augment the current scope with members from an aggregate. For example,

       typedef struct { double x, y, z; } P3;
       double dist(P3 p) {
          expose (p)
             return sqrt((x * x) + (y * y) + (z * z));
       }
    
    would replace
       double dist(P3 p) {
          return sqrt((p.x * p.x) + (p.y * p.y) + (p.z * p.z));
       }
    

    Note that expose would evalue its argument only once.

  5. Structs that contain types. For example,

       typedef struct {
          typedef struct { ... } Path;
          typedef struct { ... } File;
          Path *root;
          Path *(*walk)(Path *from, char *name);
          File *(*open)(Path *path);
          ...
       } Filesystem;
    

    so I could declare something to be a Filesystem.Path. (Note that this is relevant to using structs as modules.)

  6. An exception mechanism and unwind-protect. Syntax for this is usually ugly, but remember, make exceptions just a normal type and the semantics will stay pretty clean. (On the other hand, exception handling can be added late to a language w/o disturbing the foundation too much. Of all the things you do, design the exceptions last, IMHO.)