|
I wrote this essay as private email to a friend -- the ``you'' in the
text -- who was embarking on a programming language design project, long
since abandoned. I've reformatted in HTML, fixed a couple of typos, but
otherwise left the content exactly as I found it.
I'm a little embarrassed about the simplicity and superficiality of
these recommendations, but I like how directly it caught my thinking at
the time: before I done any serious work on Dylan, before I'd heard of
Java, and before I'd written a realistic compiler.
| |
These are some notes I've put together as the user of many programming
languages, and the reader of descriptions of many more. I've never
designed a language well myself, so take everything with a grain of
salt.
-
To garbage collect, or not to garbage collect? That is the most
important question. We've talked a lot about this, but which ever way
you decide, you'll have to live with the consequences of your decision:
either you will have some measurable loss in efficiency or the language
will be unsafe.
[By the way, I said earlier that being able to take the address of a
variable did not necessarily make the language unsafe. I was wrong. If
you take the address of a stack variable but dereference it after the
procedure which owns the variable has exited, you are unsafe. That's
why pascal had reference parameters but no address-of operation. This
is probably the right thing in a garbage collected language.]
-
Minimize the number of ``syntactic domains.'' These are the little
languages within the language for doing different types of things. In
C, I'd identify at least three: the type and declaration syntax, the
expression syntax, and the statement syntax. (The preprocessor may be a
fourth, but nobody really counts that as part of the language, at least
when they're trying to be fair.) Scheme, on the other hand, has one
such domain: s-exprs.
It's almost certain that you need a syntactic domain for expressions.
Recently, I've become more and more that the statement/expression
distinction is unnecessary and gets in the way of doing the things you
want to do -- declare a local variable inside an expression or some
such. On the other hand, some questions are raised by not having
statements. For example, what value does a loop return? (IMHO,
void, but it's not that important.)
Another example is goto: how do you express it in an expression-only
language. Dylan (w/infix syntax) has a nice approach, the
block statement, which covers some cases of goto, where most of
the others are nominally covered by recursion. The block statement is
([] indicate optional things)
block ([exit-function])
e1;
e2;
e3[;]
end [block]
An arbitrary number of expressions are allowed, of course. The block
statement returns the value of the last expression under normal
circumstances. If the exit-function is named (the parens are not
optional), then a variable of that name is bound inside the block to a
function which, if called, returns from the block, returning its
argument as the value of the block. The function can be passed down the
call stack, but passing it up is an error. (It's effectively a
restricted form of call/cc.)
Types and declarations may deserve to be a separate syntactic domain,
especially in something like C where types are not first class and form
a separate semantic domain from values. However, one could imagine types
being declared with a syntax reminiscent of c exprssions. For example
main: func(int, tuple(int, pointer(pointer(char))));
instead of
int main(int, char **);
-
Minimize the number of ``semantic domains.'' These are the different
kinds of ``things'' in the language. (I would say value or object
instead of thing, but those words carry too much meaning.) In Scheme, I
would identify one semantic domain: values. In C, there are at least
two (types and values), and possibly more (goto labels, maybe include
files?). There's good reason to have a stratified type system, where
all computations involving types are guaranteed to complete at compile
time (static typing, to first order), but all sorts of things
(exceptions, module systems, processes) are often segregated from
values, and I can't understand why.
-
Make as many things as possible first-class objects. I can't decide this
is the same issue as the previous one. Basically, the question I would ask
myself is, whenever I introduce a new thing in the language that is named,
can it be assigned to variables, passed to and returned from functions, etc.
If not, why not?
For example, in cedar, modula, and oberon, modules are external to
programs: you declare what modules you import, what your module name is,
and so on, but is impossible to bind a module to a variable or ask what is
the type of this module. (You can ask about its interface, but you can't
declare a variable -- as opposed to a module -- which has that interface.)
At the same time, modules are not much different from structs.
By the way, this applies to types as well as values. Are there things
which are best described as types which are not types. For example,
eiffel equates classes and modules. All types can be declared inside a
file except the most important kind of type in the language, classes,
which must be files unto themselves. That is, every file is a class and
every class is a file. (It's actually a more workable concept than I'm
saying here, but Eiffel too much decides for you how it wants you to
work.)
In some ways, this is the ``objects all the way down'' question in
object oriented languages. In some O-O languages (Smalltalk, Self,
Eiffel v2 and later, Dylan), every value is an object; in others (C++,
CLOS, early Eiffel), things like ints and floats aren't objects, they're
a separate form of type. You can do some things with types, other
things with classes, and they intersect in some places.
A last example: C's goto labels are segregated from every other value
in the language. Why can't I pass one to a function or store it in a
variable? (c.f. Dylan's block, above.) GCC extends C by letting you
use labels as values.
-
Don't be afraid to do with libraries what other languages build in.
For example, even when designing a language around concurrency, it's ok
to use function calls to implement multiple processes rather than
special syntax and funky types. If your language is polymorphic,
something like a chan(t) is just a polymorphic type and doesn't
need any special handling. (Of course, the functions for actually doing
concurrency operations have to be special, but that doesn't mean they
can't appear to be like other functions.)
-
Think about whether you want to have explicit pointers at all, and
whether aggregate structures (structs, unions, objects, etc.) are
immediate or always pointers. Some things (inheritance, polymorphism)
are much easier if you encode all structs, etc, as pointers rather than
direct structures. I don't know of any hard data, but my intuition
tells me that explicit pointers and structs not as references (ala C) is
more efficient, but there are strong linguistic arguments in favor of
the ``everything is a pointer'' approach.
-
Completeness is not as important as I think it is. All modern
programming languages are Turing complete, so, in practice, there is no
computation they can not do. Given a simple procedural interface to
operating system services -- e.g., section 2 of the Unix manual --
programs in a language have the same ability as those of any other
language to control interaction with the outside world.
What's left is a fuzzy notion of completeness, that some things are very
convenient to do in one language and are extremely verbose in another.
That's a fact of life. There will always be some operation which your
language takes more characters and more work on behalf of the programmer
to do that C, or ML, or Eiffel, or Prolog, or APL, will do naturally and
as a built in operation.
Now, here's a list of nice features that are not essential to design,
but are things I like. Some are completely gratuitous, none require
much work to do. (except perhaps (a), which is the most important.)
-
Proper tail calling guaranteed, as in Scheme.
-
Names with punctuation in them, especially -,
!, and ?. Of course, to do this you have to
separate identifiers from operators with whitespace, which I know you
don't like as a dictate.
-
Tuples in the language, especially so you can do
(x, y) := (y, x)
and
(quo, rem) := divmod(a, b)
Why do functions take n values but return 1? I think it's mainly
syntax.
-
A with (Pascal), open (ML), or expose
(Haahr) statement, to augment the current scope with members from an
aggregate. For example,
typedef struct { double x, y, z; } P3;
double dist(P3 p) {
expose (p)
return sqrt((x * x) + (y * y) + (z * z));
}
would replace
double dist(P3 p) {
return sqrt((p.x * p.x) + (p.y * p.y) + (p.z * p.z));
}
Note that expose would evalue its argument only once.
-
Structs that contain types. For example,
typedef struct {
typedef struct { ... } Path;
typedef struct { ... } File;
Path *root;
Path *(*walk)(Path *from, char *name);
File *(*open)(Path *path);
...
} Filesystem;
so I could declare something to be a Filesystem.Path. (Note that this is
relevant to using structs as modules.)
-
An exception mechanism and unwind-protect. Syntax for this
is usually ugly, but remember, make exceptions just a normal type and
the semantics will stay pretty clean. (On the other hand, exception
handling can be added late to a language w/o disturbing the foundation
too much. Of all the things you do, design the exceptions last, IMHO.)
|