Compilation

In order to simplify the implementation of D compilers, the compilation of D source is divided into several phases. Each phase may only depend on the phases that precede it. For instance, lexical analysis may not depend on syntactic analysis. The phases of compilation are as follows.

  1. Source Encoding. The text encoding of the source code is determined. 7-bit ASCII and all Unicode encodings are allowed.
  2. Script Line. If the first line of the source file begins with a "shebang" (#!) sequence, the line is ignored.
  3. Lexical Analysis. The source text is split into a sequence of tokens. Some special tokens are processed at this stage.
  4. Syntactic Analysis. The sequence of tokens is parsed into an abstract syntax tree according to the language's grammar.
  5. Semantic Analysis. This is by far the most complex phase. The syntax tree representing the program is checked for consistency according to the semantics of the language. This includes several sub-tasks, including (but not limited to):
    • Symbols: generating symbol tables that map from symbol names to program objects
    • Type checking: determining the types of all expressions; ensuring that types are used in accordance with the rules that govern them; performing type inference
    • Constant folding: evaluating expressions involving only constants at compile-time; performing compile-time function evaluation
    • Metacompiling: string mixins allow arbitrary string data to be treated as D code at compile-time, which must be tokenized, parsed etc. However, the semantic analysis is not fully performed before doing string mixins, so this must be performed in a "back-and-forth" fashion (compiling code, doing some semantic analysis on it, compiling its mixins, doing more semantic analysis etc.).
  6. [Optimization.] The program may optionally be rewritten in such a way to improve performance while preserving program semantics.
  7. Code Generation. The program is converted from its internal representation into some kind of executable format, be it native machine code, VM code, source code for another language etc.

Rationale

This strict ordering of phases with no forward dependencies is borne out of experience implementing compilers for C++, in which (for example) the lexical analysis cannot be fully performed until after syntactic analysis and some symbol resolution has been performed. This necessitates parsing code that "might" be right, keeping strange internal structures which must be rewritten according to the results of symbol resolution. Not only does it make it difficult to write a compiler, it also means tools that should be trivial (like code highlighters and beautifiers) are required to implement almost a full compiler to function properly, or else run the risk of functioning only "mostly correctly." A more extreme example would be Perl, in which the proper tokenization of some code may be ambiguous until runtime! With this simple model of compilation, compilers are easier to write, tools are kept simple, and the cognitive load on the programmer is reduced.

Questions

In what grammar class does D fall?
I'm pretty sure it's context-free. In the lexical phase, there is some minor lookahead needed with numeric literals to disambiguate decimal points versus slices, and the syntactic phase requires theoretically arbitrary lookahead (but in practice, usually no more than a couple tokens) to parse some declarations within function bodies. Other than that, it's very simple to parse.