Lexical
This page documents the lexical aspects of the language. With the information presented here, it should be possible to create a tokenizer for D code.
Source Encoding
D source code can be in one of the following text encodings:
- 7-bit ASCII
- UTF-8
- UTF-16 (Little- or Big-Endian)
- UTF-32 (Little- or Big-Endian)
The encoding of the source file may be indicated by a BOM (byte-order mark) at the beginning of the source.
| Format | BOM |
| UTF-8 | EF BB BF |
| UTF-16LE | FF FE |
| UTF-16BE | FE FF |
| UTF-32LE | FF FE 00 00 |
| UTF-32BE | 00 00 FE FF |
If no BOM is present, the source is assumed to be ASCII, and the first character of the source must be less than U+0080.
The text encoding used is of no concern to the actual process of tokenization. The text is decoded into Unicode Characters, which form the basis of the source code.
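As a rough illustration of the BOM table above, a decoder could be sketched in Python as follows. The names `BOMS` and `decode_source` are hypothetical, and the no-BOM fallback simply follows this page's ASCII rule; this is a sketch, not how any particular compiler does it.

```python
# Hypothetical helper: map a leading BOM to a Python codec name.
# Longer BOMs come first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

def decode_source(raw):
    """Decode raw source bytes into a string of Unicode characters."""
    for bom, codec in BOMS:
        if raw.startswith(bom):
            # Strip the BOM, then decode the rest with the matching codec.
            return raw[len(bom):].decode(codec)
    # No BOM: this page says the source is assumed to be ASCII.
    return raw.decode("ascii")
```

Once the text is decoded, the rest of the tokenizer can work on Unicode characters without caring which encoding was used on disk.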
Rationale
Unicode is the future of text encoding. Code pages and regional encodings are complex and outmoded. Digraphs and trigraphs are unpleasant to look at and have not acquired wide use.
RFC: why is the source assumed to be ASCII if there is no BOM? Why not assume UTF-8? Many editors that save text files in UTF-8 do not put a BOM, and assuming UTF-8 is perfectly safe if the file really is 7-bit ASCII (since UTF-8 is a proper superset of it).
Source
Source:
[ShebangLine] SourceCharacter+ EndOfFile
ShebangLine:
#! Character* EndOfLine
SourceCharacter:
Whitespace
Token
SpecialTokenSequence
D source consists of whitespace, tokens, and special token sequences, all followed by an end-of-file marker.
D source may begin with a shebang line, which is used on Unix systems to indicate a scripting host. This allows D source files to be used as scripts. The shebang line is optional, must be the first line of the file, and is ignored by everything beyond the lexical pass; in that regard it is treated like a comment.
Character
The term 'character' appears in various lexical rules (such as the shebang line above). In general, it means any character other than End-of-File or the character sequence that would terminate the current token. For example, in a double-quoted string, character means anything except End-of-File or ", because " would terminate the string.
End-of-File
EndOfFile:
physical end of file
U+0000
U+001A
When the tokenizer encounters any of these possibilities, the source text is considered to terminate. No further characters are consumed.
RFC: U+001A is otherwise known as Ctrl+Z and was used to indicate the end of files on CP/M. I have no idea why it's included in this list. It's probably never used anymore.
End-of-Line
EndOfLine:
U+000A
U+000D [U+000A]
Line endings can be any of the three major types (CR, LF, or CRLF). Line endings may be mixed within the same file. The end of the file is also considered a line-end.
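The End-of-File and End-of-Line rules above can be sketched together in Python. The function name `split_lines` is hypothetical, and the sketch assumes the source has already been decoded to a string.

```python
EOF_CHARS = "\u0000\u001a"  # besides the physical end of the input

def split_lines(text):
    """Split decoded source into lines, stopping at the first EndOfFile."""
    lines, current, i = [], [], 0
    while i < len(text) and text[i] not in EOF_CHARS:
        ch = text[i]
        if ch == "\r":
            # CR optionally followed by LF counts as one line ending.
            if i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            lines.append("".join(current))
            current = []
        elif ch == "\n":
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    # The end of the file also ends the last line.
    lines.append("".join(current))
    return lines
```

Mixed line endings need no special handling here, since each ending is recognized on its own.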
Whitespace
Whitespace:
Space+
Space:
U+0020
U+0009
U+000B
U+000C
EndOfLine
Comment
Whitespace is not significant. It exists only to separate tokens.
U+0020 is space and U+0009 is horizontal tab. U+000B is vertical tab and U+000C is form feed, which probably aren't used that much anymore, but they're harmless.
Comments
Comment:
/* Character* */
// Character* (EndOfLine | EndOfFile)
NestingBlockComment
NestingBlockComment:
/+ NestingBlockCommentCharacter* +/
NestingBlockCommentCharacter:
Character
NestingBlockComment
There are three kinds of comments in D:
- C-style block comments, which can span multiple lines but do not nest;
- C++-style line comments; and
- Nesting block comments.
The contents of string literals and comments are not tokenized, and therefore comments cannot terminate strings, nor can strings exist within or terminate comments. Since comments are treated as whitespace, they cannot be used to concatenate tokens as in some very old C compilers (i.e. abc/**/def is two tokens abc and def, not a single token abcdef).
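To make the difference between the three comment kinds concrete, here is a minimal Python sketch of a comment skipper. The function name `skip_comment` is hypothetical. As a simplification, it leaves the line ending of a line comment unconsumed and does not treat U+0000 or U+001A as end of file.

```python
def skip_comment(src, pos):
    """Return the index just past the comment that starts at src[pos]."""
    if src.startswith("//", pos):
        # Line comment: runs up to (not including) the line ending.
        while pos < len(src) and src[pos] not in "\r\n":
            pos += 1
        return pos
    if src.startswith("/*", pos):
        # Block comment: ends at the first */, so it cannot nest.
        end = src.find("*/", pos + 2)
        if end < 0:
            raise ValueError("unterminated /* comment")
        return end + 2
    if src.startswith("/+", pos):
        # Nesting block comment: track how many /+ are still open.
        depth, pos = 1, pos + 2
        while depth > 0:
            if pos >= len(src):
                raise ValueError("unterminated /+ comment")
            if src.startswith("/+", pos):
                depth, pos = depth + 1, pos + 2
            elif src.startswith("+/", pos):
                depth, pos = depth - 1, pos + 2
            else:
                pos += 1
        return pos
    raise ValueError("no comment at this position")
```

Because the contents are skipped character by character, quotes inside a comment never start a string literal.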
WISH: Why have a separate syntax for nesting block comments? Why not just make slash-star comments nest?
Tokens
Token:
Identifier
StringLiteral+
CharacterLiteral
IntegerLiteral
FloatLiteral
Keyword
/
/=
.
..
...
&
&=
&&
|
|=
||
-
-=
--
+
+=
++
<
<=
<<
<<=
<>
<>=
>
>=
>>=
>>>=
>>
>>>
!
!=
!<>
!<>=
!<
!<=
!>
!>=
(
)
[
]
{
}
?
,
;
:
$
=
==
*
*=
%
%=
^
^=
~
~=
Tokens form the "meat" of the source code, and are the only lexical elements which have meaning beyond the lexical phase.
Multiple string literals in sequence are considered a single token. See #StringLiterals for more information.
Identifiers
Identifier:
IdentifierStart IdentifierChar*
IdentifierStart:
_
Letter
UniversalAlpha
IdentifierChar:
IdentifierStart
0
NonZeroDigit
Identifier rules are similar to those found in most other C-style languages. Universal alphas are a class of Unicode characters defined in Annex D of the C99 specification, ISO/IEC 9899:1999(E).
Identifiers may be arbitrarily long and are case-sensitive.
Identifiers starting with two underscores ("__") are reserved. (RFC: Should compilers diagnose an error if they encounter such a reserved identifier, or is it more "hic sunt dracones" if you decide to use such an identifier?)
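An ASCII-only approximation of the identifier rule can be sketched in Python. It assumes that Letter means the ASCII letters, and it leaves out UniversalAlpha entirely. The names `is_identifier` and `is_reserved` are hypothetical. Keywords are not excluded here.

```python
import re

# Assumption: Letter is A-Z / a-z; UniversalAlpha is not modelled.
IDENT = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

def is_identifier(text):
    """True if the whole text matches the (ASCII-only) Identifier rule."""
    return IDENT.fullmatch(text) is not None

def is_reserved(text):
    """True for identifiers that start with two underscores."""
    return is_identifier(text) and text.startswith("__")
```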
String Literals
StringLiteral:
WysiwygString
AlternateWysiwygString
DoubleQuotedString
EscapeSequence
HexString
WysiwygString:
r" WysiwygCharacter* " [Postfix]
AlternateWysiwygString:
` WysiwygCharacter* ` [Postfix]
WysiwygCharacter:
Character
EndOfLine
DoubleQuotedString:
" DoubleQuotedCharacter* " [Postfix]
DoubleQuotedCharacter:
Character
EscapeSequence
EndOfLine
EscapeSequence:
\'
\"
\?
\\
\a
\b
\f
\n
\r
\t
\v
\ EndOfFile
\x HexDigit{2}
\ OctalDigit{1,3}
\u HexDigit{4}
\U HexDigit{8}
\& NamedCharacterEntity ;
HexString:
x" HexStringChar* " [Postfix]
HexStringChar:
HexDigit
Whitespace
Postfix:
c
w
d
There are four basic kinds of string literals: "normal" strings, WYSIWYG (what-you-see-is-what-you-get, also known as "verbatim") strings, escape strings, and hex strings. String literals that begin and end with quotes may span multiple lines. If a string spans multiple lines, the line-ends are embedded in the string as '\n' (U+000A) characters, regardless of what the line-end was in the original source.
As was mentioned in the #Tokens section, multiple string literals in a row are considered a single string literal. That is, something like:
"hello" " world" `!`
which consists of three separate lexical string literals, is collapsed into a single string literal token. It's equivalent to writing:
"hello world!"
String literals are read-only. Writing to string literals causes undefined behavior. Writing to them may be detected and flagged as an error if possible, but it is not required.
Normal String Literals
Normal string literals begin and end with double quotes (the " character). Escape sequences may be embedded within these strings which will be replaced by the appropriate character in the actual string data.
| Escape Sequence | Replaced By | Comments |
| \' | U+0027 | Apostrophe (apostrophes may also appear unescaped) |
| \" | U+0022 | Double-quote |
| \? | U+003F | Question Mark (RFC: why does this exist? there are no trigraphs) |
| \\ | U+005C | Backslash (used to output a literal backslash) |
| \a | U+0007 | Bell |
| \b | U+0008 | Backspace |
| \f | U+000C | Form feed |
| \n | U+000A | Line feed |
| \r | U+000D | Carriage return |
| \t | U+0009 | Horizontal tab |
| \v | U+000B | Vertical tab |
| \x HexDigit{2} | U+(digits) | the HexDigits are interpreted as a number |
| \ OctalDigit{1,3} | U+(digits) | the OctalDigits are interpreted as a number. May not exceed 255 (in decimal). |
| \u HexDigit{4} | U+(digits) | the HexDigits are interpreted as a number |
| \U HexDigit{8} | U+(digits) | the HexDigits are interpreted as a number |
| \& NamedCharacterEntity ; | varies | See Table of Entities? TODO: TABLE PLX |
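The table above can be turned into a small decoder. The following Python sketch uses the hypothetical name `decode_escape`. It follows the table literally, so each numeric escape becomes the character U+(digits). Named character entities are left out because their table is not given here.

```python
import string

SIMPLE = {"'": "'", '"': '"', "?": "?", "\\": "\\", "a": "\a", "b": "\b",
          "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}
HEX_LEN = {"x": 2, "u": 4, "U": 8}  # number of hex digits required

def decode_escape(s, pos):
    """Decode the escape whose backslash is at s[pos].

    Returns (character, index just past the escape)."""
    if pos + 1 >= len(s) or s[pos] != "\\":
        raise ValueError("not an escape sequence")
    c = s[pos + 1]
    if c in SIMPLE:
        return SIMPLE[c], pos + 2
    if c in HEX_LEN:
        n = HEX_LEN[c]
        digits = s[pos + 2 : pos + 2 + n]
        if len(digits) != n or any(d not in string.hexdigits for d in digits):
            raise ValueError("bad hex escape")
        return chr(int(digits, 16)), pos + 2 + n
    if c in "01234567":
        # One to three octal digits; the value may not exceed 255.
        end = pos + 1
        while end < len(s) and end < pos + 4 and s[end] in "01234567":
            end += 1
        value = int(s[pos + 1 : end], 8)
        if value > 255:
            raise ValueError("octal escape exceeds 255")
        return chr(value), end
    raise ValueError("unsupported escape (named entities not handled)")
```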
WYSIWYG String Literals
WYSIWYG (what-you-see-is-what-you-get, also known as "verbatim") string literals are similar to normal string literals except that escape sequences are not processed, and therefore what you see in the string literal really is what you get. So for instance, if one writes r"\a", the resulting string will be two characters long, consisting of a backslash followed by a lowercase 'a'.
There are two ways to quote WYSIWYG strings: with double quotes opening with a lowercase 'r', or with backticks:
"this is a normal string for comparison" r"this is a wysiwyg string" `this is also a wysiwyg string`
The backticks are useful for quoting WYSIWYG strings that contain double quotes.
Escape Strings
Escape strings are basically strings which consist only of escape sequences and exist outside of quotes. Examples:
\n
\r\n
\u0040
WISH: it seems like a ridiculous waste of the backslash character to use it for escape strings. I've never, in five years of using D, seen them used outside these few examples in the spec.
Hex Strings
Hex strings allow you to embed binary data as strings. The data in a hex string is not required to be valid Unicode data. Examples:
x"DEADBEEF" // four bytes, same as "\xDE\xAD\xBE\xEF"
x"D EAD BEE F" // same as above
Notice that whitespace is ignored, as are newlines. Any non-whitespace characters must be hexadecimal digits, and there must be an even number of them in the string.
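A minimal Python sketch of these rules, assuming the caller has already stripped the leading x" and the closing quote, might look like this. The name `decode_hex_string` is hypothetical, and comments inside the string are not handled.

```python
import string

WHITESPACE = " \t\v\f\r\n"  # the Space characters, without comments

def decode_hex_string(body):
    """Turn the text between x\" and \" into raw bytes."""
    digits = [c for c in body if c not in WHITESPACE]
    if any(c not in string.hexdigits for c in digits):
        raise ValueError("non-hex character in hex string")
    if len(digits) % 2 != 0:
        raise ValueError("hex string needs an even number of digits")
    return bytes.fromhex("".join(digits))
```

The result is a byte string rather than text, since the data is not required to be valid Unicode.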
String Literal Postfixes
String literals may be used in contexts where their encoding is ambiguous (should they be UTF-8, UTF-16, or UTF-32?). In these cases, you can attach a character to the end of the string literal to force it to use a certain encoding. For example:
"hello"c // UTF-8 - char[] "hello"w // UTF-16 - wchar[] "hello"d // UTF-32 - dchar[]
Character Literals
CharacterLiteral:
' (Character | EscapeSequence) '
Character literals represent a single character.
WISH: wouldn't it be nice to have c/w/d suffixes on character literals too?
Integer Literals
IntegerLiteral:
Integer [IntegerSuffix]
Integer:
Decimal
Binary
Octal
Hexadecimal
IntegerSuffix:
L
u
U
Lu
LU
uL
UL
Decimal:
0
NonZeroDigit DecimalDigitOrUnderscore*
Binary:
0b _* BinaryDigit (BinaryDigit | _)*
0B _* BinaryDigit (BinaryDigit | _)*
Octal:
0 (OctalDigit | _)+
Hexadecimal:
0x _* HexDigit HexDigitOrUnderscore*
0X _* HexDigit HexDigitOrUnderscore*
NonZeroDigit:
1
2
3
4
5
6
7
8
9
DecimalDigit:
0
NonZeroDigit
DecimalDigitOrUnderscore:
DecimalDigit
_
BinaryDigit:
0
1
OctalDigit:
0
1
2
3
4
5
6
7
HexDigit:
DecimalDigit
a
b
c
d
e
f
A
B
C
D
E
F
HexDigitOrUnderscore:
HexDigit
_
Integers may not have redundant integer suffixes. For example, '0uU' and '0LL' are not lexically valid tokens.
Integer literals represent an integer value in several possible bases: decimal, hexadecimal, octal, and binary. Each is defined in a different manner:
- Decimal integers start with a digit other than 0, followed by the digits 0 through 9. A lone 0 is also a decimal integer.
- Hexadecimal integers start with '0x' or '0X' and are written using the digits 0 through 9, 'a' through 'f', and 'A' through 'F'.
- Octal integers start with a '0' (zero) and are written using the digits 0 through 7.
- Binary integers start with '0b' or '0B' and are written using the digits 0 and 1.
Any underscores ('_') appearing among the digits (that is, after the base prefix) are ignored. They are useful for splitting up large numbers, for example as a thousands separator:
1_048_576
0xDEADBEEF_CAFEFACE
0_0 // O_O
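The grammar and the underscore rule can be combined into a short Python sketch. The names `INT_RE` and `parse_integer` are hypothetical. The function returns the value, the base, and the suffix of a single IntegerLiteral.

```python
import re

# One alternative per base, mirroring the Integer grammar above.
INT_RE = re.compile(r"""
    (?P<body> 0[bB]_*[01][01_]*
            | 0[xX]_*[0-9a-fA-F][0-9a-fA-F_]*
            | 0[0-7_]+
            | 0
            | [1-9][0-9_]* )
    (?P<suffix> Lu|LU|uL|UL|L|u|U )?
""", re.VERBOSE)

def parse_integer(text):
    """Return (value, base, suffix) for one integer literal."""
    m = INT_RE.fullmatch(text)
    if m is None:
        raise ValueError("not an integer literal: " + text)
    raw = m.group("body")
    prefix = raw[:2].lower()
    if prefix == "0x":
        base, digits = 16, raw[2:]
    elif prefix == "0b":
        base, digits = 2, raw[2:]
    elif len(raw) > 1 and raw[0] == "0":
        base, digits = 8, raw[1:]
    else:
        base, digits = 10, raw
    # Underscores carry no value; an octal such as 0_ is still zero.
    digits = digits.replace("_", "") or "0"
    return int(digits, base), base, m.group("suffix") or ""
```

Redundant suffixes such as 0uU fail to match as a whole, so they are rejected with a ValueError.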
Types
The data type of an integer literal is determined by the following rules. The range syntax used here is square brackets - [ and ] - for inclusive ends, and parentheses - ( and ) - for exclusive ends.
Decimal Literals
| In Range | Type |
| [0, 2^31) | int |
| [2^31, 2^63) | long |
RFC: if an integer literal is in the range [2^63, 2^64), and it doesn't have a 'U' or 'UL' suffix, should that be an error or should it become a ulong? Is there any reason why it shouldn't be a ulong?
Hex/Octal/Binary Literals
| In Range | Type |
| [0, 2^31) | int |
| [2^31, 2^32) | uint |
| [2^32, 2^63) | long |
| [2^63, 2^64) | ulong |
Integer Suffixes
Integer literals may have a suffix. The suffix constrains the data type of the literal by forcing it to be long, unsigned, or both, overriding the rules in the previous section where they would give a different type. The changes relative to the tables above are listed in parentheses below. The possible suffixes are:
| Suffix | Effect |
| U or u | Forces the literal to always be unsigned (int -> uint, long -> ulong) |
| L | Forces the literal to always be long (int -> long, uint -> long) |
| UL, LU, uL, Lu | Forces the literal to always be ulong |
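Read literally, the range tables and the suffix table can be sketched as one Python function. The name `literal_type` is hypothetical. Two points are assumptions, since this page leaves them open: decimal literals of 2^63 or more are rejected, and an L suffix on a literal that is already ulong leaves it ulong.

```python
def literal_type(value, base, suffix=""):
    """Type name of an integer literal, following the tables above."""
    if base == 10:
        ranges = [("int", 2**31), ("long", 2**63)]
    else:
        ranges = [("int", 2**31), ("uint", 2**32),
                  ("long", 2**63), ("ulong", 2**64)]
    for name, limit in ranges:
        if value < limit:
            break
    else:
        # Assumption: out-of-table values are an error (see the RFC above).
        raise ValueError("literal out of range")
    if "L" in suffix:
        # L: int -> long, uint -> long
        name = {"int": "long", "uint": "long"}.get(name, name)
    if "u" in suffix.lower():
        # U or u: int -> uint, long -> ulong
        name = {"int": "uint", "long": "ulong"}.get(name, name)
    return name
```

Applying the L step before the U step makes every combined suffix (UL, LU, uL, Lu) come out as ulong, as the last table row requires.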
Rationale
Although the unsigned suffix may be either upper- or lowercase, the long suffix must always be uppercase. This was chosen in the interest of readability. Many fonts do not distinguish sufficiently between lowercase L and the numeral 1, making it difficult to tell whether a literal ends in a letter or in a digit.
Floating Point Literals
FloatLiteral:
Float [Suffix]
Decimal FloatSuffix [ImaginarySuffix]
DecimalDigit DecimalDigitOrUnderscore* [RealSuffix] ImaginarySuffix
Float:
DecimalFloat
HexFloat
DecimalFloat:
DecimalDigit DecimalDigitOrUnderscore* . DecimalDigitOrUnderscore* [DecimalExponent]
. DecimalDigit DecimalDigitOrUnderscore* [DecimalExponent]
NonZeroDigit DecimalDigitOrUnderscore* DecimalExponent
DecimalExponent:
(e|E)[+|-] DecimalDigit DecimalDigitOrUnderscore*
HexFloat:
HexPrefix HexDigitOrUnderscore* . HexDigitOrUnderscore* HexExponent
HexPrefix HexDigitOrUnderscore* HexExponent
HexPrefix:
0x
0X
HexExponent:
(p|P)[+|-] DecimalDigit DecimalDigitOrUnderscore*
Suffix:
FloatSuffix [ImaginarySuffix]
RealSuffix [ImaginarySuffix]
ImaginarySuffix
FloatSuffix:
f
F
RealSuffix:
L
ImaginarySuffix:
i
Floating Point Literals represent a floating point value in either decimal format or hexadecimal format.
- Decimal floating point literals are a series of decimal digits (0 through 9), a single instance of '.', and then another series of decimal digits which comprise the fractional component. Optionally, an exponent can be given using 'e' or 'E' followed by an optionally signed series of decimal digits. If an exponent is present, the value of the literal is determined by raising 10 to the power of the exponent and multiplying by the numerical part. This is similar to scientific notation.
- A decimal integer followed by a float suffix ('f' or 'F') or an imaginary suffix ('i') is also a floating point literal.
- Hexadecimal floating point literals start with '0x' or '0X', followed by a series of hexadecimal digits (0 through 9, 'a' through 'f', and 'A' through 'F'), optionally a single instance of '.' and another series of hexadecimal digits which comprise the fractional component. The exponent is required. It is denoted by a 'p' or 'P' followed by an optionally signed series of decimal digits, and it is a power of 2: the value of the literal is determined by raising 2 to the power of the exponent and multiplying by the numerical part.
Just like with integer literals, any underscores ('_') appearing among the digits are ignored.
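The hexadecimal case is the less familiar one, so here is a small Python sketch of how its value could be computed. The name `hex_float_value` is hypothetical. The input is assumed to be a valid HexFloat with no suffix and at least one mantissa digit.

```python
import re

def hex_float_value(literal):
    """Value of a hex float literal such as 0x1.8p1 (no suffix)."""
    # Drop underscores and the 0x / 0X prefix.
    text = literal.replace("_", "")[2:]
    mantissa, exponent = re.split("[pP]", text)
    whole, _, frac = mantissa.partition(".")
    # Each fractional hex digit divides the mantissa by 16.
    numeric = int(whole + frac, 16) / 16 ** len(frac)
    # The exponent is written in decimal and is a power of 2.
    return numeric * 2.0 ** int(exponent)
```

For example, 0x1.8p1 has a numerical part of 1.5 and an exponent of 1, giving 1.5 times 2.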
Floating Point Suffixes
When no suffix is used, the value is typed as a double. Otherwise, several suffixes can be used to determine the data type used for the literal:
| Suffix | Data Type |
| none | double |
| f, F | float |
| L | real |
| i | idouble |
| fi, Fi | ifloat |
| Li | ireal |
It is an error if the value exceeds the limits for the type, but it is not an error if the value can be rounded to fit into the significant bits of the type.
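The suffix table can be read off mechanically. The Python sketch below uses the hypothetical name `float_literal_type` and assumes its input is already a valid FloatLiteral. A trailing 'f' on a hex float can only be a suffix, because the mandatory exponent is written in decimal digits.

```python
def float_literal_type(literal):
    """Type name of a floating point literal, per the suffix table."""
    imaginary = literal.endswith("i")
    core = literal[:-1] if imaginary else literal
    if core.endswith(("f", "F")):
        base = "float"
    elif core.endswith("L"):
        base = "real"
    else:
        base = "double"
    # An imaginary suffix selects the matching imaginary type.
    return "i" + base if imaginary else base
```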
Complex numbers (e.g. 1.45 + 3.45i) are not determined in the lexical phase and are not tokens. They are assembled from a real and an imaginary component in the semantic phase.
Keywords
Keyword: abstract alias align asm assert auto body bool break byte case cast catch cdouble cent cfloat char class const continue creal dchar debug default delegate delete deprecated do double else enum export extern false final finally float for foreach foreach_reverse function goto idouble if ifloat import in inout int interface invariant ireal is lazy long macro mixin module new null out override package pragma private protected public real ref return scope short static struct super switch synchronized template this throw true try typedef typeid typeof ubyte ucent uint ulong union unittest ushort version void volatile wchar while with
Keywords are reserved identifiers.
Special Tokens
Several tokens exist that will be replaced with other tokens in the lexical phase:
| Token | Replacement |
| __FILE__ | string literal containing source file name |
| __LINE__ | integer literal of the current source line number |
| __DATE__ | string literal for the date of compilation in the format "mmm dd yyyy" |
| __TIME__ | string literal for the time of compilation in the format "hh:mm:ss" |
| __TIMESTAMP__ | string literal of the date and the time of compilation in the format "www mmm dd hh:mm:ss yyyy" |
| __VENDOR__ | string literal of compiler vendor string |
| __VERSION__ | integer literal of compiler version |
Special Token Sequences
SpecialTokenSequence:
#line Integer (EndOfLine | EndOfFile)
#line Integer Filespec (EndOfLine | EndOfFile)
Filespec:
" Character* "
Special token sequences are processed by the lexical analyzer, may appear between any other tokens, and do not affect the syntax parsing.
There is currently only one special token sequence: '#line'.
#line
This sets the source line number to Integer, and optionally the source file name to Filespec, beginning with the next line of source text. The source file name and line number are used for printing error messages and for mapping generated code back to the source for the symbolic debugging output.
For example:
int #line 6 "foo\bar"
x; // this is now line 6 of file foo\bar
Note that the backslash character is not treated specially inside Filespec strings, allowing paths on Windows (and other systems that use the backslash as the directory separator) to be inserted without having to worry about escape sequences.
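A minimal Python sketch of recognizing this sequence is shown below. The names `LINE_RE` and `parse_line_directive` are hypothetical. As a simplification, only decimal line numbers are accepted, and the input is the text from #line up to, but not including, the line ending.

```python
import re

# The Filespec body is taken verbatim: backslashes are not escapes.
LINE_RE = re.compile(r'#line[ \t]+([0-9]+)(?:[ \t]+"([^"]*)")?[ \t]*')

def parse_line_directive(text):
    """Return (line_number, filespec or None), or None if no match."""
    m = LINE_RE.fullmatch(text)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)
```

A tokenizer would apply the returned line number and file name starting with the next line of source text.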