Lexical

This page documents the lexical aspects of the language. With the information presented here, it should be possible to create a tokenizer for D code.

Source Encoding

D source code can be in one of the following text encodings:

  • 7-bit ASCII
  • UTF-8
  • UTF-16 (Little- or Big-Endian)
  • UTF-32 (Little- or Big-Endian)

The encoding of the source file may be indicated by a BOM (byte-order mark) at the beginning of the source.

Format      BOM
UTF-8       EF BB BF
UTF-16LE    FF FE
UTF-16BE    FE FF
UTF-32LE    FF FE 00 00
UTF-32BE    00 00 FE FF

If no BOM is present, the source is assumed to be ASCII, and the first character of the source must be less than U+0080.

The text encoding used is of no concern to the actual process of tokenization. The text is decoded into Unicode Characters, which form the basis of the source code.
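As a rough illustration of the BOM rules in this section, the decoding step might be sketched in Python as follows. The name decode_source is hypothetical, and the codec names are Python's own. The UTF-32 entries are checked first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.

```python
# BOM table from this section; UTF-32 entries come first because the
# UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]


def decode_source(data: bytes) -> str:
    """Decode raw source bytes into Unicode characters (hypothetical helper)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Strip the BOM by hand; these codecs do not remove it themselves.
            return data[len(bom):].decode(encoding)
    # No BOM: the text above says ASCII is assumed.
    return data.decode("ascii")
```

After this step the tokenizer only ever sees a sequence of Unicode characters, whatever the original encoding was.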

Rationale

Unicode is the future of text encoding. Code pages and regional encodings are complex and outmoded. Digraphs and trigraphs are unpleasant to look at and have not acquired wide use.

RFC: why is the source assumed to be ASCII if there is no BOM? Why not assume UTF-8? Many editors that save text files in UTF-8 do not put a BOM, and assuming UTF-8 is perfectly safe if the file really is 7-bit ASCII (since UTF-8 is a proper superset of it).

Source

Source:
	[ShebangLine] SourceCharacter+ EndOfFile

ShebangLine:
	#! Character* EndOfLine

SourceCharacter:
	Whitespace
	Token
	SpecialTokenSequence

D source consists of whitespace, tokens, and special token sequences, all followed by an end-of-file marker.

D source may optionally begin with a shebang line, which is used on Unix systems to indicate a scripting host. This allows D source files to be used as scripts. If present, the shebang line must be the first line of the file, and it is ignored by everything beyond the lexical pass, much like a comment.
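A minimal sketch of shebang handling, in Python, could look like the following. The function name strip_shebang is hypothetical. Keeping the line ending in place is a design choice of this sketch (so that later line numbers stay correct), not something the grammar requires.

```python
def strip_shebang(source: str) -> str:
    """Drop a leading shebang line but keep its line ending (hypothetical helper)."""
    if not source.startswith("#!"):
        return source
    for i, ch in enumerate(source):
        if ch in "\r\n":
            # Leave the line ending so the next line is still line 2.
            return source[i:]
    return ""  # the shebang line was the whole file
```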

Character

You will see 'character' in various lexical rules (such as above, in the shebang line). In general, character means any character other than End-of-File or the character sequence that would end the current token. For example, in a double quoted string, character means anything except End-of-File or ", because " would terminate the string.

End-of-File

EndOfFile:
	physical end of file
	U+0000
	U+001A

When the tokenizer encounters any of these possibilities, the source text is considered to terminate. No further characters are consumed.

RFC: U+001A is otherwise known as Ctrl+Z and was used to indicate the end of files on CP/M. I have no idea why it's included in this list. It's probably never used anymore.

End-of-Line

EndOfLine:
	U+000A
	U+000D [U+000A]

Line endings can be any of the three major types (CR, LF, or CRLF). Line endings may be mixed within the same file. The end of the file is also considered a line-end.
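The end-of-file and end-of-line rules can be combined into one small routine. The Python sketch below uses a hypothetical name, split_lines; it cuts the source at the first end-of-file marker and then splits on CR, LF, or CRLF, allowing the styles to be mixed.

```python
def split_lines(source: str) -> list[str]:
    """Split source into lines, stopping at the first end-of-file marker."""
    # U+0000 and U+001A end the source just like the physical end of file.
    for i, ch in enumerate(source):
        if ch in "\x00\x1a":
            source = source[:i]
            break
    lines, current, i = [], [], 0
    while i < len(source):
        ch = source[i]
        if ch == "\r":
            # CR optionally followed by LF counts as a single line ending.
            if i + 1 < len(source) and source[i + 1] == "\n":
                i += 1
            lines.append("".join(current))
            current = []
        elif ch == "\n":
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    lines.append("".join(current))  # the end of the file also ends the last line
    return lines
```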

Whitespace

Whitespace:
	Space+

Space:
	U+0020
	U+0009
	U+000B
	U+000C
	EndOfLine
	Comment

Whitespace is not significant. It exists only to separate other tokens.

U+0020 is space and U+0009 is horizontal tab. U+000B is vertical tab and U+000C is form feed, which probably aren't used that much anymore, but they're harmless.

Comments

Comment:
	/* Character* */
	// Character* (EndOfLine | EndOfFile)
	NestingBlockComment

NestingBlockComment:
	/+ NestingBlockCommentCharacter* +/

NestingBlockCommentCharacter:
	Character
	NestingBlockComment

There are three kinds of comments in D:

  1. C-style block comments, which can span multiple lines but do not nest;
  2. C++-style line comments; and
  3. Nesting block comments.

The contents of string literals and comments are not tokenized, and therefore comments cannot terminate strings, nor can strings exist within or terminate comments. Since comments are treated as whitespace, they cannot be used to concatenate tokens as in some very old C compilers (for example, abc/**/def is two tokens abc and def, not a single token abcdef).
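The difference between the three comment kinds shows up most clearly in how a scanner finds the end of each. The following Python sketch uses a hypothetical helper, skip_comment, and assumes the source has already been cut at any end-of-file marker. Only the /+ +/ form keeps a nesting depth.

```python
def skip_comment(src: str, pos: int) -> int:
    """Return the index where scanning resumes after the comment at pos."""
    if src.startswith("//", pos):
        # Line comment: stop at the line ending (left in place) or at the end.
        while pos < len(src) and src[pos] not in "\r\n":
            pos += 1
        return pos
    if src.startswith("/*", pos):
        # Block comment: the first */ closes it, so these do not nest.
        end = src.find("*/", pos + 2)
        if end < 0:
            raise SyntaxError("unterminated /* comment")
        return end + 2
    if src.startswith("/+", pos):
        # Nesting block comment: track how many /+ are still open.
        depth, pos = 1, pos + 2
        while depth > 0:
            if pos >= len(src):
                raise SyntaxError("unterminated /+ comment")
            if src.startswith("/+", pos):
                depth, pos = depth + 1, pos + 2
            elif src.startswith("+/", pos):
                depth, pos = depth - 1, pos + 2
            else:
                pos += 1
        return pos
    raise ValueError("no comment starts at pos")
```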

WISH: Why have a separate syntax for nesting block comments? Why not just make slash-star comments nest?

Tokens

Token:
	Identifier
	StringLiteral+
	CharacterLiteral
	IntegerLiteral
	FloatLiteral
	Keyword
	/
	/=
	.
	..
	...
	&
	&=
	&&
	|
	|=
	||
	-
	-=
	--
	+
	+=
	++
	<
	<=
	<<
	<<=
	<>
	<>=
	>
	>=
	>>=
	>>>=
	>>
	>>>
	!
	!=
	!<>
	!<>=
	!<
	!<=
	!>
	!>=
	(
	)
	[
	]
	{
	}
	?
	,
	;
	:
	$
	=
	==
	*
	*=
	%
	%=
	^
	^=
	~
	~=

Tokens form the "meat" of the source code, and are the only lexical elements which have meaning beyond the lexical phase.

Multiple string literals in sequence are considered a single token. See #StringLiterals for more information.
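Several operator tokens are prefixes of longer ones (for instance > of >>, >>> and >>>=). The text above does not say how a tokenizer should choose between them; the Python sketch below assumes the usual longest-match rule. The names OPERATORS and match_operator are hypothetical, and comments would need to be recognized before this step so that /* and // are not read as operators.

```python
# Operator and punctuation tokens copied from the Token grammar above.
OPERATORS = """
/ /= . .. ... & &= && | |= || - -= -- + += ++ < <= << <<= <> <>= > >= >>= >>>= >> >>>
! != !<> !<>= !< !<= !> !>= ( ) [ ] { } ? , ; : $ = == * *= % %= ^ ^= ~ ~=
""".split()

# Try longer spellings first so that ">>>=" is not read as ">" ">" ">" "=".
_BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def match_operator(src: str, pos: int):
    """Return the longest operator starting at pos, or None (assumed rule)."""
    for op in _BY_LENGTH:
        if src.startswith(op, pos):
            return op
    return None
```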

Identifiers

Identifier:
	IdentifierStart IdentifierChar*

IdentifierStart:
	_
	Letter
	UniversalAlpha

IdentifierChar:
	IdentifierStart
	0
	NonZeroDigit

Identifier rules are similar to those found in most other C-style languages. Universal alphas are a class of Unicode characters defined in Annex D of the C99 specification, ISO/IEC 9899:1999(E).

Identifiers may be arbitrarily long and are case-sensitive.

Identifiers starting with two underscores ("__") are reserved. (RFC: Should compilers diagnose an error if they encounter such a reserved identifier, or is it more "hic sunt dracones" if you decide to use such an identifier?)
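The identifier grammar translates fairly directly into code. In the Python sketch below all function names are hypothetical, and str.isalpha() is only an approximation of Letter plus the universal alphas; the authoritative character list is the C99 annex cited above. Keywords are not excluded here.

```python
def is_identifier_start(ch: str) -> bool:
    # ASSUMPTION: str.isalpha() approximates Letter plus the universal alphas.
    return ch == "_" or ch.isalpha()


def is_identifier_char(ch: str) -> bool:
    return is_identifier_start(ch) or ch in "0123456789"


def is_identifier(text: str) -> bool:
    """Check text against the Identifier grammar (keywords are not excluded)."""
    return (
        len(text) > 0
        and is_identifier_start(text[0])
        and all(is_identifier_char(c) for c in text[1:])
    )


def is_reserved(text: str) -> bool:
    """Identifiers starting with two underscores are reserved."""
    return is_identifier(text) and text.startswith("__")
```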

String Literals

StringLiteral:
	WysiwygString
	AlternateWysiwygString
	DoubleQuotedString
	EscapeSequence
	HexString

WysiwygString:
	r" WysiwygCharacter* " [Postfix]

AlternateWysiwygString:
	` WysiwygCharacter* ` [Postfix]

WysiwygCharacter:
	Character
	EndOfLine

DoubleQuotedString:
	" DoubleQuotedCharacter* " [Postfix]

DoubleQuotedCharacter:
	Character
	EscapeSequence
	EndOfLine

EscapeSequence:
	\'
	\"
	\?
	\\
	\a
	\b
	\f
	\n
	\r
	\t
	\v
	\ EndOfFile
	\x HexDigit{2}
	\ OctalDigit{1,3}
	\u HexDigit{4}
	\U HexDigit{8}
	\& NamedCharacterEntity ;

HexString:
	x" HexStringChar* " [Postfix]

HexStringChar:
	HexDigit
	WhiteSpace

Postfix:
	c
	w
	d

There are four basic kinds of string literals: "normal" strings, WYSIWYG (what-you-see-is-what-you-get, also known as "verbatim") strings, escape strings, and hex strings. String literals that begin and end with quotes may span multiple lines. If a string spans multiple lines, the line-ends are embedded in the string as '\n' (U+000A) characters, regardless of what the line-end was in the original source.

As was mentioned in the #Tokens section, multiple string literals in a row are considered a single string literal. That is, something like:

"hello" " world" `!`

which consists of three separate lexical string literals, is collapsed into a single string literal token. It's equivalent to writing:

"hello world!"

String literals are read-only. Writing to string literals causes undefined behavior. Writing to them may be detected and flagged as an error if possible, but it is not required.
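The two rules described earlier in this section, line-end normalization and the collapsing of adjacent literals, can be sketched in Python as below. Both function names are hypothetical, and the inputs are assumed to be literal contents with the quotes already removed and escapes already processed.

```python
def normalize_line_ends(text: str) -> str:
    """Replace CRLF and lone CR inside a literal with a line feed (U+000A)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def join_adjacent(literals: list[str]) -> str:
    """Collapse a run of adjacent string literal values into one token value."""
    return "".join(normalize_line_ends(part) for part in literals)
```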

Normal String Literals

Normal string literals begin and end with double quotes (the " character). Escape sequences may be embedded within these strings which will be replaced by the appropriate character in the actual string data.

Escape Sequence            Replaced By   Comments
\'                         U+0027        Apostrophe (apostrophes may also appear unescaped)
\"                         U+0022        Double-quote
\?                         U+003F        Question Mark (RFC: why does this exist? there are no trigraphs)
\\                         U+005C        Backslash (used to output a literal backslash)
\a                         U+0007        Bell
\b                         U+0008        Backspace
\f                         U+000C        Form feed
\n                         U+000A        Line feed
\r                         U+000D        Carriage return
\t                         U+0009        Horizontal tab
\v                         U+000B        Vertical tab
\x HexDigit{2}             U+(digits)    the HexDigits are interpreted as a number
\ OctalDigit{1,3}          U+(digits)    the OctalDigits are interpreted as a number. May not exceed 255 (in decimal).
\u HexDigit{4}             U+(digits)    the HexDigits are interpreted as a number
\U HexDigit{8}             U+(digits)    the HexDigits are interpreted as a number
\& NamedCharacterEntity ;  varies        See Table of Entities? TODO: TABLE PLX
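A decoder for a single escape sequence might be sketched in Python as follows. The name decode_escape is hypothetical. Since the entity table is not given here, the sketch assumes that named character entities match the HTML names in Python's html.entities module; that is an assumption, not something this page states. The backslash-before-End-of-File case is not handled.

```python
import html.entities
import string

SIMPLE = {"'": "'", '"': '"', "?": "?", "\\": "\\", "a": "\a", "b": "\b",
          "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v"}
HEX_LENGTHS = {"x": 2, "u": 4, "U": 8}


def decode_escape(src: str, pos: int):
    """Decode the escape sequence whose backslash is at pos.

    Returns (character, index just past the sequence).
    """
    kind = src[pos + 1]
    if kind in SIMPLE:
        return SIMPLE[kind], pos + 2
    if kind in HEX_LENGTHS:
        count = HEX_LENGTHS[kind]
        end = pos + 2 + count
        digits = src[pos + 2:end]
        if len(digits) != count or any(c not in string.hexdigits for c in digits):
            raise SyntaxError("expected %d hex digits" % count)
        return chr(int(digits, 16)), end
    if kind in "01234567":
        # One to three octal digits; the value may not exceed 255.
        end = pos + 1
        while end < len(src) and end < pos + 4 and src[end] in "01234567":
            end += 1
        value = int(src[pos + 1:end], 8)
        if value > 255:
            raise SyntaxError("octal escape exceeds 255")
        return chr(value), end
    if kind == "&":
        end = src.index(";", pos)
        # ASSUMPTION: named entities follow the HTML names known to Python.
        return chr(html.entities.name2codepoint[src[pos + 2:end]]), end + 1
    raise SyntaxError("unknown escape sequence")
```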

WYSIWYG String Literals

WYSIWYG (what-you-see-is-what-you-get, also known as "verbatim") string literals are similar to normal string literals except that escape sequences are not processed, and therefore what you see in the string literal really is what you get. So for instance, if one writes r"\a", the resulting string will be two characters long, consisting of a backslash followed by a lowercase 'a'.

There are two ways to quote WYSIWYG strings: with double quotes preceded by a lowercase 'r', or with backticks:

"this is a normal string for comparison"
r"this is a wysiwyg string"
`this is also a wysiwyg string`

The backticks are useful for quoting WYSIWYG strings that contain double quotes.

Escape Strings

Escape strings are basically strings which consist only of escape sequences and exist outside of quotes. Examples:

\n
\r\n
\u0040

WISH: it seems like a ridiculous waste of the backslash character to use it for escape strings. I've never, in five years of using D, seen them used outside these few examples in the spec.

Hex Strings

Hex strings allow you to embed binary data as strings. The data in a hex string is not required to be valid Unicode data. Examples:

x"DEADBEEF" // four bytes, same as "\xDE\xAD\xBE\xEF"
x"D EAD BEE F" // same as above

Notice that whitespace is ignored, as are newlines. Any non-whitespace characters must be hexadecimal digits, and there must be an even number of them in the string.
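The hex string rules (whitespace ignored, hex digits only, even digit count) can be checked and decoded in a few lines. This Python sketch uses a hypothetical name, decode_hex_string, and takes the text between the opening x" and the closing quote.

```python
def decode_hex_string(body: str) -> bytes:
    """Decode the contents of a hex string into raw bytes (hypothetical helper)."""
    # Whitespace and newlines are ignored.
    digits = [ch for ch in body if ch not in " \t\v\f\r\n"]
    if any(ch not in "0123456789abcdefABCDEF" for ch in digits):
        raise SyntaxError("hex strings may contain only hex digits and whitespace")
    if len(digits) % 2 != 0:
        raise SyntaxError("hex strings need an even number of digits")
    # Each pair of digits forms one byte.
    return bytes(int(a + b, 16) for a, b in zip(digits[0::2], digits[1::2]))
```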

String Literal Postfixes

String literals may be used in contexts where their encoding is ambiguous (should they be UTF-8, UTF-16, or UTF-32?). In these cases, you can attach a character to the end of the string literal to force it to use a certain encoding. For example:

"hello"c // UTF-8 - char[]
"hello"w // UTF-16 - wchar[]
"hello"d // UTF-32 - dchar[]

Character Literals

CharacterLiteral:
	' (Character | EscapeSequence) '

Character literals represent a single character.

WISH: wouldn't it be nice to have c/w/d suffixes on character literals too?

Integer Literals

IntegerLiteral:
	Integer [IntegerSuffix]

Integer:
	Decimal
	Binary
	Octal
	Hexadecimal

IntegerSuffix:
	L
	u
	U
	Lu
	LU
	uL
	UL

Decimal:
	0
	NonZeroDigit DecimalDigitOrUnderscore*

Binary:
	0b _* BinaryDigit (BinaryDigit | _)*
	0B _* BinaryDigit (BinaryDigit | _)*

Octal:
	0 (OctalDigit | _)+

Hexadecimal:
	0x _* HexDigit HexDigitOrUnderscore*
	0X _* HexDigit HexDigitOrUnderscore*

NonZeroDigit:
	1
	2
	3
	4
	5
	6
	7
	8
	9

DecimalDigit:
	0
	NonZeroDigit

DecimalDigitOrUnderscore:
	DecimalDigit
	_

BinaryDigit:
	0
	1

OctalDigit:
	0
	1
	2
	3
	4
	5
	6
	7

HexDigit:
	DecimalDigit
	a
	b
	c
	d
	e
	f
	A
	B
	C
	D
	E
	F

HexDigitOrUnderscore:
	HexDigit
	_

Integers may not have redundant integer suffixes. For example, '0uU' and '0LL' are not lexically valid tokens.

Integer literals represent an integer value in several possible bases: decimal, hexadecimal, octal, and binary. Each is defined in a different manner:

  • Decimal integers are either a lone 0 or start with a digit other than 0, followed by the digits 0 through 9.
  • Hexadecimal integers start with '0x' or '0X' and are written using the digits 0 through 9, 'a' through 'f', and 'A' through 'F'.
  • Octal integers start with a '0' (zero) and are written using the digits 0 through 7.
  • Binary integers start with '0b' or '0B' and are written using the digits 0 and 1.

Underscores ('_') among the digits (that is, anywhere after the base prefix, if there is one) are ignored. They are useful for splitting up large numbers, for example as a thousands separator:

1_048_576
0xDEADBEEF_CAFEFACE
0_0 // O_O
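As a sketch of how the base prefixes and underscores combine, the following Python function computes the value of an unsuffixed integer literal. The name parse_integer is hypothetical, and the input is assumed to have already matched the Integer grammar above.

```python
def parse_integer(text: str) -> int:
    """Return the value of an unsuffixed integer literal (hypothetical helper)."""
    lower = text.lower()
    if lower.startswith("0x"):
        base, digits = 16, text[2:]
    elif lower.startswith("0b"):
        base, digits = 2, text[2:]
    elif len(text) > 1 and text[0] == "0":
        base, digits = 8, text  # a leading zero selects octal
    else:
        base, digits = 10, text
    # Underscores among the digits are ignored.
    return int(digits.replace("_", ""), base)
```

Note that under these rules 0_0 is an octal literal with the value zero.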

Types

The data type of an integer literal is determined by the following rules. The range syntax used here is square brackets - [ and ] - for inclusive ends, and parentheses - ( and ) - for exclusive ends.

Decimal Literals

In Range        Type
[0, 2^31)       int
[2^31, 2^63)    long

RFC: if an integer literal is in the range [2^63, 2^64), and it doesn't have a 'U' or 'UL' suffix, should that be an error or should it become a ulong? Is there any reason why it shouldn't be a ulong?

Hex/Octal/Binary Literals

In Range        Type
[0, 2^31)       int
[2^31, 2^32)    uint
[2^32, 2^63)    long
[2^63, 2^64)    ulong
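The two range tables can be expressed as a lookup, reading the bounds as powers of two (2^31, 2^32, 2^63, 2^64). The Python sketch below uses a hypothetical name, literal_type, and covers unsuffixed literals only. For values outside every listed range it returns None, because the tables leave that case open.

```python
def literal_type(value: int, decimal: bool):
    """Type of an unsuffixed integer literal, per the two range tables."""
    if decimal:
        ranges = [(2**31, "int"), (2**63, "long")]
    else:
        ranges = [(2**31, "int"), (2**32, "uint"), (2**63, "long"), (2**64, "ulong")]
    # Each entry is an exclusive upper bound; the first one that fits wins.
    for limit, name in ranges:
        if value < limit:
            return name
    return None  # outside every listed range; the tables leave this case open
```

For example, the value 2^31 is a long when written in decimal but a uint when written as 0x8000_0000.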

Integer Suffixes

Integer literals may have a suffix. The suffix constrains the data type of the literal by forcing it to be long, unsigned, or both, overriding the rules in the previous section where they would pick a different type. The changes relative to the tables above are listed in parentheses below. The possible suffixes are:

Suffix            Effect
U or u            Forces the literal to always be unsigned (int -> uint, long -> ulong)
L                 Forces the literal to always be long (int -> long, uint -> long)
UL, LU, uL, Lu    Forces the literal to always be ulong

Rationale

Although the unsigned suffix may be either upper- or lowercase, the long suffix must always be uppercase. This was chosen in the interest of readability. Many fonts do not distinguish sufficiently between lowercase L and the numeral 1, making it difficult to tell whether a literal ends in a letter or in a digit.

Floating Point Literals

FloatLiteral:
	Float [Suffix]
	Decimal FloatSuffix [ImaginarySuffix]
	DecimalDigit DecimalDigitOrUnderscore* [RealSuffix] ImaginarySuffix

Float:
	DecimalFloat
	HexFloat

DecimalFloat:
	DecimalDigit DecimalDigitOrUnderscore* . DecimalDigitOrUnderscore* [DecimalExponent]
	. DecimalDigit DecimalDigitOrUnderscore* [DecimalExponent]
	NonZeroDigit DecimalDigitOrUnderscore* DecimalExponent

DecimalExponent:
	(e|E)[+|-] DecimalDigit DecimalDigitOrUnderscore*

HexFloat:
	HexPrefix HexDigitOrUnderscore* . HexDigitOrUnderscore* HexExponent
	HexPrefix HexDigitOrUnderscore* HexExponent

HexPrefix:
	0x
	0X

HexExponent:
	(p|P)[+|-] DecimalDigit DecimalDigitOrUnderscore*

Suffix:
	FloatSuffix [ImaginarySuffix]
	RealSuffix [ImaginarySuffix]
	ImaginarySuffix

FloatSuffix:
	f
	F

RealSuffix:
	L

ImaginarySuffix:
	i

Floating Point Literals represent a floating point value in either decimal format or hexadecimal format.

  • Decimal floating point literals are a series of decimal digits (0 through 9), a single instance of '.', and then another series of decimal digits which comprise the fractional component. Optionally, an exponent can be defined using 'e' or 'E' followed by a signed integer value. If an exponent is present, the value of the literal is determined by raising 10 to the power of the exponent and multiplying by the numerical part. This is similar to scientific notation.
  • Decimal floating point literals may also be indicated by a float suffix ('f' or 'F') or imaginary suffix ('i').
  • Hexadecimal floating point literals start with '0x' or '0X', followed by a series of hexadecimal digits (0 through 9, 'a' through 'f', and 'A' through 'F'), optionally a single instance of '.' and another series of hexadecimal digits which comprise the fractional component. The exponent, which the grammar requires, is denoted by a 'p' or 'P' followed by a signed decimal integer giving the exponent in base 2. The value of the literal is determined by raising 2 to the power of the exponent and multiplying by the numerical part.

Just like with integer literals, any underscores ('_') among the digits are ignored.
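To make the hexadecimal form concrete, here is a Python sketch that computes the value of an unsuffixed hex float as "numerical part times 2 to the exponent". The name hex_float_value is hypothetical. It follows the HexExponent grammar above, in which the exponent is written with decimal digits, and assumes the input has already matched the HexFloat grammar.

```python
def hex_float_value(text: str) -> float:
    """Value of an unsuffixed hexadecimal float such as 0x1.8p1."""
    # Drop the 0x/0X prefix and any underscores.
    body = text[2:].replace("_", "").lower()
    mantissa, exponent = body.split("p")
    whole, _, fraction = mantissa.partition(".")
    # Each fractional hex digit divides the numerical part by 16.
    numerical = int(whole + fraction, 16) / 16 ** len(fraction)
    # The exponent is a signed decimal integer and a power of 2.
    return numerical * 2.0 ** int(exponent)
```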

Floating Point Suffixes

When no suffix is used, the value is typed as a double. Otherwise, several suffixes can be used to determine the data type used for the literal:

Suffix    Data Type
none      double
f, F      float
L         real
i         idouble
fi, Fi    ifloat
Li        ireal

It is an error if the value exceeds the limits for the type, but it is not an error if the value can be rounded to fit into the significant bits of the type.
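The suffix table can be applied mechanically by peeling the suffix off the end of a literal. In the Python sketch below the names SUFFIX_TYPES and split_float_suffix are hypothetical, and the input is assumed to be a well-formed floating point literal. Range checking against the limits of the type is not attempted.

```python
SUFFIX_TYPES = {"": "double", "f": "float", "L": "real",
                "i": "idouble", "fi": "ifloat", "Li": "ireal"}


def split_float_suffix(literal: str):
    """Split a floating point literal into (numeric part, type name)."""
    body, suffix = literal, ""
    if body.endswith("i"):
        # The imaginary suffix always comes last.
        body, suffix = body[:-1], "i"
    if body[-1:] in ("f", "F", "L"):
        # 'f' and 'F' are interchangeable; 'L' has no lowercase form.
        suffix = ("f" if body[-1] in "fF" else "L") + suffix
        body = body[:-1]
    return body, SUFFIX_TYPES[suffix]
```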

Complex numbers (e.g. 1.45 + 3.45i) are not determined in the lexical phase and are not tokens. They are assembled from a real and an imaginary component in the semantic phase.

Keywords

Keyword:
	abstract
	alias
	align
	asm
	assert
	auto

	body
	bool
	break
	byte

	case
	cast
	catch
	cdouble
	cent
	cfloat
	char
	class
	const
	continue
	creal

	dchar
	debug
	default
	delegate
	delete
	deprecated
	do
	double

	else
	enum
	export
	extern

	false
	final
	finally
	float
	for
	foreach
	foreach_reverse
	function

	goto

	idouble
	if
	ifloat
	import
	in
	inout
	int
	interface
	invariant
	ireal
	is

	lazy
	long

	macro
	mixin
	module

	new
	null

	out
	override

	package
	pragma
	private
	protected
	public

	real
	ref
	return

	scope
	short
	static
	struct
	super
	switch
	synchronized

	template
	this
	throw
	true
	try
	typedef
	typeid
	typeof

	ubyte
	ucent
	uint
	ulong
	union
	unittest
	ushort

	version
	void
	volatile

	wchar
	while
	with

Keywords are reserved identifiers.

Special Tokens

Several tokens exist that will be replaced with other tokens in the lexical phase:

Token            Replacement
__FILE__         string literal containing source file name
__LINE__         integer literal of the current source line number
__DATE__         string literal for the date of compilation in the format "mmm dd yyyy"
__TIME__         string literal for the time of compilation in the format "hh:mm:ss"
__TIMESTAMP__    string literal of the date and the time of compilation in the format "www mmm dd hh:mm:ss yyyy"
__VENDOR__       string literal of compiler vendor string
__VERSION__      integer literal of compiler version
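The replacements that do not depend on the compiler can be sketched in Python as below. The function name expand_special_token and its parameters are hypothetical, the strftime patterns are this sketch's reading of the format strings in the table, and how single-digit days are padded is an assumption. The vendor and version tokens are left out because their values are compiler-specific.

```python
from datetime import datetime


def expand_special_token(name: str, file: str, line: int, now: datetime):
    """Return the replacement value for a few special tokens (hypothetical helper)."""
    if name == "__FILE__":
        return file
    if name == "__LINE__":
        return line
    if name == "__DATE__":
        return now.strftime("%b %d %Y")  # "mmm dd yyyy"
    if name == "__TIME__":
        return now.strftime("%H:%M:%S")  # "hh:mm:ss"
    if name == "__TIMESTAMP__":
        return now.strftime("%a %b %d %H:%M:%S %Y")  # "www mmm dd hh:mm:ss yyyy"
    raise KeyError(name)
```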

Special Token Sequences

SpecialTokenSequence:
	#line Integer (EndOfLine | EndOfFile)
	#line Integer Filespec (EndOfLine | EndOfFile)

Filespec:
	" Character* "

Special token sequences are processed by the lexical analyzer, may appear between any other tokens, and do not affect the syntax parsing.

There is currently only one special token sequence: '#line'.

#line

This sets the source line number to Integer, and optionally the source file name to Filespec, beginning with the next line of source text. The source file name and line number are used for printing error messages and for mapping generated code back to the source in symbolic debugging output.

For example:

int #line 6 "foo\bar"
x;			// this is now line 6 of file foo\bar

Note that the backslash character is not treated specially inside Filespec strings, allowing paths on Windows (and other systems that use the backslash as the directory separator) to be inserted without having to worry about escape sequences.
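A small Python sketch of recognizing the directive follows. The names LINE_DIRECTIVE and parse_line_directive are hypothetical, and for brevity the sketch assumes the Integer is written in plain decimal, although the grammar allows the other bases as well. The file name is captured verbatim, so backslashes pass through untouched.

```python
import re

# ASSUMPTION: the Integer is written in plain decimal; the grammar allows other bases.
LINE_DIRECTIVE = re.compile(r'#line[ \t]+([0-9]+)(?:[ \t]+"([^"]*)")?[ \t]*$')


def parse_line_directive(text: str):
    """Parse text that starts at '#line' and runs to the end of that line.

    Returns (line number, file name or None). The file name is taken
    verbatim, since Filespec has no escape sequences.
    """
    match = LINE_DIRECTIVE.match(text)
    if match is None:
        raise SyntaxError("malformed #line directive")
    return int(match.group(1)), match.group(2)
```

A tokenizer would apply the returned values starting with the next line of source text, as described above.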