
# Treerack Syntax Definition Language

The Treerack library uses a custom grammar description language derived from EBNF (Extended Backus-Naur Form).
It allows for the concise definition of recursive descent parsers.

A syntax file consists of a series of Production Rules (definitions), terminated by semicolons.

## Production Rules

A rule assigns a name to a pattern expression. Rules may include optional flags to modify the parser's behavior
or the resulting AST (Abstract Syntax Tree).

```
RuleName = Expression;
RuleName:flag1:flag2 = Expression;
```

## Flags

Flags are appended to the rule name, separated by colons. They control AST generation, whitespace handling, and
error propagation.

- `alias`: Transparent Node. The rule validates input but does not create its own node in the AST. Child
  nodes (if any) are attached to the parent of this rule.
- `ws`: Global Whitespace. Marks this rule as the designated whitespace handler. The parser will attempt to
  match (and discard) this rule between tokens throughout the entire syntax.
- `nows`: No Whitespace. Disables automatic whitespace skipping inside this rule. Useful for defining tokens
  like string literals where spaces are significant.
- `root`: Entry Point. Explicitly marks the rule as the starting point of the syntax. If omitted, the last
  defined rule is implied to be the root.
- `kw`: Keyword. Marks the text matched by this rule as a reserved keyword, which rules flagged `nokw` refuse
  to match.
- `nokw`: No Keyword. Prevents the rule from matching text that matches a defined `kw` rule. Essential for
  distinguishing identifiers from keywords (e.g., ensuring `var` is not parsed as a variable name).
- `failpass`: Pass Failure. If this rule fails to parse, the error is reported as a failure of the parent rule,
  not this specific rule.
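
As a rough illustration of how the flags above might be combined, here is an untested sketch. All rule names are hypothetical, and it uses only constructs described in this document:

```
// Hypothetical rule names; a sketch, not a verified syntax.
space:ws            = " ";            // matched and discarded between tokens
keyword:kw          = "var";          // reserved word
identifier:nokw:nows = [a-z]+;        // letters only, no skipped spaces, never a keyword
quoted:nows         = "'" [^']* "'";  // spaces inside the quotes are significant
item:alias          = keyword | identifier | quoted; // no AST node of its own
document:root       = item*;          // explicit entry point
```

With `item` flagged as `alias`, the `keyword`, `identifier`, and `quoted` nodes would be attached directly to `document`.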

## Expressions

Expressions define the structure of the text to be parsed. They are composed of terminals, sequences, choices,
and quantifiers.

## Terminals

Terminals match specific characters or strings in the input.

- `"abc"` (string): Matches an exact sequence of characters.
- `.` (any char): Matches any single character (wildcard).
- `[123]`, `[a-z]`, `[123a-z]` (class): Matches a single character from a set or range.
- `[^123]`, `[^a-z]`, `[^123a-z]` (not class): Matches any single character not in the set.
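
A small sketch of each terminal kind used in a rule. The rule names are hypothetical, and the `[0-9]` range is assumed to work by analogy with `[a-z]`:

```
// Hypothetical rules, one per terminal kind.
greeting = "hello";   // exactly the characters h, e, l, l, o
anychar  = .;         // any single character
digit    = [0-9];     // one character in the range 0 to 9
notquote = [^'];      // any single character except '
```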

## Quantifiers

Quantifiers determine how many times an item must match. They are placed immediately after the item they modify.

- `?`: Optional (Zero or one).
- `*`: Zero or more.
- `+`: One or more.
- `{n}`: Exact count. Matches exactly n times.
- `{n,}`: At least. Matches n or more times.
- `{,m}`: At most. Matches between 0 and m times.
- `{n,m}`: Range. Matches between n and m times.
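
The quantifiers above could be applied to a character class as follows. This is an illustrative sketch with hypothetical rule names:

```
// Hypothetical rules, one per quantifier form.
sign    = "-"?;         // zero or one "-"
letters = [a-z]*;       // zero or more lowercase letters
digits  = [0-9]+;       // one or more digits
year    = [0-9]{4};     // exactly four digits
atleast = [0-9]{2,};    // two or more digits
atmost  = [0-9]{,3};    // zero to three digits
between = [0-9]{2,4};   // two to four digits
```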

## Composites

Complex patterns are built by combining terminals and other rules.

### 1. Sequences

Items written consecutively are matched in order.

```
// Matches "A", then "B", then "C"
MySequence = "A" "B" "C";
```

### 2. Grouping

Parentheses `(...)` group items together, allowing quantifiers to apply to the entire group.

```
// Matches "AB", "ABAB", "ABABAB"...
MyGroup = ("A" "B")+;
```

### 3. Choices

The pipe character `|` represents a choice between alternatives.

The parser evaluates all provided options against the input at the current position and selects the best match
based on the following priority rules:

1. _Longest Match_: The option that consumes the largest number of characters takes priority. This eliminates the
need to manually order specific matches before general ones (e.g., "integer" will always be chosen over "int" if
the input supports it, regardless of their order in the definition).
2. _First Definition Wins_: If multiple options consume the exact same number of characters, the option defined
first (left-most) in the list takes priority.

```
// Longest match wins automatically:
// Input "integer" is matched by the "integer" option, even though "int" comes first.
type = "int" | "integer";

// Tie-breaker rule:
// If input is "foo", both options match 3 characters.
// Because 'keyword' is listed first, it takes priority over 'identifier'.
// (Use :kw and :nokw to control such situations, where applicable.)
content = keyword | identifier;
```

## Comments

Comments follow C-style syntax and are ignored by the definition parser.

- Line comments: Start with `//` and end at the newline.
- Block comments: Enclosed in `/* ... */`.
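
A brief sketch of both comment styles inside a definition. The rule names are hypothetical. Block comments spanning several lines and trailing comments are assumed to behave as in C:

```
// A line comment runs to the end of the line.
digit = [0-9];

/* A block comment
   may span several lines (assumed, as in C). */
number = digit+;  // a trailing comment after a rule
```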

## Examples

- [JSON](examples/json.treerack)
- [Scheme](examples/scheme.treerack)
- [Treerack (itself)](../syntax.treerack)