# Treerack Syntax Definition Language
The Treerack library uses a custom grammar description language derived from EBNF (Extended Backus-Naur Form).
It allows for the concise definition of recursive descent parsers.
A syntax file consists of a series of production rules (definitions), each terminated by a semicolon.
## Production rules
A rule assigns a name to a pattern expression. Rules may include optional flags to modify the parser's behavior
or the resulting AST (Abstract Syntax Tree).
```
rule-name = expression;
rule-name:flag1:flag2 = expression;
```
## Flags
Flags are appended to the rule name, separated by colons. They control AST generation, whitespace handling, and
error propagation.
- `alias`: transparent node. The rule validates input but does not create its own node in the AST. Child
nodes (if any) are attached to the parent of this rule.
- `ws`: global whitespace. Marks this rule as the designated whitespace handler. The parser will attempt to
match (and discard) this rule between tokens throughout the entire syntax.
- `nows`: no whitespace. Disables automatic whitespace skipping inside this rule. Useful for defining tokens
like string literals where spaces are significant. The `nows` flag is applied automatically to character sequences
like `"abc"` or `[abc]+`.
- `root`: entry point. Explicitly marks the rule as the starting point of the syntax. If omitted, the last
defined rule is implied to be the root.
- `kw`: keyword. Marks the content as a reserved keyword.
- `nokw`: no keyword. Prevents the rule from matching text that matches a defined `kw` rule. Essential for
distinguishing identifiers from keywords (e.g., ensuring `var` is not parsed as a variable name).
- `failpass`: pass failure. If this rule fails to parse, the error is reported as a failure of the parent rule,
not this specific rule.
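As a rough illustration of how several flags might combine, here is a minimal sketch of a syntax for declarations such as `var x`. The rule names (`space`, `var-keyword`, `identifier`, `declaration`) are hypothetical and chosen only for this example, and the behavior noted in the comments is what the flag descriptions above suggest, not a verified result.

```
// hypothetical rules, for illustration only
space:ws         = " ";                      // skipped between tokens
var-keyword:kw   = "var";                    // reserved keyword
identifier:nokw  = [a-z]+;                   // should not match "var"
declaration:root = var-keyword identifier;   // explicit entry point
```

With these definitions, `var x` would be expected to parse as a declaration, while `var var` should be rejected because `identifier` carries `nokw`.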
## Expressions
Expressions define the structure of the text to be parsed. They are composed of terminals, sequences, choices,
and quantifiers.
## Terminals
Terminals match specific characters or strings in the input.
- `"abc"` (string): Matches an exact sequence of characters. Equivalent to `[a][b][c]`.
- `.` (any char): Matches any single character (wildcard).
- `[123]`, `[a-z]`, `[123a-z]` (class): Matches a single character from a set or range.
- `[^123]`, `[^a-z]`, `[^123a-z]` (not class): Matches any single character not in the set.
## Quantifiers
Quantifiers determine how many times an item must match. They are placed immediately after the item they modify.
- `?`: optional (zero or one).
- `*`: zero or more.
- `+`: one or more.
- `{n}`: exact count. Matches exactly n times.
- `{n,}`: at least. Matches n or more times.
- `{,m}`: at most. Matches between 0 and m times.
- `{n,m}`: range. Matches between n and m times.
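The terminals and quantifiers above can be combined as in the following sketch. The rule names are hypothetical, and `nows` is set explicitly on each rule on the assumption that whitespace should not be skipped inside these tokens.

```
// hypothetical rules, for illustration only
year:nows      = [0-9]{4};                 // exactly four digits
integer:nows   = "-"? [0-9]+;              // optional sign, then one or more digits
short-id:nows  = [a-z] [a-z0-9]{,7};       // a letter, then up to seven letters or digits
hex-color:nows = "#" [0-9a-f]{3,6};        // "#", then three to six hex digits
```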
## Composites
Complex patterns are built by combining terminals and other rules.
### 1. Sequences
Items written consecutively are matched in order.
```
// matches "A", then "B", then "C":
my-sequence = "A" "B" "C";
```
### 2. Grouping
Parentheses (...) group items together, allowing quantifiers to apply to the entire group.
```
// matches "AB", "ABAB", "ABABAB"...:
my-group = ("A" "B")+;
```
### 3. Choices
The pipe | character represents a choice between alternatives.
The parser evaluates all provided options against the input at the current position and selects the best match
based on the following priority rules:
1. _longest match_: the option that consumes the largest number of characters takes priority. This eliminates the
need to manually order specific matches before general ones (e.g., `"integer"` will always be chosen over `"int"` if
the input supports it, regardless of their order in the definition).
2. _first definition wins_: if multiple options consume the exact same number of characters, the option defined
first (left-most) in the list takes priority.
```
// Longest match wins automatically: for the input "integer", the "integer" option is chosen,
// even though "int" comes first.
type = "int" | "integer";
// Tie-breaker rule: if input is "foo", both options match 3 characters. Because 'keyword' is defined first, it
// takes priority over 'identifier'. (Use :kw and :nokw to control such situations, where applicable.)
content = keyword | identifier;
```
## Comments
Comments follow C-style syntax and are ignored by the definition parser.
- line comments: start with `//` and end at the newline.
- block comments: enclosed in `/* ... */`.
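Both comment styles might appear in a syntax file as follows; the rule name `greeting` is hypothetical.

```
/*
 * Block comment: may span
 * multiple lines.
 */
greeting = "hello"; // line comment: runs to the end of the line
```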
## Examples
- [JSON](examples/json.treerack)
- [Scheme](examples/scheme.treerack)
- [Treerack (itself)](../syntax.treerack)