# Treerack Syntax Definition Language

The Treerack library uses a custom grammar description language derived from EBNF (Extended Backus-Naur Form).
It allows for the concise definition of recursive descent parsers.

A syntax file consists of a series of production rules (definitions), each terminated by a semicolon.

## Production Rules

A rule assigns a name to a pattern expression. Rules may include optional flags to modify the parser's behavior
or the resulting AST (Abstract Syntax Tree).

```
RuleName = Expression;
RuleName:flag1:flag2 = Expression;
```

## Flags

Flags are appended to the rule name, separated by colons. They control AST generation, whitespace handling, and
error propagation.

- `alias`: Transparent Node. The rule validates input but does not create its own node in the AST. Child
  nodes (if any) are attached to the parent of this rule.
- `ws`: Global Whitespace. Marks this rule as the designated whitespace handler. The parser will attempt to
  match (and discard) this rule between tokens throughout the entire syntax.
- `nows`: No Whitespace. Disables automatic whitespace skipping inside this rule. Useful for defining tokens
  like string literals where spaces are significant.
- `root`: Entry Point. Explicitly marks the rule as the starting point of the syntax. If omitted, the last
  defined rule is implied to be the root.
- `kw`: Keyword. Marks the content as a reserved keyword.
- `nokw`: No Keyword. Prevents the rule from matching text that matches a defined `kw` rule. Essential for
  distinguishing identifiers from keywords (e.g., ensuring `var` is not parsed as a variable name).
- `failpass`: Pass Failure. If this rule fails to parse, the error is reported as a failure of the parent rule,
  not this specific rule.
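
As a rough illustration of how several of these flags might be combined, the sketch below defines a tiny hypothetical syntax. The rule names (`space`, `varKeyword`, `identifier`, `item`, `program`) are invented for this example; only the flags themselves come from the list above.

```
// Discarded between tokens everywhere in the syntax.
space:ws = " ";

// Reserved keyword.
varKeyword:kw = "var";

// Adjacent letters only (no whitespace skipping inside); never matches "var".
identifier:nokw:nows = [a-z]+;

// Transparent: creates no node of its own.
item:alias = varKeyword | identifier;

// Explicit entry point.
program:root = item*;
```

Because `item` is an alias, the `varKeyword` and `identifier` nodes it matches would attach directly to `program`.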

## Expressions

Expressions define the structure of the text to be parsed. They are composed of terminals, sequences, choices,
and quantifiers.

## Terminals

Terminals match specific characters or strings in the input.

- `"abc"` (string): Matches an exact sequence of characters.
- `.` (any char): Matches any single character (wildcard).
- `[123]`, `[a-z]`, `[123a-z]` (class): Matches a single character from a set or range.
- `[^123]`, `[^a-z]`, `[^123a-z]` (not class): Matches any single character not in the set.
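
A minimal sketch of each terminal form in use. The rule names are hypothetical and serve only to label the patterns.

```
// The exact characters h, e, l, l, o.
greeting = "hello";

// Any single character.
anyChar = .;

// One character from the range 0 to 9.
digit = [0-9];

// Two ranges combined in one class.
hexDigit = [0-9a-f];

// Any single character other than 0 to 9.
notDigit = [^0-9];
```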

## Quantifiers

Quantifiers determine how many times an item must match. They are placed immediately after the item they modify.

- `?`: Optional (zero or one).
- `*`: Zero or more.
- `+`: One or more.
- `{n}`: Exact count. Matches exactly `n` times.
- `{n,}`: At least. Matches `n` or more times.
- `{,m}`: At most. Matches between 0 and `m` times.
- `{n,m}`: Range. Matches between `n` and `m` times.
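
The following sketch applies each quantifier to a small token pattern. The rule names are hypothetical, and `nows` is used on the assumption that these tokens should not allow whitespace between their characters.

```
// Optional sign, then one or more digits.
integer:nows = "-"? [0-9]+;

// A letter followed by zero or more letters or digits.
word:nows = [a-z] [a-z0-9]*;

// Exactly four digits.
year:nows = [0-9]{4};

// Two or more spaces.
indent:nows = " "{2,};

// "#" followed by at most three letters, possibly none.
tag:nows = "#" [a-z]{,3};

// "#" followed by three to six hex digits.
hexColor:nows = "#" [0-9a-f]{3,6};
```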

## Composites

Complex patterns are built by combining terminals and other rules.

### 1. Sequences

Items written consecutively are matched in order.

```
// Matches "A", then "B", then "C"
MySequence = "A" "B" "C";
```

### 2. Grouping

Parentheses `(...)` group items together, allowing quantifiers to apply to the entire group.

```
// Matches "AB", "ABAB", "ABABAB"...
MyGroup = ("A" "B")+;
```

### 3. Choices

The pipe character `|` represents a choice between alternatives.

The parser evaluates all provided options against the input at the current position and selects the best match
based on the following priority rules:

1. _Longest Match_: The option that consumes the largest number of characters takes priority. This eliminates the
   need to manually order specific matches before general ones (e.g., `"integer"` will always be chosen over `"int"` if
   the input supports it, regardless of their order in the definition).
2. _First Definition Wins_: If multiple options consume the exact same number of characters, the option defined
   first (left-most) in the list takes priority.

```
// Longest match wins automatically:
// Input "integer" matches the "integer" option, even though "int" is listed first.
type = "int" | "integer";

// Tie-breaker rule:
// If input is "foo", both options match 3 characters.
// Because 'keyword' is listed first, it takes priority over 'identifier'.
// (Use :kw and :nokw to control such situations, where applicable.)
content = keyword | identifier;
```
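
The three composite forms can be combined in a single rule. The sketch below, using hypothetical rule names, describes a bracketed, comma-separated list of numbers and words. It is an illustration built from the constructs described in this document, not a definition taken from the Treerack examples.

```
space:ws    = " ";
number:nows = [0-9]+;
word:nows   = [a-z]+;

// Choice between two kinds of item; transparent in the AST.
item:alias = number | word;

// Sequence with nested groups: an optional first item followed by
// zero or more comma-prefixed items. As the last rule, it is the implied root.
list = "[" (item ("," item)*)? "]";
```

Read this way, `list` should accept inputs such as `[]`, `[1]`, and `[1, abc, 42]`; the optional outer group is what allows the empty list.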

## Comments

Comments follow C-style syntax and are ignored by the definition parser.

- Line comments: Start with `//` and end at the newline.
- Block comments: Enclosed in `/* ... */`.
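
A short sketch of both comment forms, assuming that, as in C, a block comment may span several lines. The rule names are hypothetical.

```
// A line comment runs to the end of the line.
digit = [0-9];

/* A block comment
   is closed explicitly. */
number = digit+;
```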

## Examples

- [JSON](examples/json.treerack)
- [Scheme](examples/scheme.treerack)
- [Treerack (itself)](../syntax.treerack)