diff --git a/doc/.gitignore b/doc/.gitignore
new file mode 100644
index 0000000..0067d87
--- /dev/null
+++ b/doc/.gitignore
@@ -0,0 +1 @@
+/book/
diff --git a/doc/book.toml b/doc/book.toml
new file mode 100644
index 0000000..2483de5
--- /dev/null
+++ b/doc/book.toml
@@ -0,0 +1,10 @@
+[book]
+title = "LALRPOP Documentation"
+description = "LALRPOP Documentation"
+authors = ["Niko Matsakis"]
+
+[output.html]
+mathjax-support = true
+
+[output.html.playpen]
+editable = true
diff --git a/doc/src/README.md b/doc/src/README.md
new file mode 100644
index 0000000..1c99dfd
--- /dev/null
+++ b/doc/src/README.md
@@ -0,0 +1,14 @@
+# LALRPOP
+
+LALRPOP is a parser generator, similar in principle to [YACC], [ANTLR], [Menhir],
+and other such programs. In general, it has the grand ambition of
+being the most usable parser generator ever. This ambition is most
+certainly not fully realized: right now, it's fairly standard, maybe
+even a bit subpar in some areas. But hey, it's young. For the most
+part, this README is intended to describe the current behavior of
+LALRPOP, but in some places it includes notes for planned future
+changes.
+
+[YACC]: http://dinosaur.compilertools.net/yacc/
+[ANTLR]: http://www.antlr.org/
+[Menhir]: http://gallium.inria.fr/~fpottier/menhir/
diff --git a/doc/src/SUMMARY.md b/doc/src/SUMMARY.md
new file mode 100644
index 0000000..cd1a705
--- /dev/null
+++ b/doc/src/SUMMARY.md
@@ -0,0 +1,14 @@
+# Summary
+
+- [LALRPOP](README.md)
+- [Crash course on parsers](crash_course.md)
+- [Tutorial](tutorial/index.md)
+  - [Adding LALRPOP to your project](tutorial/001_adding_lalrpop.md)
+  - [Parsing parenthesized numbers](tutorial/002_paren_numbers.md)
+  - [Type inference](tutorial/003_type_inference.md)
+  - [Controlling the lexer](tutorial/004_controlling_lexer.md)
+  - [Handling full expressions](tutorial/005_full_expressions.md)
+  - [Building ASTs](tutorial/006_building_asts.md)
+  - [Macros](tutorial/007_macros.md)
+  - [Error recovery](tutorial/008_error_recovery.md)
+- [Writing a custom lexer](lexer_tutorial/index.md)
diff --git a/doc/src/crash_course.md b/doc/src/crash_course.md
new file mode 100644
index 0000000..11aa3e1
--- /dev/null
+++ b/doc/src/crash_course.md
@@ -0,0 +1,70 @@
+# Crash course on parsers
+
+If you've never worked with a parser generator before, or aren't
+really familiar with context-free grammars, this section is just a
+*very brief* introduction to the basic idea. Basically, a grammar is
+a nice way of writing out what kinds of inputs are legal. In our
+example, we want to support parenthesized numbers, so things like
+`123`, `(123)`, etc. We can express this with a simple grammar like:
+
+```
+Term = Num | "(" Term ")"
+```
+
+Here we say we are trying to parse a *term*, and a term can either be
+a number (`Num`) or some other term enclosed in parentheses (here I
+did not define what a number is, but in the real LALRPOP example we'll
+do that with a regular expression). Now imagine a potential input
+like `((123))`. We can show how this would be parsed by writing out
+something called a "parse tree":
+
+```
+(  (  1  2  3  )  )
+|  |  |     |  |  |
+|  |  +-Num-+  |  |
+|  |     |     |  |
+|  |   Term    |  |
+|  |     |     |  |
+|  +---Term----+  |
+|        |        |
++------Term-------+
+```
+
+Here you can see that we parsed `((123))` by finding a `Num` in the
+middle, calling that `Num` a `Term`, and matching up the parentheses
+to form two more terms on top of that.
+
+Note that this parse tree is not a data structure but more a
+visualization of the parse.
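(For concreteness: a literal parse-tree type for this little grammar might look roughly like the sketch below. This is purely illustrative and not part of the tutorial or of this commit; the type and field names are made up.)

```rust
// Hypothetical parse-tree types that mirror the grammar exactly, one variant
// per production of `Term`; these are made-up names, not the tutorial's code.
enum Term {
    Num(Num),          // Term = Num
    Paren(Box<Term>),  // Term = "(" Term ")"
}

struct Num(String);    // the matched digits, e.g. "123"
```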
+I mean, you *can* build up a parse tree as a data structure (roughly
+like the sketch above), but typically you don't want to: it is more
+detailed than you need. For example, you may not be that interested in the
+no-op conversion from a `Num` to a `Term`. The other weird thing about
+a parse tree is that it is intimately tied to your grammar, but often
+you have some existing data structures you would like to parse into --
+so if you built up a parse tree, you'd then have to convert from the
+parse tree into those data structures, and that might be annoying.
+
+Therefore, what a parser generator usually does is instead let you
+choose how to represent each node in the parse tree, and how to do the
+conversions. You give each nonterminal a type, which can be any Rust
+type, and you write code that will execute each time a new node in the
+parse tree would have been constructed. In fact, in the examples that follow, we'll
+eventually build up something like a parse tree, but in the beginning, we won't
+do that at all. Instead, we'll represent each number and term as an `i32`,
+and we'll propagate this value around.
+
+To make this a bit more concrete, here's a version of the grammar above
+written in LALRPOP notation (we'll revisit this in more detail later, of course).
+You can see that the `Term` nonterminal has been given the type `i32`,
+and that each of the definitions has some code that follows a `=>` symbol.
+This is the code that will execute to convert from the thing that was matched
+(like a number, or a parenthesized term) into an `i32`:
+
+```lalrpop
+Term: i32 = {
+    Num => /* ... number code ... */,
+    "(" Term ")" => /* ... parenthesized code ... */,
+};
+```
+
+OK, that's enough background, let's do this for real!
diff --git a/doc/src/lexer_tutorial.md b/doc/src/lexer_tutorial.md
new file mode 100644
index 0000000..675c2e2
--- /dev/null
+++ b/doc/src/lexer_tutorial.md
@@ -0,0 +1 @@
+# Lexer Tutorial
diff --git a/doc/lexer_tutorial.md b/doc/src/lexer_tutorial/index.md
similarity index 98%
rename from doc/lexer_tutorial.md
rename to doc/src/lexer_tutorial/index.md
index ef6430d..7d48c37 100644
--- a/doc/lexer_tutorial.md
+++ b/doc/src/lexer_tutorial/index.md
@@ -2,7 +2,7 @@
 
 Let's say we want to parse the Whitespace language, so we've put together a grammar like the following:
 
-```rust
+```lalrpop
 pub Program = <Statement*>;
 
 Statement: ast::Stmt = {
@@ -37,7 +37,7 @@ At the moment, LALRPOP doesn't allow you to configure the default tokenizer. In
 
 Let's start by defining the stream format. The parser will accept an iterator where each item in the stream has the following structure:
 
-```rust
+```lalrpop
 pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;
 ```
 
@@ -47,7 +47,7 @@ pub type Spanned<Tok, Loc, Error> = Result<(Loc, Tok, Loc), Error>;
 
 Whitespace is a simple language from a lexical standpoint, with only three valid tokens:
 
-```rust
+```lalrpop
 pub enum Tok {
     Space,
     Tab,
@@ -57,7 +57,7 @@
 
 Everything else is a comment. There are no invalid lexes, so we'll define our own error type, a void enum:
 
-```rust
+```lalrpop
 pub enum LexicalError {
     // Not possible
 }
@@ -65,7 +65,7 @@ pub enum LexicalError {
 
 Now for the lexer itself. We'll take a string slice as its input. For each token we process, we'll want to know the character value and the byte offset in the string where it begins. We can do that by wrapping the `CharIndices` iterator, which yields tuples of `(usize, char)` representing exactly that information.
 
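(As a quick aside, separate from the tutorial's code and from this commit: the tiny Rust program below shows what `CharIndices` yields -- one `(byte_offset, char)` pair per character, which is exactly the start position and character value we want for each token.)

```rust
fn main() {
    // For the input " \t\n" this prints:
    // 0 ' '
    // 1 '\t'
    // 2 '\n'
    for (offset, ch) in " \t\n".char_indices() {
        println!("{} {:?}", offset, ch);
    }
}
```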
-```rust
+```lalrpop
 use std::str::CharIndices;
 
 pub struct Lexer<'input> {
@@ -91,7 +91,7 @@ Let's review our rules:
 
 Writing a lexer for a language with multi-character tokens can get very complicated, but this is so straightforward that we can translate it directly into code without thinking very hard. Here's our `Iterator` implementation:
 
-```rust
+```lalrpop
 impl<'input> Iterator for Lexer<'input> {
     type Item = Spanned<Tok, usize, LexicalError>;
 
@@ -116,7 +116,7 @@ That's it. That's all we need.
 
 To use this with LALRPOP, we need to expose its API to the parser. It's pretty easy to do, but also somewhat magical, so pay close attention. Pick a convenient place in the grammar file (I chose the bottom) and insert an `extern` block:
 
-```rust
+```lalrpop
 extern {
     // ...
 }
@@ -124,7 +124,7 @@
 
 Now we tell LALRPOP about the `Location` and `Error` types, as if we're writing a trait:
 
-```rust
+```lalrpop
 extern {
     type Location = usize;
     type Error = lexer::LexicalError;
@@ -135,7 +135,7 @@
 
 We expose the `Tok` type by kinda sorta redeclaring it:
 
-```rust
+```lalrpop
 extern {
     type Location = usize;
     type Error = lexer::LexicalError;
@@ -150,7 +150,7 @@ Now we have to declare each of our terminals. For each variant of `Tok`, we pick
 
 Here's the whole thing:
 
-```rust
+```lalrpop
 extern {
     type Location = usize;
     type Error = lexer::LexicalError;
@@ -174,7 +174,7 @@ From now on, the parser will take a `Lexer` as its input instead of a string sli
 
 And any time we write a string literal in the grammar, it'll substitute a variant of our `Tok` enum. This means **we don't have to change any of the rules we already wrote!** This will work as-is:
 
-```rust
+```lalrpop
 FlowCtrl: ast::Stmt = {
     " " " "