pm39 tree sitter parser

Notes and such about the Tree-Sitter ConTeXt parser.

Features

Version 0.6 of the tree-sitter-context_en parser supports the following features:

Document Areas

If the document-level start and stop commands (\starttext and \stoptext, or \startcomponent and \stopcomponent) exist in the document, the parser builds a tree with preamble, main, and postamble nodes. Dividing the document this way makes things easier for tools that may, for example, want to ignore the postamble.

If no start- or stop- commands exist in the document, all content is contained in a main node.
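The division can be illustrated with a small Python sketch. This is not the parser's implementation (the real work happens in the tree-sitter grammar); the function name split_areas and the returned dictionary are hypothetical, and the sketch assumes that a document missing either the start or the stop command is treated as all main content.

```python
import re

# Hypothetical illustration only; not part of tree-sitter-context_en.
# A TeX command name ends at the first non-letter, hence the lookahead.
START = re.compile(r"\\start(text|component)(?![A-Za-z])")
STOP = re.compile(r"\\stop(text|component)(?![A-Za-z])")


def split_areas(source: str) -> dict:
    """Split a ConTeXt source string into preamble, main, and postamble."""
    start = START.search(source)
    # Assumption: look for the first stop command after the start command.
    stop = STOP.search(source, start.end()) if start else None
    if start is None or stop is None:
        # No document-level start/stop pair: everything is main content.
        return {"preamble": "", "main": source, "postamble": ""}
    return {
        "preamble": source[:start.start()],
        "main": source[start.start():stop.end()],
        "postamble": source[stop.end():],
    }


doc = "\\setupbodyfont[12pt]\n\\starttext\nHello\n\\stoptext\nignored\n"
areas = split_areas(doc)
```

A tool that wants to skip the postamble would then only look at areas["preamble"] and areas["main"].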

Commands

The parser tokenizes commands into:

  • name
  • zero or more option blocks (square brackets with keywords)
  • zero or more settings blocks (square brackets with key=val pairs)
  • zero or more scopes (curly braces after the command)

Settings are further tokenized into keys and values, with values able to contain other tokens (more commands, etc.).
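As a rough illustration of that breakdown, here is a Python sketch that splits a single command into the same four parts. It is a simplification under stated assumptions, not the grammar itself: the names read_command, find_close, and split_top_level are hypothetical, a bracket block counts as a settings block whenever it contains "=", and escaped delimiters such as \{ are not handled.

```python
import re

# Hypothetical illustration only; not the tree-sitter grammar.
NAME = re.compile(r"\\([A-Za-z]+)")


def find_close(text, start, opener, closer):
    """Return the index of the delimiter that closes text[start]."""
    depth = 0
    for i in range(start, len(text)):
        if text[i] == opener:
            depth += 1
        elif text[i] == closer:
            depth -= 1
            if depth == 0:
                return i
    raise ValueError("unbalanced " + opener)


def split_top_level(body):
    """Split on commas that are not nested inside curly braces."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return [p.strip() for p in parts if p.strip()]


def read_command(text):
    """Break one command into name, option blocks, settings blocks, scopes."""
    m = NAME.match(text)
    if m is None:
        raise ValueError("text does not start with a command")
    result = {"name": m.group(1), "options": [], "settings": [], "scopes": []}
    pos = m.end()
    while pos < len(text) and text[pos] in "[{":
        opener = text[pos]
        end = find_close(text, pos, opener, "]" if opener == "[" else "}")
        body = text[pos + 1:end]
        if opener == "{":
            result["scopes"].append(body)
        elif "=" in body:  # assumption: "=" marks a settings block
            pairs = (item.partition("=") for item in split_top_level(body))
            result["settings"].append({k.strip(): v.strip() for k, _, v in pairs})
        else:
            result["options"].append(split_top_level(body))
        pos = end + 1
    return result


cmd = read_command(r"\setuphead[chapter][style=\bfb, color=red]{Intro}")
```

Here the value of style is itself a command (\bfb), which is the kind of nested token the real parser would tokenize further; the sketch simply keeps it as a string.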

Groups

The parser understands the following types of groupings:

  • Brace groups (starting with "{" or "\bgroup", and ending with "}" or "\egroup")
  • "Command" groups (starting with "\start" and ending with "\stop")
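The two kinds of grouping can be sketched with a simple stack-based matcher in Python. This is only an illustration of the idea, not how the parser is written: the name group_spans and the "command:<name>" labels are hypothetical, and the sketch assumes that any command beginning with \start or \stop delimits a command group.

```python
import re

# Hypothetical illustration only; not the tree-sitter grammar.
# Alternatives, in order: \start<name> or \stop<name>, \bgroup or \egroup,
# any other escaped character (so \{ and \} are skipped), a bare brace.
TOKEN = re.compile(
    r"\\(start|stop)([A-Za-z]*)|\\(bgroup|egroup)(?![A-Za-z])|\\.|[{}]",
    re.S,
)


def group_spans(text):
    """Return (kind, start, end) for each group, innermost groups first."""
    stack, spans = [], []
    for m in TOKEN.finditer(text):
        tok = m.group(0)
        is_command = m.group(1) is not None
        kind = "command:" + m.group(2) if is_command else "brace"
        if m.group(1) == "start" or m.group(3) == "bgroup" or tok == "{":
            stack.append((kind, m.start()))
        elif m.group(1) == "stop" or m.group(3) == "egroup" or tok == "}":
            if not stack or stack[-1][0] != kind:
                raise ValueError("unbalanced group at offset %d" % m.start())
            _, begin = stack.pop()
            spans.append((kind, begin, m.end()))
        # Anything else is an escaped character and opens or closes nothing.
    if stack:
        raise ValueError("unclosed group at offset %d" % stack[-1][1])
    return spans


sample = r"\startitemize {a \bgroup b\egroup} \{ \stopitemize"
spans = group_spans(sample)
```

Because "{" and "\bgroup" are pushed with the same kind, a brace group opened with one form may be closed with the other, matching the description above.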

Inline Math

The parser supports minimal handling of inline math.

(Future work: more math support!)

Inclusions

Code Inclusions

The parser supports marking the following inclusions for inlined code:

  • luacode
  • tikzcode
  • MPinclusions
  • useMPgraphic
  • reusableMPgraphic
  • MPcode
  • MPpage
  • staticMPfigure

Note that the parser will mark these areas for external parsing, but nothing will happen if the external parser isn't available.

(As of this writing, an external parser exists for Lua, but not for MetaPost or TikZ.)
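To give a feel for what "marking an area for external parsing" means, the following Python sketch pulls the body out of a few of these environments so it could be handed to another parser. It is a stand-in written with a regular expression, not the parser's actual mechanism; the function name find_inclusions is hypothetical, and only three of the environments listed above are used, chosen because they take no arguments.

```python
import re


def find_inclusions(source, names=("luacode", "MPcode", "MPpage")):
    """Return (environment, body) pairs for each \\start<name>...\\stop<name>.

    Hypothetical illustration only; not part of tree-sitter-context_en.
    """
    alternatives = "|".join(re.escape(name) for name in names)
    pattern = re.compile(
        r"\\start(" + alternatives + r")(?![A-Za-z])"  # opening command
        r"(.*?)"                                        # body, non-greedy
        r"\\stop\1(?![A-Za-z])",                        # matching stop
        re.S,
    )
    return [(m.group(1), m.group(2).strip()) for m in pattern.finditer(source)]


source = (
    "\\startluacode\nprint('hi')\n\\stopluacode\n"
    "Some text.\n"
    "\\startMPcode\ndraw fullcircle;\n\\stopMPcode\n"
)
found = find_inclusions(source)
```

Each body is returned untouched; whether anything useful happens with it depends on whether a parser for that language is available, as noted above.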

Typing Environment Inclusions

The parser supports marking the following typing environments:

  • MetaPost
  • Lua
  • HTML
  • CSS
  • XML
  • PARSEDXML

...and a generic typing inclusion.

Other Things

The parser marks commands relating to project structure.

The parser marks escaped characters, and will complain about characters that should be escaped but are not (except in special circumstances).

The parser should be line-ending agnostic.

Future Directions

  • Parse and include more of the document structure in the syntax tree? (Reflect chapters, sections, etc. in the syntax tree? What to do about user-defined headings?)
  • Table support for the parser? (which model(s)?)
  • Better math support?
  • Better programming support? (Explicitly tag things like loop and branch commands?)
  • More inclusions? (Markdown?)
  • Other ConTeXt interface languages?
  • Should the parser be more strict about what's allowed in the preamble?