
Tree Sitter in Pulsar

Tree-sitter is a parsing library that Pulsar loads through native modules and uses for tokenization. The idea is that a language grammar generates an AST of the source code, and Pulsar then tokenizes that tree in the editor using rules defined in CSON files (which somewhat resemble CSS selectors).

Debugging a Grammar

Inside Pulsar, it is possible to require Tree-sitter and try out a grammar. To do this, run this code in DevTools:

const Parser = require('tree-sitter');
const Java = require('tree-sitter-java');

const parser = new Parser();
parser.setLanguage(Java);

const tree = parser.parse(`
class A {
  void func() {
    obj.func2("arg");
  }
}
`);
console.log(tree.rootNode.toString());

This will create a parser, set its language to Java, and try to parse the source code that we sent. This specific fragment of code will print:

(program
  (class_declaration
    name: (identifier)
    body: (class_body
      (method_declaration
        type: (void_type)
        name: (identifier)
        parameters: (formal_parameters)
        body: (block
          (expression_statement
            (method_invocation
              object: (identifier)
              name: (identifier)
              arguments: (argument_list
                (string_literal)))))))))

I pretty-printed the output by hand. Basically, it says that the "root node" is a program that contains a class_declaration. After that come the class's name, then its body, and so on.
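To avoid indenting the output by hand, a small helper can do it. This is a minimal sketch, assuming toString() returns the whole S-expression on a single line; prettySexp is a hypothetical name and is not part of Tree-sitter or Pulsar.

```javascript
// Hypothetical helper (not part of Tree-sitter or Pulsar): re-indents an
// S-expression string so that every node starts on its own line.
function prettySexp(sexp, indent = '  ') {
  const tokens = sexp.match(/\(|\)|[^\s()]+/g) || [];
  let out = '';
  let depth = 0;
  let prev = null;
  for (const tok of tokens) {
    if (tok === '(') {
      // A node opens a new line unless a field name such as "name:" already did.
      if (out !== '' && !(prev && prev.endsWith(':'))) {
        out += '\n' + indent.repeat(depth);
      }
      out += '(';
      depth++;
    } else if (tok === ')') {
      depth--;
      out += ')';
    } else if (tok.endsWith(':')) {
      // Field names start the line and stay attached to the node they label.
      out += '\n' + indent.repeat(depth) + tok + ' ';
    } else {
      // Node type; any further word inside the same node is space-separated.
      out += (prev === '(' ? '' : ' ') + tok;
    }
    prev = tok;
  }
  return out;
}
```

In DevTools it could be used as console.log(prettySexp(tree.rootNode.toString())).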

Modern tree-sitter

If you look at the AST above, you'll see that there are things inside parentheses and things like name: and body:. The latter are what Tree-sitter now calls "field names", and Pulsar is not yet using them anywhere. This is problematic for multiple reasons, but the main one is that tokenization goes wrong. For example, in the code above, we want to tokenize obj.func2("arg") by marking func2 as a function that's being called, but the AST for that fragment is:

(method_invocation
  object: (identifier)
  name: (identifier)
  arguments: (argument_list (string_literal)))

What disambiguates the method name from other things is the field name: obj has the field name object, and func2 has the field name name. As Pulsar is not parsing this, the closest match we can get is:

  'method_invocation > identifier': 'entity.name.function'

But unfortunately, this does not solve the issue - both obj and func2 are tokenized as functions in this case.

Fixing this

src/tree-sitter-language-mode.js is where the syntax tree is walked to generate tokens. It basically has methods like seek, _moveDown, etc. that push entries into containingNodeTypes and other local fields. Later, these are turned into tokens via _currentScopeId, which tries to match the current position in the tree against the this.languageLayer.grammar.scopeMap data structure.

This data structure is defined in src/syntax-scope-map.js. It contains an anonymousScopeTable (which is, AFAIK, a list of words that are always tokenized the same way, such as the language's keywords) and a namedScopeTable (which, surprisingly, does not handle field names even though it has "name" in it). It is basically a "leaf first" structure. So, when tokenizing obj.func("a string"), we would get:

  1. method_invocation, which gets pushed into containingNodeTypes; then we "move down"
  2. identifier (for obj), which also gets pushed into containingNodeTypes
  3. We check the token ID, then "move right"
  4. identifier (for func), which replaces the sibling's identifier that was pushed before; we check the token ID and "move right" again
  5. Repeat the process from the beginning, but for argument_list instead of method_invocation (replace the sibling's identifier with argument_list, then move down to push string_literal)
  6. Finally, "move up", popping string_literal, then argument_list, and finally method_invocation, and continue walking the rest of the AST

To get the token ID, we walk through the data structure, checking things as we go. For example, after pushing things for obj, containingNodeTypes holds ['method_invocation', 'identifier']. It holds exactly the same thing for func.
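The walk above can be simulated without Tree-sitter. This is only a sketch: the ast object is a hand-written stand-in for the real tree, and collectStacks is a hypothetical function that records what containingNodeTypes would hold at each leaf. It does not reproduce Pulsar's actual seek and _moveDown logic.

```javascript
// Hand-written stand-in for the AST of obj.func("a string").
// These are plain objects, not real Tree-sitter nodes.
const ast = {
  type: 'method_invocation',
  children: [
    { type: 'identifier', field: 'object', text: 'obj', children: [] },
    { type: 'identifier', field: 'name', text: 'func', children: [] },
    {
      type: 'argument_list',
      field: 'arguments',
      children: [{ type: 'string_literal', text: '"a string"', children: [] }]
    }
  ]
};

// Hypothetical walker: pushes on the way down, pops on the way up, and
// snapshots the stack at every leaf. Note that `field` is never pushed.
function collectStacks(node, containingNodeTypes = [], seen = []) {
  containingNodeTypes.push(node.type);
  if (node.children.length === 0) {
    seen.push({ text: node.text, stack: [...containingNodeTypes] });
  }
  for (const child of node.children) {
    collectStacks(child, containingNodeTypes, seen);
  }
  containingNodeTypes.pop();
  return seen;
}

for (const { text, stack } of collectStacks(ast)) {
  console.log(text, stack);
}
// obj and func both log the stack ['method_invocation', 'identifier']
```

Because only node.type is pushed, nothing in the recorded stacks distinguishes obj from func.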

If we look at the scopeMap structure, inside namedScopeTable, we'll see something like:

identifier: {
  parents: {
    method_invocation: {
      result: ["entity.name.function", ...]
    }
  }
}

And this is how tokenization is done. It is also how the bug appears: both func and obj have the same containingNodeTypes.
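A "leaf first" lookup over this kind of table can be sketched as follows. The table is a simplified copy of the fragment above (with result reduced to a single string), and lookupScope is a hypothetical stand-in, not the real get method of SyntaxScopeMap.

```javascript
// Simplified copy of the namedScopeTable fragment shown above.
const namedScopeTable = {
  identifier: {
    parents: {
      method_invocation: { result: 'entity.name.function' }
    }
  }
};

// Hypothetical lookup: start at the leaf (the last entry of nodeTypes) and
// follow `parents` towards the root, keeping the deepest result found.
function lookupScope(table, nodeTypes) {
  let i = nodeTypes.length - 1;
  let entry = table[nodeTypes[i]];
  let result;
  while (entry) {
    if (entry.result !== undefined) result = entry.result;
    i--;
    entry = i >= 0 && entry.parents ? entry.parents[nodeTypes[i]] : undefined;
  }
  return result;
}

const stackForObj = ['method_invocation', 'identifier'];
const stackForFunc = ['method_invocation', 'identifier'];

console.log(lookupScope(namedScopeTable, stackForObj));  // entity.name.function
console.log(lookupScope(namedScopeTable, stackForFunc)); // entity.name.function
```

Since the two inputs are identical, no lookup based only on node types can return different scopes for obj and func.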

Possible solution

Make src/syntax-scope-map.js aware of field names (we can do that by checking cursor.currentFieldName or by pushing this.treeCursor.currentFieldName), then match things correctly.

We will also need to decide on a syntax for expressing field names in the CSON file, and parse that syntax into the namedScopeTable.

Finally, we'll need to change the get method of the SyntaxScopeMap to match things correctly and get tokenization for things filtered by the field name.
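To make the idea concrete, here is one way a field-aware lookup could look. Everything in it is an assumption made for illustration: the "field:type" key format, the shape of the path entries, and the names fieldAwareTable, findIn and lookupScopeWithFields. None of it is Pulsar's existing format or API, and the real CSON syntax is still to be decided.

```javascript
// Hypothetical table: a key is either "type" or "field:type".
const fieldAwareTable = {
  'name:identifier': {
    parents: {
      method_invocation: { result: 'entity.name.function' }
    }
  }
};

// Prefer the field-qualified key, then fall back to the plain node type.
function findIn(table, entry) {
  if (!table) return undefined;
  const keys = entry.field
    ? [`${entry.field}:${entry.type}`, entry.type]
    : [entry.type];
  for (const key of keys) {
    if (table[key]) return table[key];
  }
  return undefined;
}

// Same leaf-first walk as before, but each path entry also carries the
// field name that the cursor would report for that node (root first).
function lookupScopeWithFields(table, path) {
  let i = path.length - 1;
  let node = findIn(table, path[i]);
  let result;
  while (node) {
    if (node.result !== undefined) result = node.result;
    i--;
    node = i >= 0 ? findIn(node.parents, path[i]) : undefined;
  }
  return result;
}

const objPath = [{ type: 'method_invocation' }, { type: 'identifier', field: 'object' }];
const funcPath = [{ type: 'method_invocation' }, { type: 'identifier', field: 'name' }];

console.log(lookupScopeWithFields(fieldAwareTable, objPath));  // undefined
console.log(lookupScopeWithFields(fieldAwareTable, funcPath)); // entity.name.function
```

Falling back to the plain node type keeps rules that do not mention a field name working as they do today.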