Tree-sitter is a parsing library that Pulsar uses, through native modules, for tokenization. The idea is that a language grammar generates an AST of the source code, and Pulsar then tokenizes it in the editor using rules defined in CSON files (which somewhat resemble CSS selectors).
Inside Pulsar's source code, it is possible to require Tree-sitter and try parsing something with a grammar. To do this, run this code in DevTools:
const Parser = require('tree-sitter');
const Java = require('tree-sitter-java');
const parser = new Parser();
parser.setLanguage(Java);
// Declare `tree` explicitly instead of leaking an implicit global
const tree = parser.parse(`
class A {
  void func() {
    obj.func2("arg");
  }
}
`);
console.log(tree.rootNode.toString());
This will create a parser, set its language to Java, and try to parse the source code that we sent. This specific fragment of code will print:
(program
  (class_declaration
    name: (identifier)
    body: (class_body
      (method_declaration
        type: (void_type)
        name: (identifier)
        parameters: (formal_parameters)
        body: (block
          (expression_statement
            (method_invocation
              object: (identifier)
              name: (identifier)
              arguments: (argument_list
                (string_literal)))))))))
I did the pretty-print manually. Basically, it says that the "root node" is a program that contains a class_declaration. Following that comes the class's name, then its body, and so on.
If you look at the AST above, you'll see that there are things inside parentheses and things like name: and body:. The latter are what Tree-sitter calls "field names", and Pulsar is not yet using them anywhere. This is problematic for multiple reasons, but the main one is that tokenization goes wrong: for example, in the code above, we want to tokenize obj.func2("arg") by marking func2 as a function that's being called, but the AST for that fragment is:
(method_invocation
  object: (identifier)
  name: (identifier)
  arguments: (argument_list (string_literal)))
What disambiguates the method name from other things is the field name: obj has the field name object, and func2 has the field name name. As Pulsar does not read field names, the closest match we can get is:
'method_invocation > identifier': 'entity.name.function'
But unfortunately, this does not solve the issue: both obj and func2 are tokenized as functions in this case.
src/tree-sitter-language-mode.js is where the syntax tree is walked to generate tokens. It basically has methods like seek, _moveDown, etc. that .push node types into containingNodeTypes and other local fields. Later, these are tokenized via _currentScopeId, which basically tries to match the rule we're in against the this.languageLayer.grammar.scopeMap data structure.
This data structure is defined in src/syntax-scope-map.js, and contains an anonymousScopeTable (which is, AFAIK, a list of words that are always tokenized the same way; think "keywords" of the language) and a namedScopeTable (which, surprisingly, does not handle the "field name" even though it has "name" in it). It is basically a "leaf first" structure. So, tokenizing obj.func("a string"), we would get:
1. method_invocation, which gets pushed into containingNodeTypes; then we "move down"
2. identifier (for obj), which also gets pushed into containingNodeTypes
3. identifier (for func), which replaces the sibling's identifier that was pushed before; we check the token ID and "move right" again
4. argument_list, still under method_invocation, which replaces the sibling's identifier; then we move down to push string_literal
5. Finally, we finish by popping string_literal, then argument_list, and finally method_invocation, and continue walking the rest of the AST

To get the token ID, we walk through the data structure, checking things as we go. So, for example, in this case, after pushing things for obj, we have inside containingNodeTypes: ['method_invocation', 'identifier']. We have this same structure for func.
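The walk described above can be sketched with plain objects. This is only an illustration: the `tree` literal and the `collectStacks` helper are made up for this example and are not part of Pulsar or Tree-sitter, and the sketch pops and pushes between siblings where the real code replaces the sibling's entry in place (the resulting stacks are the same).

```javascript
// Mock of the method_invocation subtree for obj.func("a string").
// This is plain data, not a real Tree-sitter node.
const tree = {
  type: 'method_invocation',
  children: [
    { type: 'identifier', text: 'obj', children: [] },
    { type: 'identifier', text: 'func', children: [] },
    {
      type: 'argument_list',
      children: [{ type: 'string_literal', text: '"a string"', children: [] }],
    },
  ],
};

// Depth-first walk that keeps the stack of node types from the root
// down to the current node, and records that stack at every leaf.
function collectStacks(node, containingNodeTypes = [], out = []) {
  containingNodeTypes.push(node.type); // entering the node ("move down")
  if (node.children.length === 0) {
    out.push({ text: node.text, stack: [...containingNodeTypes] });
  }
  for (const child of node.children) {
    collectStacks(child, containingNodeTypes, out); // each sibling reuses the same slot
  }
  containingNodeTypes.pop(); // leaving the node
  return out;
}

const stacks = collectStacks(tree);
for (const { text, stack } of stacks) {
  console.log(text, JSON.stringify(stack));
}
```

Running it shows that obj and func end up with identical stacks, while the string literal gets a longer one that goes through argument_list.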
If we look at the scopeMap structure, inside namedScopeTable, we'll see something like:
identifier: {
  parents: {
    method_invocation: {
      result: ["entity.name.function", ...]
    }
  }
}
And this is how tokenization is done. It is also how the bug appears: both func and obj have the same containingNodeTypes.
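To make the collision concrete, here is a minimal sketch of a leaf-first lookup. It only models the table shape shown above; `lookupScope` is a hypothetical helper written for this example (it is not the actual SyntaxScopeMap code), and the result is simplified to a single string instead of an array.

```javascript
// Simplified stand-in for namedScopeTable; only the shape shown above is modelled.
const namedScopeTable = {
  identifier: {
    parents: {
      method_invocation: { result: 'entity.name.function' },
    },
  },
};

// Leaf-first lookup: start at the last node type in the stack, then follow
// `parents` towards the root, remembering the most specific result found.
function lookupScope(table, containingNodeTypes) {
  let level = table;
  let result;
  for (let i = containingNodeTypes.length - 1; i >= 0; i--) {
    const entry = level[containingNodeTypes[i]];
    if (!entry) break;
    if (entry.result !== undefined) result = entry.result;
    if (!entry.parents) break;
    level = entry.parents;
  }
  return result;
}

const objStack = ['method_invocation', 'identifier'];  // stack for obj
const funcStack = ['method_invocation', 'identifier']; // stack for func

console.log(lookupScope(namedScopeTable, objStack));
console.log(lookupScope(namedScopeTable, funcStack));
```

Because the two stacks are identical, no lookup that only sees node types can return different scopes for obj and func.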
To fix this, we need to make src/syntax-scope-map.js aware of "named fields" (we can do that by checking cursor.currentFieldName, or by pushing this.treeCursor.currentFieldName alongside the node type), and then match things correctly.
We will also need to decide on a syntax for expressing field names in the CSON files, and parse that syntax into the namedScopeTable.
Finally, we'll need to change the get method of SyntaxScopeMap so that it takes the field name into account when matching, and returns scopes filtered by field name.
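One possible shape for a field-aware lookup is sketched below. Everything here is an assumption made for illustration: the `fields` key in the table, the `{ type, field }` stack entries, and the `lookupScope` helper are hypothetical and do not exist in Pulsar. The field values stand in for whatever the cursor's currentFieldName would report for each node.

```javascript
// Each stack entry carries the node type plus the field name reported for
// that node (null when the node has no field name). Hypothetical layout.
const objStack = [
  { type: 'method_invocation', field: null },
  { type: 'identifier', field: 'object' },
];
const funcStack = [
  { type: 'method_invocation', field: null },
  { type: 'identifier', field: 'name' },
];

// Hypothetical table layout: `fields` narrows an entry by field name.
const namedScopeTable = {
  identifier: {
    fields: {
      name: {
        parents: {
          method_invocation: { result: 'entity.name.function' },
        },
      },
    },
  },
};

// Leaf-first lookup that also considers the field name of each node.
function lookupScope(table, stack) {
  let level = table;
  let result;
  for (let i = stack.length - 1; i >= 0; i--) {
    const { type, field } = stack[i];
    let entry = level[type];
    if (!entry) break;
    // Prefer the field-specific entry when one is defined for this field.
    if (field !== null && entry.fields && entry.fields[field]) {
      entry = entry.fields[field];
    }
    if (entry.result !== undefined) result = entry.result;
    if (!entry.parents) break;
    level = entry.parents;
  }
  return result;
}

console.log(lookupScope(namedScopeTable, funcStack));
console.log(lookupScope(namedScopeTable, objStack));
```

With the field name stored next to the node type, the two stacks are no longer identical, so only the identifier in the name field is scoped as a function and obj is left unscoped.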