CSTML is a format for storing data, especially syntax trees. It takes ideas from JSON, XML, and HTML. This guide will help you understand the basics of the CSTML language by going through a series of examples.
To start with, CSTML is a markup language. The Concrete Syntax Tree Markup Language, in fact! This means that at the heart of the document is a string of text, and we "mark up" that text by inserting tags into it, very much like HTML.
Unlike HTML, though, in CSTML we put the text we are marking up in string quotes! It looks like this:
"The quick brown fox jumped over the lazy dog."
A CSTML document is made up of a series of tags. This simple example contains only a single tag, a literal tag, represented in CSTML as a JSON string. Multiple strings next to each other concatenate. For example, this CSTML has the same meaning:
"The quick brown fox " "jumped over the lazy dog."
Already this choice of syntax heads off a major weakness in HTML: it allows us to use pretty-formatting tools on CSTML documents without risking changing their content.
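The concatenation rule can be sketched in a few lines of Python. This is only an illustration for documents made up purely of literal tags; the `literal_text` helper is a hypothetical name, not part of any CSTML tooling. It shows that reformatting the whitespace between literals leaves the text unchanged.

```python
import json
import re

# Matches one JSON string literal: double quotes with backslash escapes.
JSON_STRING = re.compile(r'"(?:[^"\\]|\\.)*"')


def literal_text(document: str) -> str:
    """Concatenate every literal tag in a literals-only document.

    Assumption: the document contains nothing but JSON strings and
    whitespace, so whatever sits between the strings is ignored.
    """
    return "".join(json.loads(m.group(0)) for m in JSON_STRING.finditer(document))


one_literal = '"The quick brown fox jumped over the lazy dog."'
two_literals = '"The quick brown fox "\n    "jumped over the lazy dog."'

# Both spellings carry exactly the same text.
assert literal_text(one_literal) == literal_text(two_literals)
```

Because the text lives inside the quotes, a formatter is free to add line breaks and indentation between tags.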
The tags that do most of the heavy lifting in CSTML are the open and close tags. These work the same way as they do in HTML and XML, selecting a region of text and describing what is inside:
<Verb> "jumped" </>
Unlike HTML and XML the tag name is not repeated in the closing tag. We also permit a JSX-y self-closing tag. Using a self-closing tag the same document could be written like this:
<Verb "jumped" />
Like HTML and XML nodes, a CSTML node can have attributes.
<Verb "jumped" { tense: "past" } />
Unlike HTML and XML, a CSTML document's attributes support arbitrary JSON structure!
<Verb { tense: "past", unicode: { graphemes: 6 } }> "jumped" </>
Like HTML and XML, a CSTML node can contain other nodes:
<Thing> <Article "The" /> " " <Adjective "quick" /> " " <Adjective "brown" /> " " <Noun "fox" /> </>
References
CSTML nodes can have named relationships to each other! Here we've added named relationships for subject:, verb:, and object:
<Sentence>
  subject: <Thing>
    <Article "The" />
    " "
    modifier[]: <Adjective "quick" />
    " "
    modifier[]: <Adjective "brown" />
    " "
    kind: <Noun "fox" />
  </>
  " "
  verb: <Verb "jumped" { tense: "past" } />
  " "
  <Preposition "over" />
  " "
  object: <Thing>
    <Article "the" />
    " "
    modifier[]: <Adjective "lazy" />
    " "
    kind: <Noun "dog" />
  </>
  "."
</>
In this example subject:, object:, and kind: each name a single node, while modifier[]: can appear more than once: the [] marks a reference that holds a list of nodes.
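To make named relationships concrete, here is a small Python model of the first Thing node. It is a sketch under stated assumptions, not the data structure any CSTML implementation uses: `Node`, `leaf`, `text`, and `referenced` are hypothetical names, and reference names are stored without the trailing `:` or `[]`.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    # Each child is a (reference name or None, child) pair,
    # where the child is either a Node or a literal string.
    children: list = field(default_factory=list)


def leaf(type_: str, literal: str) -> Node:
    """Build a node that wraps a single literal, like <Noun "fox" />."""
    return Node(type_, [(None, literal)])


def text(node) -> str:
    """Recover the marked-up text by concatenating literals in document order."""
    if isinstance(node, str):
        return node
    return "".join(text(child) for _, child in node.children)


def referenced(node: Node, name: str) -> list:
    """Return every child attached to the node under the given reference name."""
    return [child for ref, child in node.children if ref == name]


fox = Node("Thing", [
    (None, leaf("Article", "The")),
    (None, " "),
    ("modifier", leaf("Adjective", "quick")),
    (None, " "),
    ("modifier", leaf("Adjective", "brown")),
    (None, " "),
    ("kind", leaf("Noun", "fox")),
])

print(text(fox))                                        # The quick brown fox
print([text(n) for n in referenced(fox, "modifier")])   # ['quick', 'brown']
```

The references add structure without disturbing the text: reading the literals left to right still yields the original string.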
Tokens
Since CSTML is a system for adding metadata to text, we want to make sure that all text can have metadata attached to it. For that reason we consider all strings that aren't explicitly wrapped in a "token node" to be implicitly wrapped in one. A token node is denoted with *, so "Hello world!" is shorthand for:
<* "Hello world!" />
The expansion helps you see that there really is a place to put attributes, even on strings. We just collapse it down when there are no attributes to remove visual clutter and make documents easier to read and write.
This means that in the previous sections when we wrote <Verb "jumped" /> it was shorthand for <Verb> <* "jumped" /> </>
Now the purpose of the * should be clearer: it marks a node as a token node, and it can be combined with a node name:
<*Verb "jumped" />
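The implicit wrapping rule can be sketched as a small normalization pass. The dictionary shape and the `expand_tokens` name below are assumptions made for illustration only. The sketch also assumes that a string sitting directly inside a token node is already wrapped and needs no further expansion.

```python
def expand_tokens(node):
    """Wrap every bare string child in an explicit token node.

    Nodes are modeled as dicts: {"type": name or None, "token": bool,
    "children": [...]}. A new tree is returned; the input is not mutated.
    """
    if isinstance(node, str) or node["token"]:
        # Literals are wrapped by their parent; token nodes are left alone.
        return node
    children = []
    for child in node["children"]:
        if isinstance(child, str):
            # Model of the anonymous token node written <* "..." />
            children.append({"type": None, "token": True, "children": [child]})
        else:
            children.append(expand_tokens(child))
    return {**node, "children": children}


# Model of <Verb "jumped" />
verb = {"type": "Verb", "token": False, "children": ["jumped"]}
# Model of <*Verb "jumped" />
token_verb = {"type": "Verb", "token": True, "children": ["jumped"]}

expanded = expand_tokens(verb)
assert expanded["children"][0] == {"type": None, "token": True, "children": ["jumped"]}
assert expand_tokens(token_verb) == token_verb
```

After expansion every piece of text has a node of its own, which is what gives attributes a place to live.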
Gaps
One of the most interesting features of CSTML documents is that they're permitted to have gaps in them. A gap is a place where some content is known to be missing. Gaps allow a document to be a template, like a form you fill out or an image with a transparent background. A gap tag is represented in a CSTML document as <//>:
<$NameBadge> "Hello, my name is " <//> </>
In this example you may also have noticed the $ in $NameBadge.
Gaps are particularly useful when describing code documents. A code document with a gap in it is usually referred to as a "code snippet" or simply "a template". There are many systems which use template syntax to define code snippets, but these systems inevitably have conflicts between the syntax of the templating language and the syntax of the content being templated. By tracking gaps without needing any syntax, CSTML creates a universally-applicable system for snippets and partial code documents.
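Treating a document with gaps as a template can be sketched like this. The tuple-based tree, the `GAP` sentinel, and `fill_gaps` are hypothetical names chosen for the example, and the `$` marker on the node name is left out of the model. Note that the gap is a value in the tree, not a piece of syntax inside the text, so no content can ever collide with it.

```python
GAP = object()  # sentinel standing in for the <//> gap tag


def fill_gaps(node, values):
    """Return a copy of the tree with gaps replaced, left to right.

    Nodes are modeled as (type, children) tuples; children are nodes,
    literal strings, or GAP. If there are fewer values than gaps, the
    remaining gaps stay in place, so the result is still a template.
    """
    remaining = iter(values)

    def walk(n):
        if n is GAP:
            return next(remaining, GAP)
        if isinstance(n, str):
            return n
        type_, children = n
        return (type_, [walk(child) for child in children])

    return walk(node)


# Model of: <$NameBadge> "Hello, my name is " <//> </>
badge = ("NameBadge", ["Hello, my name is ", GAP])

filled = fill_gaps(badge, ["Ada"])   # "Ada" is just sample data
assert filled == ("NameBadge", ["Hello, my name is ", "Ada"])
assert badge[1][1] is GAP            # the template itself is untouched
```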
Shifting
Gaps power another feature of CSTML, shifting, which is designed to help you create documents by appending tags as you read left to right. For example, when parsing the code 10 + 20, the 10 is encountered before anything shows that it belongs to an addition:
<Number '10' />
^^^
<Addition>
  left+: <//>
  operator: '+'
  right+: <Number '20' />
</>
To understand what the tree would look like without a shift in it, take the node before the shift tag (that the shift tag is pointing to) and drop it in the gap.
<Addition>
  left+: <Number '10' />
  operator: '+'
  right+: <Number '20' />
</>
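The "drop it in the gap" rule can be written out as a short Python sketch. Everything here is an illustrative assumption rather than a real CSTML or BABLR API: the `GAP` and `SHIFT` sentinels, the `(type, children)` tuples, and the choice to fill the first gap found in a depth-first walk of the node that follows the shift.

```python
GAP = object()    # stands in for <//>
SHIFT = object()  # stands in for ^^^


def fill_first_gap(node, replacement):
    """Return (new_node, filled), replacing the first gap found depth-first."""
    if node is GAP:
        return replacement, True
    if isinstance(node, str):
        return node, False
    type_, children = node
    new_children, filled = [], False
    for ref, child in children:
        if not filled:
            child, filled = fill_first_gap(child, replacement)
        new_children.append((ref, child))
    return (type_, new_children), filled


def resolve_shifts(tags):
    """Rewrite a flat list of top-level tags so that no shift remains."""
    out, i = [], 0
    while i < len(tags):
        if tags[i] is SHIFT:
            if not out or i + 1 >= len(tags):
                raise ValueError("a shift needs a node on both sides")
            shifted = out.pop()  # the node the shift tag points back to
            target, filled = fill_first_gap(tags[i + 1], shifted)
            if not filled:
                raise ValueError("node after a shift has no gap to fill")
            out.append(target)
            i += 2
        else:
            out.append(tags[i])
            i += 1
    return out


ten = ("Number", [(None, "10")])
twenty = ("Number", [(None, "20")])
with_shift = [
    ten,
    SHIFT,
    ("Addition", [("left+", GAP), ("operator", "+"), ("right+", twenty)]),
]

resolved = resolve_shifts(with_shift)
assert resolved == [
    ("Addition", [("left+", ten), ("operator", "+"), ("right+", twenty)])
]
```

The writer only ever appends tags, while a reader that wants the ordinary tree can run a pass like this afterwards.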
The ++left+:left+:left+left+
Sometimes we need to know not just what we did find, but also what we might have found. We do this with a "cover", which is a concept already well-worn in parser design. Covers are written as nodes with a leading _, as in <_Cover>, and the node being covered is referenced with _:
<_Expression>
  _: <Addition>
    left+: <_Expression> _: <Number '10' /> </>
    operator: '+'
    right+: <_Expression> _: <Number '20' /> </>
  </>
</>
Sharp-eyed readers will notice that I've cheated a little on this example. So far I've left out the spaces! Thus my document doesn't really say 10 + 20, it says 10+20. With a ^^^ shift in the document it is not obvious where the spaces should go.
Covers solve this problem by creating a place to put the space: inside the cover. Only certain types of references are allowed inside covers: a single _: reference and #: references. Writing each space " " as #: " " lets the document really say 10 + 20:
<_Expression>
  _: <Number '10' />
  #: " "
</>
^^^
<_Expression>
  _: <Addition>
    left+: <//>
    operator: '+'
    right+: <_Expression>
      #: " "
      _: <Number '20' />
    </>
  </>
</>
Namespaces
Namespaces are one of the most infamously tricky features of XML. CSTML offers a much simpler implementation of namespaces using binding tags, which are written like :Binding:.
:JS: <_Expression> _: :JSX: <OpenNodeTag /> </>
The key improvement over XML here is that the CSTML namespace names actually _mean something_. In XML the namespace name is only an alias for the namespace URL and is otherwise not used. In CSTML there are no namespace URLs, only namespace names.
While CSTML namespace names are non-negotiable, they are not universal. That is to say, a single namespace may have different names in different places: it all depends on how its parent namespace prefers to refer to it.
This is a crucial detail which informs the design. CSTML actually started out using a <Lang.Node /> syntax, where Lang. was a prefix on <Node />.
To understand why this matters, think about taking a node from one namespace and dropping it into another. Because nodes don't have to use the same name for the same namespace, you might need to change the namespace name just to preserve the meaning. For example, if you take a node that in XML is <Lang.Node /> and move it somewhere the same namespace is called Language, the node itself has to be rewritten as <Language.Node />. With binding tags the node stays the same and only the binding changes, with :Language: taking the place of :Lang:.
To pass through multiple namespaces, just specify multiple binding tags:
:Foo: :Bar: <Node />
You can access your parent namespace with :..:.
You can join multiple names within a single binding tag with /, as in :Foo/Bar: or :../..:.
Parsing
Most proposed data interchange formats fail to gain traction. This is because the value of such a format is determined by who else is using it -- who you can use it to communicate with. Normally this would be the most massive obstacle facing a new data language like CSTML. It probably still is for us, but we do have a trick up our sleeve! CSTML is an ideal format to represent the work done by a parser. CSTML uses tags to separate a document into a hierarchy of spans; parse trees use nodes to separate a document into a hierarchy of spans. Any text file containing parseable code is a CSTML document waiting to happen! This is why we've designed BABLR to be the most flexible, powerful parser system we know of: writing a CSTML-emitting parser for a single programming language might gain us anywhere from tens of thousands to hundreds of millions of new documents compatible with our tools, depending on the popularity of the language.
Making CSTML a parser-driven data language allows its community to trade the challenge of needing people to help write an initial corpus of documents for the challenge of needing people to help write an initial family of parsers. Because this challenge will make or break CSTML, great care has been taken to shape the conditions until they are favorable to the desired outcome. Potential adopters will be weighing the benefits against the costs, so we've optimized both.
We've reduced the costs of adoption by making it as easy as possible to write parsers: powerful APIs, no compile step, lots of help from the JS debugger and our logging tools. Once you've decided to do the work we also make it as valuable as possible: the list of things you can use a CSTML document for goes on and on. We've used BABLR parsers on this web page to syntax highlight our code examples and provide the ability to explore the underlying parse trees in place. You can use CSTML for structural code search so that it is no longer necessary to dumb syntax down for "greppability". You can even use CSTML to refactor entire codebases: load your whole codebase into a CSTML document and our immutable agAST trees make it easy to define and execute semantically-driven changes that touch tens, hundreds, or thousands of files. That we've wedded a refactoring tool to such a powerful system of grammars ensures that you won't be forced to leave your most trusted tools behind when you want them most: when working with unfamiliar or uncommon programming languages.
CSTML even unlocks some proper sci-fi futurism: we foresee a future in which code editing is accessible to many, many more people and can be done on more kinds of devices like touchscreen tablets and even VR headsets. Where current code editors are designed to emulate the experience of sitting at a typewriter writing code on paper, our goal is to make coding feel more like snapping together Lego bricks. While the concept of logic-brick editing is not new and projects like Scratch have been quite successful, past environments for brick coding have tried to avoid syntax, while we see every reason to embrace it. While snapping bricks together is an ideal UX for writing code, reading syntactic symbols is and will always be the most efficient way for humans to read code. CSTML lets us have the best of both worlds: succinct syntax, as is essential to professional work, but with complete syntax documents being built up from syntax bricks which even a complete novice can learn about by playing with them and seeing how the bricks snap together.
Can you imagine a future where nobody ever needs to spend their time hunting down a misplaced close paren -- yet again? I can!