Intro to CSTML

CSTML is a format for storing data, especially syntax trees. It takes ideas from JSON, XML, and HTML. This guide will help you understand the basics of the CSTML language by going through a series of examples.

To start with, CSTML is a markup language. The Concrete Syntax Tree Markup Language, in fact! This means that at the heart of the document is a string of text, and we "mark up" that text by inserting tags into it, very much like HTML.

Unlike HTML, though, in CSTML we put the text we are marking up in string quotes! It looks like this:

"The quick brown fox jumped over the lazy dog."

A CSTML document is made up of a series of tags. This simple example contains only a single tag, a literal tag, represented in CSTML as a JSON string. Multiple strings next to each other concatenate. For example, this CSTML has the same meaning:

"The quick brown fox "
"jumped over the lazy dog."

Already this choice of syntax heads off a major weakness in HTML: it allows us to use pretty-formatting tools on CSTML documents without risking changing their content.
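
The concatenation rule can be sketched in a few lines of TypeScript. This is only an illustration, not part of any CSTML tooling: the literalText helper is a made-up name, and it assumes every non-empty line holds exactly one double-quoted JSON string.

```typescript
// Hypothetical helper: reads a document made only of literal tags,
// assuming one JSON string per non-empty line.
function literalText(source: string): string {
  return source
    .split("\n")
    .map((line) => line.trim()) // indentation is formatting, not content
    .filter((line) => line.length > 0) // blank lines are formatting too
    .map((line) => JSON.parse(line) as string)
    .join(""); // adjacent literals concatenate
}

const oneTag = literalText('"The quick brown fox jumped over the lazy dog."');
const twoTags = literalText('"The quick brown fox "\n"jumped over the lazy dog."');
console.log(oneTag === twoTags); // true
```

Because whitespace outside the quotes never reaches the result, a formatter is free to re-indent or re-wrap the document.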

The tags that do most of the heavy lifting in CSTML are the open and close tags. These work the same way as they do in HTML and XML, selecting a region of text and describing what is inside:

<Verb> "jumped" </>

Unlike HTML and XML, the tag name is not repeated in the closing tag. We also permit a JSX-style self-closing tag. Using a self-closing tag, the same document could be written like this:

<Verb "jumped" />

Like HTML and XML nodes, a CSTML node can have attributes.

<Verb "jumped" { tense: "past" } />

Unlike HTML and XML, a CSTML document's attributes support arbitrary JSON structure!

<Verb { tense: "past", unicode: { graphemes: 6 } }>
  "jumped"
</>

Like HTML and XML, a CSTML node can contain other nodes:

<Thing>
  <Article "The" />
  " "
  <Adjective "quick" />
  " "
  <Adjective "brown" />
  " "
  <Noun "fox" />
</>
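
To see that nesting adds structure without changing the underlying text, here is a minimal sketch of the tree above as plain data. The TreeNode shape and the textOf function are assumptions made for illustration, not a BABLR API.

```typescript
// Hypothetical in-memory shape: a child is either literal text or a nested node.
type TreeChild = string | TreeNode;
interface TreeNode {
  type: string;
  children: TreeChild[];
}

// Recover the marked-up text by concatenating every literal in document order.
function textOf(node: TreeNode): string {
  return node.children
    .map((child) => (typeof child === "string" ? child : textOf(child)))
    .join("");
}

const thing: TreeNode = {
  type: "Thing",
  children: [
    { type: "Article", children: ["The"] },
    " ",
    { type: "Adjective", children: ["quick"] },
    " ",
    { type: "Adjective", children: ["brown"] },
    " ",
    { type: "Noun", children: ["fox"] },
  ],
};
console.log(textOf(thing)); // prints: The quick brown fox
```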

References

CSTML nodes can have named relationships to each other! Here we've added named relationships for subject:, verb:, and object: (and a few others) to help highlight the internal syntactic structure of our sentence:

<Sentence>
  subject:
  <Thing>
    <Article "The" />
    " "
    modifier[]: <Adjective "quick" />
    " "
    modifier[]: <Adjective "brown" />
    " "
    kind: <Noun "fox" />
  </>
  " "
  verb: <Verb "jumped" { tense: "past" } />
  " "
  <Preposition "over" />
  " "
  object:
  <Thing>
    <Article "the" />
    " "
    modifier[]: <Adjective "lazy" />
    " "
    kind: <Noun "dog" />
  </>
  "."
</>

In this example subject:, verb:, object:, and kind: are what are known as references, or reference tags. The modifier[]: reference uses [] (array brackets) to indicate that more modifiers may follow and that together they form a list.

Tokens

Since CSTML is a system for adding metadata to text, we want to make sure that all text can have metadata attached to it. For that reason we consider all strings that aren't explicitly wrapped in a "token node" to be implicitly wrapped in one. A token node is denoted with *, the token flag. Thus if you have the document "Hello world!" you should consider it to be short for:

<* "Hello world!" />

The expansion helps you see that there really is a place to put attributes, even on strings. We just collapse it down when there are no attributes to remove visual clutter and make documents easier to read and write.

This means that in the previous sections when we wrote <Verb "jumped" /> we now understand that this expands to <Verb> <* "jumped" /> </>.

Now the purpose of the * flag becomes more clear: it lets us make a document that is a single node again:

<*Verb "jumped" />
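
The expansion rule can be sketched as a small tree transform. The TokenTreeNode shape below is a hypothetical model invented for this example: type is null for an anonymous token node, and token records whether the node carries the * flag.

```typescript
// Hypothetical in-memory shape, not an official BABLR type.
interface TokenTreeNode {
  type: string | null; // null models an anonymous <* ... /> node
  token: boolean; // true when the node carries the * flag
  children: Array<string | TokenTreeNode>;
}

// Wrap every bare string in an anonymous token node, leaving token nodes alone.
function wrapBareStrings(node: TokenTreeNode): TokenTreeNode {
  if (node.token) return node; // a token node already holds its text directly
  return {
    type: node.type,
    token: false,
    children: node.children.map((child): TokenTreeNode =>
      typeof child === "string"
        ? { type: null, token: true, children: [child] }
        : wrapBareStrings(child)
    ),
  };
}

// <Verb "jumped" /> expands to <Verb> <* "jumped" /> </>
const plainVerb: TokenTreeNode = { type: "Verb", token: false, children: ["jumped"] };
console.log(JSON.stringify(wrapBareStrings(plainVerb).children));
// prints: [{"type":null,"token":true,"children":["jumped"]}]

// <*Verb "jumped" /> is already a single token node, so nothing changes.
const tokenVerb: TokenTreeNode = { type: "Verb", token: true, children: ["jumped"] };
console.log(wrapBareStrings(tokenVerb) === tokenVerb); // true
```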

Gaps

One of the most interesting features of CSTML documents is that they're permitted to have gaps in them. A gap is a place where some content is known to be missing. Gaps allow a document to be a template, like a form you fill out or an image with a transparent background. A gap tag is represented in a CSTML document as <//>, like this:

<$NameBadge>
  "Hello, my name is "
  <//>
</>

In this example you may also have noticed the $ in $NameBadge. This is called the "template flag" because it indicates that parts of this document may be missing -- that there may be gaps.

Gaps are particularly useful when describing code documents. A code document with a gap in it is usually referred to as a "code snippet" or simply "a template". There are many systems which use template syntax to define code snippets, but these systems inevitably have conflicts between the syntax of the templating language and the syntax of the content being templated. By tracking gaps without needing any syntax inside the text itself, CSTML creates a universally-applicable system for snippets and partial code documents.
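
Here is a rough sketch of why structural gaps avoid syntax conflicts. It assumes a simplified model in which a template body is a flat list whose entries are either literal text or null (standing in for <//>); fillGaps and hasGaps are hypothetical names, not part of any CSTML library.

```typescript
// Hypothetical model: null stands in for a gap tag (<//>).
type TemplateChild = string | null;

// Fill gaps left to right; gaps without a matching filler stay gaps.
function fillGaps(children: TemplateChild[], fillers: string[]): TemplateChild[] {
  const remaining = fillers.slice();
  return children.map((child) =>
    child === null ? (remaining.shift() ?? null) : child
  );
}

function hasGaps(children: TemplateChild[]): boolean {
  return children.some((child) => child === null);
}

const nameBadge: TemplateChild[] = ["Hello, my name is ", null];
const filledBadge = fillGaps(nameBadge, ["Ada"]);
console.log(hasGaps(nameBadge), hasGaps(filledBadge)); // true false
console.log(filledBadge.join("")); // Hello, my name is Ada

// The content can contain template-looking characters with no escaping,
// because the gap is a separate entry rather than a marker inside the text.
const mustache = fillGaps(["{{ ", null, " }}"], ["user.name"]);
console.log(mustache.join("")); // {{ user.name }}
```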

Shifting

Gaps power another feature of CSTML, shifting, which is designed to help you create documents by appending tags as you read left to right. For example, when parsing the code 10 + 20, you'll first encounter a complete and valid number (10), and only as you proceed to parse further will you understand that you're parsing an addition expression:

<Number '10' />
^^^
<Addition>
  left+: <//>
  operator: '+'
  right+: <Number '20' />
</>

To understand what the tree would look like without a shift in it, take the node before the shift tag (that the shift tag is pointing to) and drop it in the gap.

<Addition>
  left+: <Number '10' />
  operator: '+'
  right+: <Number '20' />
</>

The + flag on a reference tag, as in left+:, alerts the consumer that a shift may occur at this location, which is to say that another open node tag might later jump in front of the open node tag immediately following left+:.
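
Resolving a shift can be sketched as a function over plain data. This is an illustration under simplifying assumptions: ExprNode and resolveShift are hypothetical names, a gap is modeled as a null value, and only gaps that are direct children of the node after the shift are considered.

```typescript
// Hypothetical in-memory shape: each prop pairs a reference name with a value.
interface ExprProp {
  ref: string | null; // null for unreferenced literal content
  value: string | ExprNode | null; // null models a gap (<//>)
}
interface ExprNode {
  type: string;
  props: ExprProp[];
}

// Drop the node before the shift tag into the first gap of the node after it.
function resolveShift(shifted: ExprNode, target: ExprNode): ExprNode {
  const gapIndex = target.props.findIndex((prop) => prop.value === null);
  if (gapIndex === -1) throw new Error("shift target has no gap to fill");
  const props = target.props.map((prop, index): ExprProp =>
    index === gapIndex ? { ref: prop.ref, value: shifted } : prop
  );
  return { type: target.type, props };
}

const ten: ExprNode = { type: "Number", props: [{ ref: null, value: "10" }] };
const twenty: ExprNode = { type: "Number", props: [{ ref: null, value: "20" }] };
const additionWithGap: ExprNode = {
  type: "Addition",
  props: [
    { ref: "left+", value: null },
    { ref: "operator", value: "+" },
    { ref: "right+", value: twenty },
  ],
};

const resolved = resolveShift(ten, additionWithGap);
console.log(resolved.props[0]?.value === ten); // true
```

The function returns a new node rather than mutating the one it was given, so the shifted form of the document stays available for a consumer that is still reading left to right.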

Sometimes we need to know not just what we did find, but also what we might have found. We do this with a "cover", a concept already well-worn in parser design. Covers are written as nodes with an _ before their name, like <_Cover>. An _: reference indicates the particular node (or cover) being covered.

<_Expression>
  _:
  <Addition>
    left+:
    <_Expression>
      _: <Number '10' />
    </>
    operator: '+'
    right+:
    <_Expression>
      _: <Number '20' />
    </>
  </>
</>

Sharp-eyed readers will notice that I've cheated a little on this example. So far I've left out the spaces! Thus my document doesn't really say 10 + 20 like I've claimed. Instead it would read 10+20, which is a good bit more strenuous on the eyes. I left the spaces out because at first I didn't have any place I could put them where they wouldn't interfere with the shift operator (^^^).

Covers solve this problem by creating a place to put the space: inside the cover. Only certain types of references are allowed inside covers: a single _: for the target node, and #: references, which hold trivia. Technically, everywhere so far that we've written " " it would have been more correct to write #: " ". Doing so here allows us to complete the example for 10 + 20 as seen by a shifting parser:

<_Expression>
  _: <Number '10' />
  #: " "
</>
^^^
<_Expression>
  _:
  <Addition>
    left+: <//>
    operator: '+'
    right+:
    <_Expression>
      #: " "
      _: <Number '20' />
    </>
  </>
</>

Namespaces

Namespaces are one of the most infamously tricky features of XML. CSTML offers a much simpler implementation of namespaces using binding tags, which are written like :Binding: and look like this in a document:

:JS: <_Expression>
  _: :JSX: <OpenNodeTag />
</>

The key improvement over XML here is that the CSTML namespace names actually mean something. In XML the namespace name is only an alias for the namespace URL and is otherwise not used. In CSTML there are no namespace URLs, only namespace names.

While CSTML namespace names are non-negotiable, they are not universal. That is to say, a single namespace may have different names in different places: it all depends on how its parent namespace prefers to refer to it.

This is a crucial detail which informs the design. CSTML actually started out using a <Lang.Node /> syntax like XML's but when we asked ourselves if Lang. really belonged inside <Node /> in a philosophical sense, we were forced to conclude that it did not: it isn't part of the identity of Node at all!

To understand why this matters, think about taking a node from one namespace and dropping it into another. Because nodes don't have to use the same name for the same namespace, you might need to change the namespace name just to preserve the meaning. For example, if you take a node that in XML is <Lang.Node /> and drop it into a place where that namespace is named Language, the node needs to be changed to <Language.Node />. In XML you're forced to build a whole new node. In CSTML you only need to create a new :Language: binding to replace the :Lang: binding. Since the binding is outside the node, the node itself remains identity-stable.

To pass through multiple namespaces, just specify multiple binding tags:

:Foo: :Bar: <Node />

You can access your parent namespace with :..:. Your parent namespace might not be the same thing as your parent node's namespace though! :..: specifically refers to the namespace defined by the most recent in-scope binding tag.

You can join multiple names within a single binding tag with /. For example you might write :Foo/Bar: to get almost the same effect as :Foo: :Bar:. But it is not quite the same because :Foo/Bar: is all one binding tag while :Foo: :Bar: is two binding tags. This means that thanks to our clever definition of :..: you can escape :Foo/Bar: with just :..: as opposed to the :../..: you must use to escape :Foo: :Bar:.
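
One way to picture the difference is to treat each in-scope binding tag as a single frame on a stack. The sketch below is only a reading of the rules described above, not the actual BABLR resolution algorithm: applyBindingTag and namespacePath are hypothetical names, each .. segment is assumed to escape exactly one binding tag, and binding tags that mix names with .. are left unmodeled.

```typescript
// Hypothetical model: one frame per in-scope binding tag,
// holding the names that were joined with "/" inside that tag.
type BindingFrame = string[];

function applyBindingTag(frames: BindingFrame[], tag: string): BindingFrame[] {
  const segments = tag.split("/");
  if (segments.every((segment) => segment === "..")) {
    // Assumption: each ".." escapes one whole binding tag, however many names it holds.
    if (segments.length > frames.length) {
      throw new Error("no binding tag left to escape");
    }
    return frames.slice(0, frames.length - segments.length);
  }
  if (segments.some((segment) => segment === "..")) {
    throw new Error("mixing names and .. is not modeled in this sketch");
  }
  return [...frames, segments];
}

function namespacePath(frames: BindingFrame[]): string {
  return frames.map((frame) => frame.join("/")).join("/");
}

// :Foo/Bar: is one binding tag, so a single :..: escapes all of it.
const joined = applyBindingTag([], "Foo/Bar");
console.log(namespacePath(joined)); // Foo/Bar
console.log(applyBindingTag(joined, "..").length); // 0

// :Foo: :Bar: is two binding tags, so :..: only escapes :Bar:.
const separate = applyBindingTag(applyBindingTag([], "Foo"), "Bar");
console.log(namespacePath(applyBindingTag(separate, ".."))); // Foo
console.log(applyBindingTag(separate, "../..").length); // 0
```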

Parsing

Most proposed data interchange formats fail to gain traction. This is because the value of such a format is determined by who else is using it -- who you can use it to communicate with. Normally this would be the most massive obstacle facing a new data language like CSTML. It probably still is for us, but we do have a trick up our sleeve! CSTML is an ideal format to represent the work done by a parser. CSTML uses tags to separate a document into a hierarchy of spans; parse trees use nodes to separate a document into a hierarchy of spans. Any text file containing parseable code is a CSTML document waiting to happen! This is why we've designed BABLR to be the most flexible, powerful parser system we know of: writing a CSTML-emitting parser for a single programming language might gain us anywhere from tens of thousands to hundreds of millions of new documents compatible with our tools, depending on the popularity of the language.

Making CSTML a parser-driven data language allows its community to trade the challenge of needing people to help write an initial corpus of documents for the challenge of needing people to help write an initial family of parsers. Because this challenge will make or break CSTML, great care has been taken to shape the conditions until they are favorable to the desired outcome. Potential adopters will be weighing the benefits against the costs, so we've optimized both.

We've reduced the costs of adoption by making it as easy as possible to write parsers: powerful APIs, no compile step, lots of help from the JS debugger and our logging tools. Once you've decided to do the work we also make it as valuable as possible: the list of things you can use a CSTML document for goes on and on. We've used BABLR parsers on this web page to syntax highlight our code examples and provide the ability to explore the underlying parse trees in place. You can use CSTML for structural code search so that it is no longer necessary to dumb syntax down for "greppability". You can even use CSTML to refactor entire codebases: load your whole codebase into a CSTML document and our immutable agAST trees make it easy to define and execute semantically-driven changes that touch tens, hundreds, or thousands of files. That we've wedded a refactoring tool to such a powerful system of grammars ensures that you won't be forced to leave your most trusted tools behind when you want them most: when working with unfamiliar or uncommon programming languages.

CSTML even unlocks some proper sci-fi futurism: we foresee a future in which code editing is accessible to many, many more people and can be done on more kinds of devices like touchscreen tablets and even VR headsets. Where current code editors are designed to emulate the experience of sitting at a typewriter writing code on paper, our goal is to make coding feel more like snapping together Lego bricks. While the concept of logic-brick editing is not new and projects like Scratch have been quite successful, past environments for brick coding have tried to avoid syntax while we see every reason to embrace it. While snapping bricks together is an ideal UX for writing code, reading syntactic symbols is and will always be the most efficient way for humans to read code. CSTML lets us have the best of both worlds: succinct syntax, which is essential to professional work, with complete syntax documents built up from syntax bricks that even a complete novice can learn about by playing with them and seeing how the bricks snap together.

Can you imagine a future where nobody ever needs to spend their time hunting down a misplaced close paren -- yet again? I can!