Stream iterators

One of the innovations that makes BABLR work efficiently is its use of stream iterators. The BABLR parser operates on streams at every level: it uses streams of characters for input, streams of tags for output, and streams of instructions to control the parser. In fact we rely so heavily on streams that for reasons of abstraction and performance we've had to develop our own way of representing them.

Iterators add a layer of indirection to data access, and in Javascript they have a performance cost. What they offer in return is an abstraction over how data is stored. You can write a method parse(chrs) which sees its input as a sequence of characters without needing to say whether the characters are stored in a String, an Array, a Buffer, or perhaps even a rope like those used in text editors. Presenting an abstraction over in-memory data structures is a particularly easy problem: it only requires synchronous iteration.
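
To see how little is required, here's a quick sketch. countNewlines is just a stand-in for a real parse function, and rope is a hypothetical rope object; the point is that the same loop works no matter how the characters happen to be stored.

// countNewlines stands in for parse(chrs): one consumer, many storage layouts
function countNewlines(chrs) {
  let count = 0;
  for (let chr of chrs) {
    if (chr === '\n') count++;
  }
  return count;
}

countNewlines('a\nb\nc');        // characters stored in a String
countNewlines(['a', '\n', 'b']); // characters stored in an Array
countNewlines(rope);             // rope: any hypothetical object implementing Symbol.iterator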

In practice sync iteration only abstracts away how data is laid out in memory. To abstract over the difference between a character stream held in a string and a character stream being loaded over a network socket, you need a different kind of iterator: one which can be suspended and resumed at a different moment in time. The Javascript standards present this as an "async iterator".
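
As a sketch of what that looks like with nothing but standard Javascript, an async generator can present a socket's chunks as a character stream, suspending whenever the next chunk hasn't arrived yet. Here socket is assumed to be any async-iterable source of string chunks.

// socket: any async-iterable source of string chunks,
// e.g. a network stream with a text decoder applied
async function* networkChars(socket) {
  for await (let chunk of socket) {
    // re-yield each chunk one character at a time
    yield* chunk;
  }
}

Of course, anything that consumes networkChars is now forced into an async context, and that's exactly where the trouble begins.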

Unfortunately, async iterators as a construct were designed incorrectly. I've been quite careful to lay out the purpose of iterators as a time-sequence abstraction over storage, and I hope it seems natural to think of both a network stream of data and a string, in the abstract, as sequences of characters. The problem, then, is that async iterators do not provide an abstraction over this!

Far from giving you an abstraction, Javascript instead makes sync and async iteration two completely separate concrete things. Both the official advice and the community's best practice are to literally write all your code twice: once using synchronous iterators and synchronous for .. of loops, so that you have one implementation that's fast and can be called from inside synchronous execution environments. Then you're expected to maintain a complete second copy of your code which uses async function, for await .. of, and asynchronous iterators. I hope I do not need to tell you that maintaining two completely separate copies of your code is not what a successful abstraction looks like! It is exactly the opposite: such a pronounced schism is what you expect to see when the attempt to make a distinction abstract has failed.

Let's see what all this looks like in code:

// The sync API must exist for callers in sync functions
function* syncParse(chrs) {
  for (let chr of chrs) {
    /* ...do parsing stuff */
    yield tag;
  }
}

// The async API must exist for when data is a network stream
async function* asyncParse(chrs) {
  for await (let chr of chrs) {
    /* ...do parsing stuff */
    yield tag;
  }
}
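
And the split doesn't stop at the two definitions: every call site must also pick a side. Roughly (sourceText, socket, and the tag handling here are placeholders):

// callable from plain synchronous code, but only for in-memory input
for (let tag of syncParse(sourceText)) {
  /* ...handle tag */
}

// handles streamed input, but only from inside an async function or module
for await (let tag of asyncParse(networkChars(socket))) {
  /* ...handle tag */
}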

Because BABLR is the most ambitious application ever conceived and executed on top of Javascript data streams, writing all our code twice was not a realistic option for us. To be able to move forward we needed to create a successful abstraction. We call the result "stream iterators". The easiest way to describe stream iterators is with the syntax we would propose:

async? function* streamParse(chrs) {
  for await? (let chr of chrs) {
    /* ...do parsing stuff */
    yield tag;
  }
}

While we can't really use this async? syntax yet, Javascript already makes it possible to write code which works this way: code which is sync when it can be and async when it needs to be. By preserving the syncness of the input and output when it is possible to do so, we allow ourselves to write one parse function that serves all needs. When the input is a string you're able to get a parse result synchronously. When the input is a network stream, some waiting will occur as necessary.

Here's how we can write streamParse using currently-available tools:

import { streamify, wait } from '@bablr/stream-iterator';

let streamParse = streamify(function* (chrs) {
  let iter = chrs[Symbol.streamIterator]();
  let step;
  while (true) {
    // iter.next returns either a step object or a promise
    step = iter.next();

    if (step instanceof Promise) {
      // resolve a promise to a step object
      // this line causes this generator to emit a promise as a step
      step = yield wait(step);
    }

    if (step.done) break;

    let chr = step.value;

    /* ...do parsing stuff */

    // either yields a step synchronously or fulfills a step promise
    yield tag;
  }
});
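
For illustration, here's roughly what consuming streamParse might look like. Treat this as a hedged sketch: it assumes stringChars is an in-memory source exposing Symbol.streamIterator, and that the value streamParse returns speaks the same step-or-promise protocol its input does.

// stringChars: an assumed in-memory character source with [Symbol.streamIterator]
// handleTag: a hypothetical consumer of the parser's output
let tags = streamParse(stringChars);
let iter = tags[Symbol.streamIterator]();

for (let step = iter.next(); !step.done; step = iter.next()) {
  // with in-memory input, next() never returns a promise, so this loop stays fully sync
  handleTag(step.value);
}

// With a network-backed source some of those next() calls would instead return
// promises, and the consumer would await them before reading step.done and step.value.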

While the streamify version is certainly messier-looking, it succeeds handsomely in its purpose: we now have a practical way to define a parsing algorithm once while remaining truly agnostic to whether the data we're parsing lives in a string, on the filesystem, or on the internet. We're now properly able to match a cost to a benefit!