melbalabs journal

practical mysticism

14 Oct 2024

parsing notes

Gradually moving from ad-hoc parsing to using parsing toolkits.

problem statement

We have data in one or more formats that we have to understand and represent in computer memory so that we can process it further. We'll investigate practical ways to solve the problem.

recommended approach - using libraries

The main tool to get work done is well-known formats used via proven libraries. This is a good place to be, because it's a balance between standardization, popularity and code reuse. Inventing new data formats may show you are not familiar with prior work and is harder than expected.

For example the python standard library has a lot of parsers. Some plain text parsers are json, csv (many variants), ini, toml, http, email, html, xml. Examples for binary formats are sqlite3, zip, tar. The python package index has good libraries for yaml.
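
As a quick sketch, here's what using two of those standard library parsers looks like (the data is made up):

import csv
import io
import json

# json: parse a JSON document into python dicts, lists, strings and numbers
doc = json.loads('{"name": "melba", "id": 42}')
assert doc['id'] == 42

# csv: parse comma separated records from any file-like object
rows = list(csv.reader(io.StringIO('melba,main st,42\nother,side st,7\n')))
assert rows[0] == ['melba', 'main st', '42']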

Python has parsers for less strictly defined formats like dates, times, timestamps. There's support for decoding binary data into unicode, signed and unsigned integers, floats.
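
For example (a small sketch; the exact values are made up):

import struct
from datetime import datetime

# a strict timestamp parser from the standard library
ts = datetime.fromisoformat('2024-10-14 12:30:00')
assert ts.year == 2024

# decode 8 raw bytes as a little-endian unsigned 32-bit integer and a 32-bit float
raw = b'\x2a\x00\x00\x00\x00\x00\x80\x3f'
number, ratio = struct.unpack('<If', raw)
assert number == 42 and abs(ratio - 1.0) < 1e-6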

Human-friendly data formats are plain text, so they are usually less strict, not as well thought out and often require custom parsers.

ad hoc parsing of simple text based formats

It's easier to start a project in plain text. We can do exploratory work with just basic text editing tools while the datasets are relatively small. When the program manages plain text data it's easier to eyeball if it's working correctly.

A good place to start is a line-based format of space separated values. For example the lines can represent different independent entities (eg people) and each line can contain its related properties (eg name, address, id). Another way to say it: lines are records, words are fields of each record.

with open(filename) as f:
  for line in f:
    # each line is a record; strip the trailing newline, then split on whitespace
    values = line.strip().split()
    # unpack the fields into named variables
    name, address, id = values

At this point values is a list of strings and it's unpacked into separate variables.

Most likely id is an integer, so we further have to convert it to the more appropriate int type with id = int(id).

Then validate that id is a positive number with eg if id <= 0: raise RuntimeError('oops')

The key observations so far are:

  • splitting all the data into separate lines via line-based iteration. Think of this as splitting all the data with newline as separator
  • splitting lines into multiple values via space as separator
  • removing the extra spaces and newlines from the input data as they are used only for formatting
  • assigning variables for each value
  • converting values to different types. id becomes an int.
  • during the data conversion from string to int, id is validated as a valid integer.
  • after the data conversion, id is validated more to ensure it's positive.
  • there can be an arbitrary number of records
  • there can be an arbitrary number of fields
  • there can be fields of arbitrary size

Overall this is still a good place to be. A data format like this can be easy to parse and can accommodate a very large number of problems. In fact the early internet (http, irc, email, telnet) ran on variations of this approach.

It's a problem if the data we're trying to work with includes the delimiters. When the delimiters are spaces and newlines, a simple solution is to change the delimiters to something more unique. The advantage is the data remains mostly human readable. Another solution is to start encoding parts of the data with things like multipart/form-data, base64, ascii85. This comes at the price of readability and might be a hint to start thinking about a fancier solution. I recommend going back to the drawing board, changing formats and using a library.
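
For example, base64 lets a field carry arbitrary bytes, including our space and newline delimiters, at the cost of readability. A small sketch:

import base64

address = '12 Main St\nApartment 3'  # contains both delimiters
encoded = base64.b64encode(address.encode()).decode()
line = ' '.join(['melba', encoded, '42'])  # still a space separated record

name, encoded_address, id = line.split()
assert base64.b64decode(encoded_address).decode() == address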

ad hoc parsing of advanced text based formats

The early versions of the IRC protocol use a "lines of text" format to describe message passing between clients and servers. Each message is a command with a variable number of arguments.

Note how a data format and a communication protocol are related. The data format describes what all messages look like. It's a relatively low level concept, this is where parsing happens and is what this discussion is about.
A protocol describes how to order and interpret the messages for the higher level goal of achieving successful communication. There's an even higher level goal above the communication - the reason we are communicating, the task we're trying to achieve via the communication. Those higher level concepts will not be discussed at all.

It's easier to start investigating IRC messages with examples.

:melba!melba@irc.example.com PRIVMSG #somechannel :hello

This message follows the template :<nick>!<user>@<host> <command> #<channel> :<text>
We see there are multiple delimiters with variable-length data between them. For simplicity assume the message has been decoded into unicode text from a utf8 byte string.


msg = ':melba!melba@irc.example.com PRIVMSG #somechannel :hello'

pos = 1
sender = []
nick = ''
while pos < len(msg) and msg[pos] != '!':
    nick += msg[pos]
    pos += 1
sender.append(nick)

assert msg[pos] == '!'
pos += 1
user = ''
while pos < len(msg) and msg[pos] != '@':
    user += msg[pos]
    pos += 1
sender.append(user)

assert msg[pos] == '@'
pos += 1
host = ''
while pos < len(msg) and msg[pos] != ' ':
    host += msg[pos]
    pos += 1
sender.append(host)

assert msg[pos] == ' '
pos += 1
command = ''
while pos < len(msg) and str.isalnum(msg[pos]):
    command += msg[pos]
    pos += 1

assert msg[pos] == ' '
pos += 1

channel = ''
while pos < len(msg) and msg[pos] != ' ':
    channel += msg[pos]
    pos += 1

assert msg[pos] == ' '
pos += 1
assert msg[pos] == ':'
pos += 1

text = ''
while pos < len(msg):
    text += msg[pos]
    pos += 1

assert sender == ['melba', 'melba', 'irc.example.com']
assert command == 'PRIVMSG'
assert channel == '#somechannel'
assert text == 'hello'


The code goes character by character and assigns each delimited field to its respective variable.

The protocol has many more message types, but has been designed such that it's easy to process. Around 50 lines of similar code are enough to parse everything. See the irc specification https://www.rfc-editor.org/rfc/rfc2812.txt and an example ad-hoc parser https://github.com/melbaa/ircfw/blob/master/ircfw/parse.py

Observations:

  • we don't need much more than simple character comparisons and loops to extract all relevant data from the raw message
  • a single pass over the message is enough. there's no ambiguity, forward or back references even when parts of the message are optional and depend on the message type
  • user provided data is at the end of the message
  • user provided data is unstructured and can contain all other delimiters. All the protocol message framing is effectively a prefix of this unstructured data.
  • the design didn't happen randomly, it took expertise to achieve such simplicity and flexibility

It quickly gets old to write out every loop; the next easy upgrade is to use string processing libraries.

Here's an example of going from separator to separator using str.find(). It halves the code in this case; overall not a huge improvement.


msg = ':melba!user@irc.example.com PRIVMSG #somechannel :hello with : delimiters ! and @'

nick_end = msg.find('!', 1)
nick = msg[1:nick_end]

user_end = msg.find('@', nick_end)
user = msg[nick_end+1:user_end]
host_end = msg.find(' ', user_end)
host = msg[user_end+1:host_end]

sender = [nick, user, host]

command_end = msg.find('#', host_end+1)-1
command = msg[host_end+1:command_end]
channel_start = command_end+1
channel_end = msg.find(' ', channel_start)
channel = msg[channel_start:channel_end]
text = msg[channel_end+2:]

assert sender == ['melba', 'user', 'irc.example.com']
assert command == 'PRIVMSG'
assert channel == '#somechannel'
assert text == 'hello with : delimiters ! and @'


Another implementation but with str.split() instead. It's less verbose with the usual tradeoff of being less efficient.


msg = ':melba!user@irc.example.com PRIVMSG #somechannel :hello with : delimiters ! and @'

rest = msg[1:]
nick, rest = rest.split('!', 1)

user, rest = rest.split('@', 1)
host, rest = rest.split(' ', 1)

sender = [nick, user, host]

command, channel, rest = rest.split(' ', 2)
text = rest[1:]

assert sender == ['melba', 'user', 'irc.example.com']
assert command == 'PRIVMSG'
assert channel == '#somechannel'
assert text == 'hello with : delimiters ! and @'


a little theory

Interestingly, when trying to work with plain text, we quickly enter intermediate and expert territory and some amount of computer science theory becomes unavoidable.

There are multiple scientific fields interested in what we're trying to do. To quote wikipedia:

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.

Each field came up with descriptions and classifications of different languages and grammars and approaches to understand them. From least generic to most generic they are:

  • type 3 - regular (finite automata - dfa, nfa) - concatenation, repetition (without counting), choice
  • type 2 - context free (pushdown automata - a finite automaton plus a stack) - nesting, recursion
  • type 1 - context sensitive (linear bounded automata - has a limited amount of tape) - bounded context
  • type 0 - unrestricted/recursively enumerable (turing machine - unlimited tape) - unlimited context

In practice simple problems lend themselves to solutions with tools targeting type 3 and type 2 grammars. Using the full capabilities of a computer ends up requiring context sensitivity in some way.

Often extra pre-, post- and intra-processing extensions are added to context-free and regular tools to enable them to handle more generic grammars - think semantic actions and predicates in parser generators, or backreferences and lookarounds in regex engines. The quotes below show why this keeps happening in practice.

You can't check code you can't parse. Checking code deeply requires understanding the code's semantics. The most basic requirement is that you parse it. Parsing is considered a solved problem. Unfortunately, this view is naïve, rooted in the widely believed myth that programming languages exist.

The C language does not exist; neither does Java, C++, and C#. While a language may exist as an abstract idea, and even have a pile of paper (a standard) purporting to define it, a standard is not a compiler. What language do people write code in? The character strings accepted by their compiler. Further, they equate compilation with certification. A file their compiler does not reject has been certified as "C code" no matter how blatantly illegal its contents may be to a language scholar.

Many (all?) compilers diverge from the standard. Compilers have bugs. Or are very old. Written by people who misunderstand the specification (not just for C++). Or have numerous extensions.

In safety-critical software systems, changing the compiler often requires costly re-certification. Thus, we routinely see the use of decades-old compilers. While the languages these compilers accept have interesting features, strong concordance with a modern language standard is not one of them. Age begets new problems. Realistically, diagnosing a compiler’s divergences requires having a copy of the compiler.

A note on regex from http://trevorjim.com/parsing-not-solved/

"Regular expressions" are widely used, but they don't correspond to the regular grammars of language theory. For example, regular expression libraries use a "longest-match" semantics, which means that they are context-sensitive—they are not even context-free, much less regular. Other common features, such as matching the beginning of a line or a previous portion of input, are also context-sensitive. These features are implemented because they are needed: regular languages alone are not sufficient to handle the parsing problems we need to solve.

pattern matching

The next generalization of string processing libraries are those that support pattern matching. Lua has an interesting implementation - it's simple yet can solve a lot of problems.

The Lua string library has functions for substitution, searching, matching and supports character wildcards and classes. Compared to regex, the first thing we see missing is support for grouping and alternation. The result is that more complex string processing ends up being a mix of Lua code and patterns.

Here's an example of splitting a string

local s = "one;two;;four"
local words = {}
for w in (s .. ";"):gmatch("([^;]*);") do
    table.insert(words, w)
end

-- print output
for n, w in ipairs(words) do
    print(n .. ": " .. w)
end

-- output
1: one
2: two
3:
4: four

Here's how to extract item links in world of warcraft addons with captures (taken from https://github.com/melbaa/TopMeOff/blob/d476f576e00b63febd1038bee8c7097d0e47ea34/TopMeOff.lua#L71)

txt_to_search = 'hello world |cffa335ee|Hitem:23577:0:0:0|h[The Hungering Cold]\|h|r  goodbye'

_, _, itemLink = string.find(txt_to_search, "(|c%x+|Hitem:%d+:%d+:%d+:%d+|h%[.-%]|h|r)")
if not itemLink then
    info('an item link is required')
    return
end

regex

The regular expression libraries are a good way to create completely unmaintainable code, but they are a potent tool to solve many parsing problems. Regex is often the first thing people reach for when they have a parsing problem, mostly because the libraries have an extremely large amount of features. Unfortunately every regex engine has slightly different syntax, so mindlessly copy-pasting regex patterns leads to subtle bugs. Code with regular expressions often isn't easily portable and reusable.

Here's how to parse the irc message with regex.


import re

msg = ':melba!user@irc.example.com PRIVMSG #somechannel :hello with : delimiters ! and @'

nick, user, host, command, channel, text = re.fullmatch(':([^!]+)!([^@]+)@([^ ]+) ([^ ]+) ([^ ]+) :(.*)', msg).groups()

sender = [nick, user, host]

assert sender == ['melba', 'user', 'irc.example.com']
assert command == 'PRIVMSG'
assert channel == '#somechannel'
assert text == 'hello with : delimiters ! and @'


Is it an improvement? Sure, if you're an expert. How does one become an expert? By learning about and working with the tools and theory behind them.

To explain the pattern a bit:

  • match :
  • capture everything that's not an !
  • match !
  • capture everything that's not an @
  • match @
  • capture everything that's not a space
  • match a space
  • capture everything that's not a space
  • match a space
  • capture everything that's not a space
  • match a space
  • match :
  • capture everything that remains

Don't try to process nested and recursive data with just regex, you'll find they are not sufficient.
Don't try to process arbitrary text, such as html, with regex - it's futile.
Do try to understand the problem you're solving in depth. This leads to better code, regular expressions included.

Regular expressions are usually one of the building blocks of lexers and tokenizers - learn them to expand your toolset.

Once you're comfortable with the basics and want to learn more, the articles by Russ Cox are a great resource. https://swtch.com/~rsc/regexp/regexp1.html

input validation

In any software program, the parser is the first line of defense against adversaries. An attacker can only compromise your system by interacting with it, and that means sending it input, which is processed first by your parser. The parser is therefore the ideal place to filter out bad inputs, before they reach the rest of your program.

We have to escape and validate all external input somehow. The webdev world has a lot of lessons on the topic. A simple first step is to forbid special characters, which can be abused in all kinds of ways if they aren't escaped properly.

As far as parsing is concerned, we can try making a parser that allows only valid inputs - easier said than done.

Regex often shows up in input validators, but as usual look for alternatives first and use libraries when possible.
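
A tiny sketch of the "allow only valid inputs" idea, reusing the earlier id example: parse strictly, then validate the parsed value.

def parse_id(raw):
    # int() already rejects anything that isn't a well formed integer
    id = int(raw)
    if id <= 0:
        raise ValueError(f'id must be positive, got {id}')
    return id

assert parse_id('42') == 42
# parse_id('42; drop table users') raises ValueError before the junk goes anywhere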

binary formats

The advantage of binary formats is they are made for machine consumption, so they are designed to be parseable from the start. The disadvantage is that, because the formats aren't human readable, it's not as easy to see patterns and figure out what's going on without documentation, design documents and software tooling. A simple-looking sequence of bytes can actually contain multiple types of data encodings, compressions, numeric constants with obscure meaning, segmentation, framing, endianness.

A nice generic and relatively simple binary format that can be useful to review is BSON https://bsonspec.org/spec.html.
It starts with the size of the document then a list of <data-type> <name> <value> tuples. One of the interesting data types is an embedded BSON document, so the format is recursive.
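
As a sketch, here's a tiny hand-assembled BSON document for {"a": 1} being decoded; only the int32 element type is handled, real parsers cover all the types in the spec.

import struct

# 12-byte total length, element type 0x10 (int32), name "a", value 1, trailing 0x00
doc = bytes.fromhex('0c0000001061000100000000')

total_len, = struct.unpack_from('<i', doc, 0)
assert total_len == len(doc)

pos = 4
elem_type = doc[pos]
pos += 1
assert elem_type == 0x10  # int32 element

name_end = doc.index(b'\x00', pos)
name = doc[pos:name_end].decode()
pos = name_end + 1

value, = struct.unpack_from('<i', doc, pos)
pos += 4
assert doc[pos] == 0x00  # end of document marker

assert (name, value) == ('a', 1)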

Another interesting binary format that is also human readable is described by the redis serialization protocol (RESP https://redis.io/docs/latest/develop/reference/protocol-spec/). It describes request-response messaging and how to encode and decode many different data types for transfer across a network.
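
For a taste, here's a sketch of decoding a single RESP bulk string (assuming the well formed input below):

# RESP bulk string: $<length>\r\n<bytes>\r\n
raw = b'$5\r\nhello\r\n'

header, rest = raw.split(b'\r\n', 1)
assert header[:1] == b'$'
length = int(header[1:])
value = rest[:length]
assert rest[length:length + 2] == b'\r\n'
assert value == b'hello'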

parser combinators

Parser combinators are a functional way of composing recursive-descent parsers. As the name implies, it's a good tool to use for grammars with recursion and nesting.

Hand written recursive descent parsers are a bunch of mutually recursive function calls. They are very flexible and customizable, but may be harder to maintain and reuse, a bit like hand written assembly. Combinators usually take away some of the flexibility to provide more structure and maintainability.
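
To make the idea concrete, here's a tiny hand-rolled sketch (not any real library): a parser is a function from input to (value, remaining input) or None, and combinators build bigger parsers out of smaller ones.

def literal(s):
    def parse(inp):
        return (s, inp[len(s):]) if inp.startswith(s) else None
    return parse

def seq(*parsers):
    def parse(inp):
        values = []
        for p in parsers:
            result = p(inp)
            if result is None:
                return None
            value, inp = result
            values.append(value)
        return values, inp
    return parse

def alt(*parsers):
    def parse(inp):
        for p in parsers:
            result = p(inp)
            if result is not None:
                return result
        return None
    return parse

greeting = seq(alt(literal('hello'), literal('bye')), literal(' '), literal('world'))
assert greeting('hello world') == (['hello', ' ', 'world'], '')
assert greeting('bye world') == (['bye', ' ', 'world'], '')
assert greeting('nope') is None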

A naively written combinator needs exponential time for ambiguous grammars. Memoization reduces time complexity to polynomial. It's still better to avoid/forbid ambiguity at the grammar level.
Side note: lack of ambiguity (eg always picking the first possible choice) + memoization are also the main ideas behind parsing expression grammars (PEG), which are a subset of and another way to describe recursive descent parsers.

Combinators are useful if the inputs are small, so ambiguous grammars don't cause your program to take forever.

Combinator libraries are usually scannerless parsers: they don't separate a lexing stage from a parsing stage, the two are mixed together.
Note that nothing prevents you from creating separate lexing and parsing stages yourself.

A popular parsing combinator library for python is pyparsing. It has a long history, enough documentation to get started and plenty of examples around. Its code is understandable as a reference.

example - quake live config parser

Games built on the quake engine are known for providing players hundreds of settings to change. The games have a relatively simple custom config file format. The engine is extremely forgiving with what it accepts - it ignores everything it doesn't understand.
People loved collecting and sharing configs, so configs are full of ascii art made with comments, whitespace or just plain English, which the engine ignores.
Format overview:

  • line based. Commands don't span multiple lines
  • lines end with a newline \n. Carriage returns \r are allowed and ignored.
  • there can be multiple commands per line separated by semicolons ;
  • commands are a single word followed by zero or more arguments
  • arguments are separated by whitespace - spaces or tabs
  • each argument is a single word or multiple words surrounded by double quotes
  • double quoted strings continue until they are closed by a quote or a newline
  • comments allowed everywhere and continue until end of line
  • c-style multiline comments are allowed. /* comment here */

Example config

// 1st December 2008

unbindall

//_Cvar_________________________"Value"_________________"Value"//comment

exec				"cfgs/vstr.cfg"		// Kept vstrs in another config (cfgs\vstr.cfg), to keep this one tidier
bind				"F10"			"exec emsixteenql.cfg"
bind				"F11"			"exec cfgs/demoreplay.cfg"
bind				"mouse2"			"weapon 6;cg_fov 120;  sensitivity 2.2;cg_drawcrosshair 6;bind g vstr rail5"
set rail5 "weapon 7;cg_fov 90;sensitivity 2;cg_drawcrosshair 3;bind g vstr rail7"
seta r_picmip			"16"			// Lower picmip like 5 looks like a horrible blurry mess, 10 is clean

// Nickname
seta name			"cha0ticz"

Here's what the parser looks like (from https://github.com/melbaa/quakeconfig/blob/master/libs/cfgtokenizer.py)

    ws = ' \t'
    ParserElement.setDefaultWhitespaceChars(ws)

    identifier = Word(printables.replace(';', ''))

    dblquoted_string_unclosed = Regex(r'"[^"\r\n]*?(?=[\r\n])')

    dblquoted_string_closed = Regex(r'"[^"\r\n]*?"')

    lineComment = dblSlashComment + OneOrMore(lineEnd)

    cStyleComment = Regex(r"/\*(?:[^*]|\*(?!/))*\*\/")

    arg = dblSlashComment.suppress() | dblquoted_string_closed  \
        | dblquoted_string_unclosed | identifier

    seta_cvar = (Literal('seta') | Literal('set')).suppress() + \
        identifier + ZeroOrMore(arg)

    cvar = identifier + ZeroOrMore(arg)

    cmd = Optional(';').suppress() + \
        Group((seta_cvar | cvar) + Optional(';').suppress())

    line = lineComment.suppress() | OneOrMore(
        cmd) | OneOrMore(lineEnd).suppress()

    cfg_grammar = (ZeroOrMore(cStyleComment.suppress() | line)) + stringEnd

Note it's easier to read the rules from bottom to top.

Let's look at some of the regular expressions.

dblquoted_string_unclosed = Regex(r'"[^"\r\n]*?(?=[\r\n])')

  • match an opening "
  • match everything that's not any of ", \r, \n. As few as possible.
  • ensure a \r or \n follows, but don't match it

cStyleComment = Regex(r"/\*(?:[^*]|\*(?!/))*\*\/")

  • match /*
  • match any character that is not a *, or a * that is not followed by a /
  • match */

example - filter language parser

The next example parses a simple filter language that eventually gets converted to sql via a parametrized query.


import collections

import pyparsing as pp

def filter_action(s, l, t):
    sql = ''
    params = []
    tokens = t
    print('filter_action', tokens)
    for tok in tokens:
        if isinstance(tok, str):
            sql += ' ' + tok
        elif isinstance(tok, BraceExpr):
            sql += ' ' + tok.sql
            params.extend(tok.params)
        else:
            raise RuntimeError('unknown type {}'.format(type(tok)))
    return Filter(sql, params)

def expr_action(s, l, t):
    tok = t[0]
    sql = '"%s" {op} %s'.format(op=tok.op)
    params = [tok.var, tok.val]
    return Expr(sql, params)

def brace_expr_action(s, l, t):
    sql = ''
    params = []
    tokens = t[0]
    print('brace_expr_action', tokens)
    for tok in tokens:
        if isinstance(tok, str):
            sql += ' ' + tok
        elif isinstance(tok, (Expr, Filter)):
            sql += ' ' + tok.sql
            params.extend(tok.params)
        else:
            raise RuntimeError('unknown type {}'.format(type(tok)))
    return BraceExpr(sql, params)

relops = {
    'gt': '>',
    'ge': '>=',
    'lt': '<',
    'le': '<=',
    'eq': '=',
    'ne': '!=',
}

def replace_relops_action(s, l, t):
    return relops[t[0]]



Expr = collections.namedtuple('Expr', ['sql', 'params'])
BraceExpr = collections.namedtuple('BraceExpr', ['sql', 'params'])
Filter = collections.namedtuple('Filter', ['sql', 'params'])
def parse_grammar(inp):
    var = pp.Word(pp.alphas+'_')('var')

    op = pp.one_of(list(relops.keys()))('op')
    op.set_parse_action(replace_relops_action)

    quoted_word = pp.Combine(pp.Literal("'").suppress() + pp.Regex(r"[^']*") + pp.Literal("'").suppress())

    val = (quoted_word)('val')

    expr = pp.Group(var + op + val)('expr')
    expr.set_parse_action(expr_action)

    filter = pp.Forward()
    maybe_negation = pp.Optional('not')

    brace_expr = pp.Group(maybe_negation + (pp.Literal('(') + filter + pp.Literal(')') | expr))('brace_expr')
    brace_expr.set_parse_action(brace_expr_action)

    conditional = (pp.Literal('and') | pp.Literal('or'))('conditional')

    filter << brace_expr + pp.ZeroOrMore(conditional + brace_expr)
    filter.set_parse_action(filter_action)

    filter_end = (filter + pp.string_end)('filter')

    parse = filter_end.parse_string(inp)
    return parse


examples = [
    "hello eq 'world'",
    "(hello eq 'world')",
    "not start gt 'neg'",
    "not (start gt 'something')",
    "starttt gt 'nowww' and end lt '123-  :'",
    "not start gt 'something'",
    "z ne '0' or (not a lt 'b') and not (c eq 'd' or k lt '1') ",
    "zoo eq 'dtsprod' and start_time le '2016-07-20 03:31:55'",
    "zoo eq 'dtsprod' and not start_time le '2016-07-20 03:31:55'",
    "(((zoo eq '1')))",
    "(not((zoo eq '1')))",
]

for i, inp in enumerate(examples):
    print('Case', i+1)
    print('input string', inp)
    parse = parse_grammar(inp)
    print('result', parse)
    print()


Case 1
input string hello eq 'world'
brace_expr_action [Expr(sql='"%s" = %s', params=['hello', 'world'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['hello', 'world'])]
result [Filter(sql='  "%s" = %s', params=['hello', 'world'])]

Case 2
input string (hello eq 'world')
brace_expr_action [Expr(sql='"%s" = %s', params=['hello', 'world'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['hello', 'world'])]
brace_expr_action ['(', Filter(sql='  "%s" = %s', params=['hello', 'world']), ')']
filter_action [BraceExpr(sql=' (   "%s" = %s )', params=['hello', 'world'])]
result [Filter(sql='  (   "%s" = %s )', params=['hello', 'world'])]

Case 3
input string not start gt 'neg'
brace_expr_action ['not', Expr(sql='"%s" > %s', params=['start', 'neg'])]
filter_action [BraceExpr(sql=' not "%s" > %s', params=['start', 'neg'])]
result [Filter(sql='  not "%s" > %s', params=['start', 'neg'])]

Case 4
input string not (start gt 'something')
brace_expr_action [Expr(sql='"%s" > %s', params=['start', 'something'])]
filter_action [BraceExpr(sql=' "%s" > %s', params=['start', 'something'])]
brace_expr_action ['not', '(', Filter(sql='  "%s" > %s', params=['start', 'something']), ')']
filter_action [BraceExpr(sql=' not (   "%s" > %s )', params=['start', 'something'])]
result [Filter(sql='  not (   "%s" > %s )', params=['start', 'something'])]

Case 5
input string starttt gt 'nowww' and end lt '123-  :'
brace_expr_action [Expr(sql='"%s" > %s', params=['starttt', 'nowww'])]
brace_expr_action [Expr(sql='"%s" < %s', params=['end', '123-  :'])]
filter_action [BraceExpr(sql=' "%s" > %s', params=['starttt', 'nowww']), 'and', BraceExpr(sql=' "%s" < %s', params=['end', '123-  :'])]
result [Filter(sql='  "%s" > %s and  "%s" < %s', params=['starttt', 'nowww', 'end', '123-  :'])]

Case 6
input string not start gt 'something'
brace_expr_action ['not', Expr(sql='"%s" > %s', params=['start', 'something'])]
filter_action [BraceExpr(sql=' not "%s" > %s', params=['start', 'something'])]
result [Filter(sql='  not "%s" > %s', params=['start', 'something'])]

Case 7
input string z ne '0' or (not a lt 'b') and not (c eq 'd' or k lt '1')
brace_expr_action [Expr(sql='"%s" != %s', params=['z', '0'])]
brace_expr_action ['not', Expr(sql='"%s" < %s', params=['a', 'b'])]
filter_action [BraceExpr(sql=' not "%s" < %s', params=['a', 'b'])]
brace_expr_action ['(', Filter(sql='  not "%s" < %s', params=['a', 'b']), ')']
brace_expr_action [Expr(sql='"%s" = %s', params=['c', 'd'])]
brace_expr_action [Expr(sql='"%s" < %s', params=['k', '1'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['c', 'd']), 'or', BraceExpr(sql=' "%s" < %s', params=['k', '1'])]
brace_expr_action ['not', '(', Filter(sql='  "%s" = %s or  "%s" < %s', params=['c', 'd', 'k', '1']), ')']
filter_action [BraceExpr(sql=' "%s" != %s', params=['z', '0']), 'or', BraceExpr(sql=' (   not "%s" < %s )', params=['a', 'b']), 'and', BraceExpr(sql=' not (   "%s" = %s or  "%s" < %s )', params=['c', 'd', 'k', '1'])]
result [Filter(sql='  "%s" != %s or  (   not "%s" < %s ) and  not (   "%s" = %s or  "%s" < %s )', params=['z', '0', 'a', 'b', 'c', 'd', 'k', '1'])]

Case 8
input string zoo eq 'dtsprod' and start_time le '2016-07-20 03:31:55'
brace_expr_action [Expr(sql='"%s" = %s', params=['zoo', 'dtsprod'])]
brace_expr_action [Expr(sql='"%s" <= %s', params=['start_time', '2016-07-20 03:31:55'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['zoo', 'dtsprod']), 'and', BraceExpr(sql=' "%s" <= %s', params=['start_time', '2016-07-20 03:31:55'])]
result [Filter(sql='  "%s" = %s and  "%s" <= %s', params=['zoo', 'dtsprod', 'start_time', '2016-07-20 03:31:55'])]

Case 9
input string zoo eq 'dtsprod' and not start_time le '2016-07-20 03:31:55'
brace_expr_action [Expr(sql='"%s" = %s', params=['zoo', 'dtsprod'])]
brace_expr_action ['not', Expr(sql='"%s" <= %s', params=['start_time', '2016-07-20 03:31:55'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['zoo', 'dtsprod']), 'and', BraceExpr(sql=' not "%s" <= %s', params=['start_time', '2016-07-20 03:31:55'])]
result [Filter(sql='  "%s" = %s and  not "%s" <= %s', params=['zoo', 'dtsprod', 'start_time', '2016-07-20 03:31:55'])]

Case 10
input string (((zoo eq '1')))
brace_expr_action [Expr(sql='"%s" = %s', params=['zoo', '1'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['zoo', '1'])]
brace_expr_action ['(', Filter(sql='  "%s" = %s', params=['zoo', '1']), ')']
filter_action [BraceExpr(sql=' (   "%s" = %s )', params=['zoo', '1'])]
brace_expr_action ['(', Filter(sql='  (   "%s" = %s )', params=['zoo', '1']), ')']
filter_action [BraceExpr(sql=' (   (   "%s" = %s ) )', params=['zoo', '1'])]
brace_expr_action ['(', Filter(sql='  (   (   "%s" = %s ) )', params=['zoo', '1']), ')']
filter_action [BraceExpr(sql=' (   (   (   "%s" = %s ) ) )', params=['zoo', '1'])]
result [Filter(sql='  (   (   (   "%s" = %s ) ) )', params=['zoo', '1'])]

Case 11
input string (not((zoo eq '1')))
brace_expr_action [Expr(sql='"%s" = %s', params=['zoo', '1'])]
filter_action [BraceExpr(sql=' "%s" = %s', params=['zoo', '1'])]
brace_expr_action ['(', Filter(sql='  "%s" = %s', params=['zoo', '1']), ')']
filter_action [BraceExpr(sql=' (   "%s" = %s )', params=['zoo', '1'])]
brace_expr_action ['not', '(', Filter(sql='  (   "%s" = %s )', params=['zoo', '1']), ')']
filter_action [BraceExpr(sql=' not (   (   "%s" = %s ) )', params=['zoo', '1'])]
brace_expr_action ['(', Filter(sql='  not (   (   "%s" = %s ) )', params=['zoo', '1']), ')']
filter_action [BraceExpr(sql=' (   not (   (   "%s" = %s ) ) )', params=['zoo', '1'])]
result [Filter(sql='  (   not (   (   "%s" = %s ) ) )', params=['zoo', '1'])]

At a high level, the different parser components are still very readable.

  • a filter is one or more brace_expr joined by conditionals (and/or)
  • a brace_expr is an optional negation of either a filter in brackets or a plain expr. A Group is used to make the parse tree more hierarchical instead of flat.
  • an expr is a group of var op val, where var is a word (alphanumeric), val is a quoted word (allows spaces and more symbols), op is an operator such as eq or gt etc.

More observations:

  • operators written with letters such as eq and ne get converted to symbolic notation eg = and <.
  • the filters are recursive and allow grouping with brackets. Note how a forward object is used as a placeholder until it's defined later.
  • the output is an object with two attributes. sql is a string with variables and values replaced by %s placeholders. params is an array of the extracted parameters that will end up in the placeholders after SQL parameter binding.
  • I've added tracing for brace_expr_action and filter_action to make it obvious what they're trying to process. Those actions have to process a mix of plain string tokens and parsed objects. Using parsing actions in this way can quickly create unmaintainable code. One alternative is to create objects for each string token (simulates lexer and parser stages). Another alternative is to accumulate results outside of the parse tree and leave everything as strings.
  • we use a trick to add spaces before each token, which creates a lot of useless whitespace in the output, but this way we avoid a lot of special cases in the code.

Even for a relatively simple problem, even with a tool that allegedly takes away some flexibility, we have to make a lot of decisions and there isn't one correct way to do it (potential for analysis paralysis). There's a lot going on and the code isn't trivial.

a little more theory

What are lexers and parsers? A simplistic explanation

A lexer is a component that processes text (or any other data stream) and splits it into tokens. Lexers usually have no understanding of the semantics of the text. Tokens are things like:

  • string constants eg words inside of quotes "text 123"
  • reserved keywords like class, function, if, else, while, import, return, true, false, null
  • decimal numbers like 123
  • hexadecimals like 0x2323
  • signs like + -

Lexers often are not context sensitive, so all tokens have to be unambiguous. Eg you don't have 123 as a reserved word, because then you can't have the decimal number 123. This is why variable identifiers in programming languages usually can't start with a digit.

Tokens usually have multiple properties that are useful.

  • a type (eg this token is a keyword)
  • a value (eg import)
  • position in the original text (line number, column number, distance from start or end of the text)

The list (or stream) of tokens is then given to the parser, which tries to match the tokens to the grammar rules in order to produce a parse tree. Some parsers might also run code (via actions, triggers, callbacks) in order to modify the parse tree as it is being built. A simple action might be to convert hexadecimals from text (0x2323) to decimal integers (8995). A more complex action might be to evaluate or reorder part of the tree to simplify it (eg to merge two consecutive string constants into one or to eliminate comments from the tree). We saw examples of actions in the filter language example with filter_action, brace_expr_action, expr_action. The parse tree generated by the parser is then given to the rest of the program for further processing.
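
A minimal sketch of a regex-based lexer in the spirit of the description above (token types and input are made up for illustration):

import re

# each alternative is a named group; the group name becomes the token type
token_spec = [
    ('NUMBER', r'0x[0-9a-fA-F]+|\d+'),
    ('NAME', r'[A-Za-z_]\w*'),
    ('OP', r'[+\-*/=]'),
    ('SKIP', r'\s+'),
]
token_re = re.compile('|'.join(f'(?P<{name}>{pattern})' for name, pattern in token_spec))

def tokenize(text):
    for m in token_re.finditer(text):
        if m.lastgroup != 'SKIP':
            # a token carries its type, value and position
            yield (m.lastgroup, m.group(), m.start())

tokens = list(tokenize('total = 0x2323 + 7'))
assert tokens[0] == ('NAME', 'total', 0)
assert tokens[2] == ('NUMBER', '0x2323', 8)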

contextual lexing

A contextual lexer means it's not processing just regular languages; it actively uses state (potentially the parser state) to decide how to create tokens. For example, if the parser is in a state where it expects a number, the lexer will not bother checking for punctuation; it will try to create a number token.

Contextual lexing is considered inelegant by purists, which is why it's also known as https://en.wikipedia.org/wiki/Lexer_hack. As with most inelegant solutions, your use case can be simple enough that the consequences don't matter.

A lack of context sensitivity in a lexer can frustrate even legends like Ken Thompson.

Seibel: And are there development tools that just make you happy to program?
Thompson: I love yacc. I just love yacc. It just does exactly what you want done. Its complement, lex, is horrible. It does nothing you want done.
Seibel: Do you use it anyway or do you write your lexers by hand?
Thompson: I write my lexers by hand. Much easier.

The quote is from the book Coders at Work; lex mostly works with regular expressions, while yacc generates LALR parsers, which handle recursion and work with higher-level semantics. Both types of tools have many implementations and extensions, so in practice the distinction isn't as clear-cut.

Another example is python, which famously has a context-sensitive lexer that has to count indentations in order to tokenize correctly for the parsing stage.
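
The standard library's tokenize module makes this visible (a small sketch):

import io
import tokenize

src = 'if x:\n    y = 1\n'
names = [tokenize.tok_name[tok.type] for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
# the indentation shows up as explicit INDENT and DEDENT tokens
assert 'INDENT' in names and 'DEDENT' in names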

Note that lex is a lexer generator. Similarly yacc is a parser generator. They take as input a specification and their output is a specific lexer or parser created from the specification. Generators, parsing toolkits and libraries spare the user from implementing a lexer/parser on their own, but don't spare them from having the expertise to configure, use and debug them when they don't work as expected.

parsing algorithm shopping guide

As far as a user is concerned, they can choose parsing algorithms based on interrelated concepts such as

  • expressiveness or generality - types of grammars they can parse
  • ease of use, popularity, documentation. This includes error reporting - how understandable are the errors to a non-expert.
  • programming language bindings
  • implementation efficiency, memory usage, caching
  • how they handle ambiguity.
  • how they handle left or right recursion
  • how much lookahead is needed
  • how many passes over the data are acceptable

A non-expert usually can't tell what kind of grammar they are dealing with. In practice, the more useful it is, the more likely it's context sensitive. This is why people who attempt to parse HTML (which very obviously supports arbitrary nesting) with regular expressions, which are more appropriate for simpler grammars without nesting, end up hitting a wall or keep rewriting and extending their program to handle more and more cases. In fact HTML in full generality is so complex that the WHATWG doesn't even try to provide a grammar for it, but instead describes a parsing algorithm https://html.spec.whatwg.org/multipage/parsing.html.

Many of the algorithms have a (1) (k) (n) (*) after their name. The convention is that a digit or k are a constant amount of lookahead, eg (1) means 1 token of lookahead; n and * usually mean dynamic or arbitrary lookahead - possibly the whole input.

For ambiguity the common strategies are

  • error out
  • specify precedences, which is hard and extremely error prone for humans to do
  • try to use more lookahead
  • context sensitivity described earlier
  • generate independent parse forests instead of parse trees. You have to decide which one to use at a later stage. This can catch you off guard if there's exponential backtracking.

The popular algorithms work mainly with context free grammars. LL, GLL, LR, GLR, SLR, LALR, earley, CYK, PEG etc. It's better to pick something and start working on your problem than to sit around in analysis paralysis. You'll learn more about the problem you're trying to solve and the tools you're working with, which will also improve your decision making later on. If you don't want to use any premade parsing toolkit, it's perfectly acceptable to start working on your own hand written parser. It's very likely you'll reinvent recursive descent parsing or a variant of the other algorithms.

If your theory is stalling, do some more practice, it will improve your theory.
If your practice is stalling, do some more theory, it will improve your practice.

lexing then parsing with LALR(1) via the lark toolkit

Lark (https://lark-parser.readthedocs.io/en/latest/) has multiple lexer and parser implementations. The LALR(1) parser and its contextual lexer are considered to be the fastest ones provided in the library, with the usual tradeoff of less flexible ambiguity resolution.

The basic way to use the parser generator is to initialize it with a custom grammar. The grammar specifies both how to lex tokens and how to combine them into more complex rules.


import lark

grammar = r"""

start: timestamp "  " line NEWLINE

line: hello_msg | goodbye_msg

hello_msg: "hello"
goodbye_msg: "bye"

timestamp: INT "/" INT " " INT ":" INT ":" INT "." INT

CR : /\r/
LF : /\n/
NEWLINE: (CR? LF)+

DIGIT: "0".."9"
INT: DIGIT+
"""

parser = lark.Lark(grammar, parser='lalr')
line = "11/23 21:14:24.715  hello\n\n"
tree = parser.parse(line)
print(tree.pretty())



output:
start
  timestamp
    11
    23
    21
    14
    24
    715
  line
    hello_msg

The grammar is pretty straightforward

  • parsing starts from the start rule
  • the start rule is a timestamp rule followed by 2 spaces, followed by the line rule, followed by a newline token
  • the line rule is either a goodbye or hello message
  • the timestamp rule is a combination of punctuation and integers tokens
  • integer tokens are 1 or more digit tokens
  • digits are from 0 to 9 inclusive
  • the NEWLINE token matches one or more repetitions of an optional carriage return \r followed by a line feed \n

The uppercase identifiers like NEWLINE, INT, DIGIT are the lexer rules. Those identifiers create named tokens. There are also some anonymous tokens such as "hello" and "bye".
The lowercase identifiers like start, line, timestamp are parser rules. They are made from other parser rules or lexer rules.
In other words the lexer tokens (aka terminal symbols) end up as leaves of the parse tree and above them are the rules that combine them (aka production rules, nonterminal symbols).

Note the rules can be read top to bottom, the opposite of the parser combinator example.

Every small detail of the text to be parsed is specified. At the same time we didn't have to specify anything about how to implement a lexer or a parser.
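
As a quick usage sketch (assuming the parser object built above), the other branch of the line rule parses the same way:

tree = parser.parse("11/23 21:14:25.100  bye\n")
print(tree.pretty())  # the line rule now contains goodbye_msg instead of hello_msg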

A more detailed example is available at https://github.com/melbaa/summarize_consumes/tree/master/src/melbalabs/summarize_consumes

Lark has interesting ways to debug and modify the parser behavior. To quote its author

With the InteractiveParser class, Lark introduces a completely new way to interact with your parser. The traditional design of parsers dictates that they run as a closed loop. You start them, and they run as independent actors, calling your callbacks as they see fit. Then the loop ends, and the parser exits. A parser isn't supposed to be paused. Like the abyss, it isn't meant to be peered into.
The interactive parser is a living object that you act upon. You can feed it tokens one by one, observe its state, and even make fully-functional copies of it. It gives you complete control to define your own parsing logic. For example, you could use it to add context-sensitive logic, or to implement backtracking (by making copies at checkpoints), or to debug your parser from the Python console. It also makes error handling a LOT easier.

conclusion and next steps

We saw multiple ways to approach parsing - what is possible, what to expect, what to look for.

So what's next?
The wikipedia articles on the mentioned concepts have more rigorous and complete explanations and are full of references.
Practice a lot.
Investigate how other projects solve their parsing problems.