Tutorial on parsing with the SimpleParse module in Python

  • 2020-05-09 18:46:37
  • OfStack

Like most programmers, I frequently need to identify the parts and structures that exist in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little language" governing what can appear in them.

My approach to writing programs that handle these informal parsing tasks has always been something of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs was always roughly: "read some text, figure out whether you can do something with it, maybe read some more text, and keep trying."

Parsers of various sorts distill descriptions of the parts and structures in documents into concise, clear, declarative rules for identifying a document's components. Here the declarative aspect is the most striking. All of my old ad hoc parsers had an imperative flavor: read a character, make a decision, accumulate a variable, flush, repeat. As we observed in this column's installments on functional programming, that statement-driven style of program flow is comparatively error-prone and difficult to maintain.

Formal parsers almost always use variants of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they handle. The tools we look at here do just that, as does the popular compiler-development tool YACC (and its variants). Basically, an EBNF grammar gives names to the parts you might find in a document; in addition, smaller parts are often combined into larger parts. The frequency and order in which subparts may appear in larger parts is specified by operators, mostly the same symbols you see in regular expressions. In parser-talk, each named part of a grammar is called a "production."

Even readers who do not know EBNF by name have probably seen it in action. For example, the familiar Python Language Reference defines what a floating-point number looks like in Python:
Floating-point number description in EBNF style

floatnumber:   pointfloat | exponentfloat
pointfloat:    [intpart] fraction | intpart "."
exponentfloat: (nonzerodigit digit* | pointfloat) exponent
intpart:       nonzerodigit digit* | "0"
fraction:      "." digit+
exponent:      ("e"|"E") ["+"|"-"] digit+

Or you may have seen elements of an XML DTD defined in EBNF style. For example, the <body> element of a developerWorks tutorial looks something like:
An EBNF-style description in the developerWorks DTD

<!ELEMENT body  ((example-column | image-column)?, text-column) >

The spelling differs slightly, but the same general concepts of quantification, alternation, and ordering appear in all EBNF-style language grammars: in the DTD fragment above, for instance, ? marks an optional part, | separates alternatives, and , fixes the order.
Using SimpleParse to build a tag list

SimpleParse is an interesting tool. To use the module, you also need the underlying module mxTextTools, which implements a "tagging engine" in C. mxTextTools (see Resources later in this article) is powerful, but rather difficult to use directly. Once SimpleParse is layered on top of mxTextTools, the work becomes much easier.

Using SimpleParse really is simple, because you need not think about most of the complexity of mxTextTools. First, create an EBNF-style grammar that describes the language you want to process. Second, call mxTextTools to create a tag list that describes all the successful productions when the grammar is applied to the document. Finally, use the tag list returned by mxTextTools to do the actual work.
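
In outline, the three steps look like this (a minimal sketch; the grammar file typographify.def and the root production 'para' come from the example developed below):

from simpleparse import generator
from mx.TextTools import TextTools

decl = open('typographify.def').read()              # step 1: EBNF grammar
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag('Try -this- out.', parser)  # step 2: build tag list
# step 3: walk 'taglist' to do the real work (shown later in this article)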

For the purposes of this article, the "language" we will parse is the set of markup codes that "smart ASCII" uses to indicate things like bold text, module names, and book titles. This is the same language identified in the previous installment using mxTextTools, and before that using regular expressions and state machines. The language is much simpler than a full programming language, but it is complex enough to be representative.
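
As a concrete illustration, a line of smart ASCII might look like the following (this particular sample is my own, but it uses only the markup forms defined in the grammar below):

Use the *built-in* 'dict()' type, described in _Learning Python_.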

At this point it is worth a quick review: what is the "tag list" that mxTextTools gives us? It is basically a nested structure that simply records the character offsets at which each production matched in the source text. mxTextTools traverses the source text quickly, but it does nothing to the source text itself (at least not when a SimpleParse grammar is used). Let's examine a simplified tag list:
A tag list generated from a SimpleParse grammar


(1,
 [('plain',
   0,
   15,
   [('word', 0, 4, [('alphanums', 0, 4, [])]),
    ('whitespace', 4, 5, []),
    ('word', 5, 10, [('alphanums', 5, 10, [])]),
    ('whitespace', 10, 11, []),
    ('word', 11, 14, [('alphanums', 11, 14, [])]),
    ('whitespace', 14, 15, [])]),
  ('markup',
   15,
   27,
 ...
 289)

The ellipsis in the middle indicates a batch of further matches. The portion we can see says the following: the root production ("para") succeeds, ending at offset 289 (the length of the source text); the child production "plain" spans offsets 0 through 15; the "plain" child is itself composed of smaller productions; after "plain", a "markup" production spans offsets 15 through 27. The details are omitted here, but the first "markup" has its own components, and further productions succeed later in the source text.
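
Nothing in this structure is hard to traverse by hand. For instance, a small recursive printer (my own sketch, assuming taglist holds the (success, matches, offset) structure shown above) makes the nesting visible:

from sys import stdout

def show(results, depth=0):
    # each match is a (production, start, end, children) tuple
    for tag, beg, end, parts in results:
        stdout.write('%s%s: %d-%d\n' % ('  ' * depth, tag, beg, end))
        show(parts, depth + 1)   # recurse into child productions

success, matches, last_offset = taglist
show(matches)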

An EBNF-style grammar for "smart ASCII"

We have seen the tag list that SimpleParse + mxTextTools can provide; now we need to examine the grammar used to generate that tag list. The real work happens in the grammar. An EBNF grammar practically reads itself (although designing one does take a bit of thought and testing):
typographify.def


para         := (plain / markup)+
plain        := (word / whitespace / punctuation)+
whitespace   := [ \t\r\n]+
alphanums    := [a-zA-Z0-9]+
word         := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct    := [-_]
contraction  := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup       := emph / strong / module / code / title
emph         := '-', plain, '-'
strong       := '*', plain, '*'
module       := '[', plain, ']'
code         := "'", plain, "'"
title        := '_', plain, '_'
punctuation  := (safepunct / mdash)
mdash        := '--'
safepunct    := [!@#$%^&()+=|\{}:;<>,.?/"]

This grammar reads almost exactly the way you would describe "smart ASCII" verbally, which makes for pleasing clarity. A paragraph consists of some plain text and some marked-up text. Plain text consists of a collection of words, whitespace, and punctuation. Marked-up text might be emphasized, strongly emphasized, a module name, and so on. Strongly emphasized text is surrounded by asterisks, and the other markup forms are built in the same fashion. A few features, such as what exactly counts as a "word" or what can end a contraction, take a bit of thought, but the EBNF syntax itself poses no obstacle.
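
For instance, here is how a few tricky tokens decompose under the word production (worked examples added here for illustration):

don't          alphanums "don" + contraction "'" "t"
o'clock        alphanums "o" + contraction "'" "clock"
spam-and-eggs  alphanums "spam" + (wordpunct "-" alphanums "and")
                                + (wordpunct "-" alphanums "eggs")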

In contrast, regular expressions can describe similar rules more compactly, and the first version of the "smart ASCII" markup program did just that. But that compactness is much harder to write and harder to adjust later. The following code expresses a largely (but not precisely) equivalent set of rules:
Smart ASCII Python regexes


# [module] names
re_mods =   r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])"""
# *strongly emphasize* words
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
# -emphasize- words
re_emph =   r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])"""
# _Book Title_ citations
re_title =  r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])"""
# 'function()' names
re_funcs =  r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""

If you discover or invent a slightly updated variant of the language, it is much easier to accommodate with the EBNF grammar than with those regular expressions. Moreover, mxTextTools generally performs its pattern operations even faster.
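
For example, suppose we wanted a new markup form for keyboard input, written ~like this~ (the tilde markup is my own hypothetical extension, not part of the article's actual language). Two changed lines in typographify.def would suffice:

markup       := emph / strong / module / code / title / kbd
kbd          := '~', plain, '~'

The tilde works here because it does not appear in safepunct, so plain text cannot swallow it before the kbd production gets a chance to match.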

Generating and using a tag list

For the sample program, we put the actual grammar in a separate file. For most purposes this organization is better and easier to work with: changing the grammar and changing the application logic are generally different kinds of tasks, and the separate files reflect that. But all we ever do with the grammar is pass it as a string to a SimpleParse function, so we could in principle include it in the main application (or even generate it dynamically in some way).
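
As a minimal sketch of that inline alternative (the toy grammar here is my own, cut down for brevity):

from simpleparse import generator

decl = r"""
para       := (word / whitespace)+
word       := [a-zA-Z0-9]+
whitespace := [ \t\r\n]+
"""
parser = generator.buildParser(decl).parserbyname('para')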

Let's explore the complete (simplified) markup application:
typographify.py


import os
from sys import stdin, stdout, stderr
from simpleparse import generator
from mx.TextTools import TextTools
from typo_html import codes

input = stdin.read()
decl = open('typographify.def').read()
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag(input, parser)
for tag, beg, end, parts in taglist[1]:
    if tag == 'plain':
        stdout.write(input[beg:end])
    elif tag == 'markup':
        markup = parts[0]
        mtag, mbeg, mend = markup[:3]
        start, stop = codes.get(mtag, ('<!-- unknown -->', '<!-- / -->'))
        stdout.write(start + input[mbeg+1:mend-1] + stop)
stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input)))

That is all there is to it. First we read in the grammar and create an mxTextTools parser from it. Next, we apply the tag table/parser to the input source to create a tag list. Finally, we loop through the tag list and emit new marked-up text. The loop could, of course, do anything else we wanted with each production it encounters.
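
Since the program reads from standard input and writes to standard output, it runs as a filter; the file names here are placeholders of my own choosing:

python typographify.py < input.txt > output.html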

Because of the particular grammar of smart ASCII, everything in the source text falls into either a "plain" production or a "markup" production, so it suffices to loop over a single level of the tag list (except to look exactly one level down for the specific markup production involved, such as "title"). But a more free-form grammar, like those of most programming languages, could just as easily be walked by recursing through the tag list and looking for production names at every level. For example, if the grammar allowed markup codes to nest, this recursive style would be necessary; a sketch follows. You might enjoy the exercise of working out how to adjust the grammar (hint: remember to allow productions to refer to each other recursively).
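
Here is a hedged sketch of such recursive emission, on the assumption that the markup productions were changed to contain (plain / markup)+ content rather than just plain; this is not the article's actual program:

from sys import stdout
from typo_html import codes

def emit(text, results):
    for tag, beg, end, parts in results:
        if tag == 'plain':
            stdout.write(text[beg:end])
        elif tag == 'markup':
            # parts[0] is the specific markup production, e.g. 'strong'
            mtag, mbeg, mend, mparts = parts[0]
            start, stop = codes.get(mtag, ('<!-- unknown -->', '<!-- / -->'))
            stdout.write(start)
            emit(text, mparts)   # recurse, so nested markup also works
            stdout.write(stop)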

The particular markup codes that go to the output live in yet another file, for organizational rather than essential reasons. Here we use a trick of treating a dictionary as a switch statement (although the "otherwise" case in the example remains too narrow). The idea is that we might in the future want to generate output in multiple formats, such as HTML, DocBook, LaTeX, or others. The markup file used for the example looks like:
typo_html.py


codes = \
{ 'emph'    : ('<em>', '</em>'),
  'strong'  : ('<strong>', '</strong>'),
  'module'  : ('<em><code>', '</code></em>'),
  'code'    : ('<code>', '</code>'),
  'title'   : ('<cite>', '</cite>'),
}

Extending this mapping to other output formats is simple; a hypothetical LaTeX version is sketched below.
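
For instance, a typo_latex.py in the same shape might look like this (my own sketch, not part of the original article):

codes = \
{ 'emph'    : ('\\emph{', '}'),
  'strong'  : ('\\textbf{', '}'),
  'module'  : ('\\texttt{', '}'),
  'code'    : ('\\texttt{', '}'),
  'title'   : ('\\textit{', '}'),
}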

Conclusion

SimpleParse provides a concise and easy-to-read EBNF-style wrapper around the power and speed of the cryptic mxTextTools C module. Moreover, many programmers are already fairly familiar with EBNF notation, even if only in passing. Which approach is easier to understand is partly a matter of intuition, and I cannot offer proof; but I can give a quantitative measure: source code length. The size of the previously hand-developed mxTypographify module is as follows:

wc mxTypographify.py

199         776       7041 mxTypographify.py

Of these 199 lines, a significant number are comments. Eighteen of them are the regular-expression version of the markup function, included for timing comparisons. Still, what the program does is essentially the same as what typographify.py, listed above, does. By contrast, our SimpleParse program, including its support files, measures as follows:

wc typo*.def typo*.py
19      79     645 typographify.def
20      79     721 typographify.py
 6      25     205 typo_html.py
45     183    1571 total

In other words, it has only about one-quarter as many lines. This version has fewer comments, but that is mainly because the EBNF grammar is largely self-documenting. I do not want to lean too heavily on lines of code; obviously, you can play games that minimize or maximize code length. Still, some empirical studies of programmers' work suggest that "thousands of lines of code per programmer per month" is fairly close to constant, regardless of language and libraries. Then again, the regular-expression version is only a third the length of the SimpleParse version; but I find the density of its expressions makes it extremely difficult to maintain, and it was even harder to write. All in all, SimpleParse seems the best approach of those considered.

