A tutorial on using the Spark module for parsing in Python


In everyday programming, I often need to identify the parts and structures that exist in text documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little language" governing what can appear in them. My approach to writing programs for these informal parsing tasks has always been something of a hodgepodge of custom state machines, regular expressions, and context-driven string tests. The pattern in these programs is always roughly: "read some text, figure out whether anything can be done with it, then maybe read some more text, and keep trying."

Parsers distill the description of the parts and structures in a document into concise, clear, and prescriptive rules that determine what constitutes the document. Most formal parsers use variants of Extended Backus-Naur Form (EBNF) to describe the "grammar" of the language they handle. Basically, an EBNF grammar gives names to the parts you might find in a document, and larger parts are usually composed of smaller parts. The frequency and order in which the smaller parts may appear in larger parts is specified by operators. For example, Listing 1 is the EBNF grammar typographify.def, which we saw in the SimpleParse article (other tools work a little differently):

Listing 1. typographify.def


para    := (plain / markup)+
plain    := (word / whitespace / punctuation)+
whitespace := [ \t\r\n]+
alphanums  := [a-zA-Z0-9]+
word    := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct  := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup   := emph / strong / module / code / title
emph    := '-', plain, '-'
strong   := '*', plain, '*'
module   := '[', plain, ']'
code    := "'", plain, "'"
title    := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash    := '--'
safepunct  := [!@#$%^&()+=|\{}:;<>,.?/"]

About Spark

The Spark parser has something in common with EBNF grammars, but it breaks the parsing/processing into smaller components than a traditional EBNF grammar allows. The advantage of Spark is that it gives you fine-grained control over each step of the process, along with the ability to insert custom code into the process. If you have read the SimpleParse article in this series, you will recall that our procedure was, roughly: 1) generate a complete list of tags from the grammar (and from the source document), and 2) use the tag list as data for custom programming operations.

The disadvantage of Spark compared with standard EBNF-based tools is that it is verbose and lacks direct quantifiers (that is, "+" for one-or-more, "*" for zero-or-more, and "?" for optional). Quantifiers can be used in the regular expressions of the Spark tokenizer, and they can be simulated with recursion in the parse-expression grammar. It would be even better if Spark allowed quantifiers in grammar expressions. Another drawback worth mentioning is that Spark's speed pales in comparison with the C-based mxTextTools engine underlying SimpleParse.

In "Compiling Little Languages in Python" (see resources), John Aycock, the founder of Spark, divides the compiler into four phases. The problems discussed in this article cover only the first two and a half phases, which is due to two reasons: 1 because of the length of the article, and 2 because we will only discuss the same relatively simple "text markup" problem raised in the previous article. Spark can also be used one step further as a full-cycle code compiler/interpreter, rather than just for the "parse and process" tasks I've described. Let's take a look at the four phases described by Aycock (abridged) :

1. Scanning, also known as lexical analysis. Divide the input stream into a list of tokens.
2. Parsing, also known as syntax analysis. Ensure that the list of tokens is syntactically valid.
3. Semantic analysis. Traverse the abstract syntax tree (AST) one or more times, collecting information and checking that the input program makes sense.
4. Code generation. Traverse the AST again; this phase may directly interpret the program, or output code in C or assembly.

For each phase, Spark provides one or more abstract classes that perform the step, along with an unusual protocol for specializing those classes. Spark's concrete classes do not simply redefine or add specific methods, as classes in most inheritance hierarchies do; instead they have two special features (the same general pattern applies to each phase and its various parent classes). First, most of what a concrete class does is specified in the docstrings of its methods. Second, the set of methods that describe a pattern are given distinctive names indicating their role. The parent classes, in turn, contain introspective methods that look up the instance's capabilities in order to operate on them. We will see this more clearly when we look at the examples.

Recognizing text markup

I have solved the problem addressed here in several other ways before. I use a format I call "smart ASCII" for a variety of purposes. The format looks a lot like the conventions developed for email and newsgroup communication. For various purposes, I automatically convert this format to other formats, such as HTML, XML, and LaTeX. I am going to do that once more here. To give you a sense of what I mean, I will use the following short sample in this article:

Listing 2. Smart ASCII sample text (p.txt)

Text with *bold*, and -itals phrase-, and [module]--this
should be a good 'practice run'.

There is a little more to the format than what appears in the sample file, but not much (although there is one subtle point about how markup and punctuation interact).

Generating tokens

The first thing our Spark "smart ASCII" parser needs to do is break the input text into relevant parts. At the tokenization level we do not yet want to talk about how the tokens are structured; we simply leave them as they are. Later we will combine the token sequence into a parse tree.

The grammar shown in typographify.def above provides design guidance for the Spark lexical analyzer/scanner. Note that we can use only names that are "primitives" in the scanner phase; that is, the (compound) patterns that include other named patterns must be deferred to the parsing phase. Other than that, we can actually copy the old grammar fairly directly.

Listing 3. The truncated wordscanner.py Spark script

     


class WordScanner(GenericScanner):
  "Tokenize words, punctuation and markup"
  def tokenize(self, input):
    self.rv = []
    GenericScanner.tokenize(self, input)
    return self.rv
  def t_whitespace(self, s):
    r" [ \t\r\n]+ "
    self.rv.append(Token('whitespace', ' '))
  def t_alphanums(self, s):
    r" [a-zA-Z0-9]+ "
    print "{word}",
    self.rv.append(Token('alphanums', s))
  def t_safepunct(self, s): ...
  def t_bracket(self, s): ...
  def t_asterisk(self, s): ...
  def t_underscore(self, s): ...
  def t_apostrophe(self, s): ...
  def t_dash(self, s): ...
class WordPlusScanner(WordScanner):
  "Enhance word/markup tokenization"
  def t_contraction(self, s):
    r" (?<=[a-zA-Z])'(am|clock|d|ll|m|re|s|t|ve) "
    self.rv.append(Token('contraction', s))
  def t_mdash(self, s):
    r' -- '
    self.rv.append(Token('mdash', s))
  def t_wordpunct(self, s): ...

Here is an interesting point: WordScanner is itself a perfectly good scanner class, but a Spark scanner class can be specialized one step further by inheritance: child regular-expression patterns are matched before parent patterns, and child methods/regular expressions can override the parent's if needed. So WordPlusScanner will try the specialized matches before WordScanner does (and may therefore claim some bytes first). Any regular expression is allowed in a pattern docstring (for example, the .t_contraction() method contains a "lookbehind assertion" in its pattern).
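As a quick illustration of that lookbehind (using only the standard re module rather than Spark itself, with an invented test string):

import re
# The same pattern used in .t_contraction(): the apostrophe-plus-suffix only
# matches when a letter immediately precedes it, so apostrophes used as
# markup quotes are left alone.
contraction = re.compile(r"(?<=[a-zA-Z])'(am|clock|d|ll|m|re|s|t|ve)")
print contraction.findall("it's a good 'practice run'")   # prints ['s']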

Unfortunately, Python 2.2 somewhat breaks this scanner inheritance logic. In Python 2.2, all defined patterns are matched alphabetically (by name), regardless of where they are defined in the inheritance chain. To fix this, you can change one line of code in the Spark function _namelist():

Listing 4. The spark.py function after correction


def _namelist(instance):
  namelist, namedict, classlist = [], {}, [instance.__class__]
  for c in classlist:
    for b in c.__bases__:
      classlist.append(b)
    # for name in dir(c):  # dir() behavior changed in 2.2
    for name in c.__dict__.keys(): # <-- USE THIS
      if not namedict.has_key(name):
        namelist.append(name)
        namedict[name] = 1
  return namelist

I have reported this problem to Spark creator John Aycock, and it will be fixed in a future version. In the meantime, make the change in your own copy.

Let's see what happens when WordPlusScanner is applied to the "smart ASCII" sample above. The list it creates is actually a list of Token instances, but they contain a .__repr__ method that lets them display nicely:

Listing 5. Tokenizing the "smart ASCII" sample with WordPlusScanner

>>> from wordscanner import WordPlusScanner
>>> tokens = WordPlusScanner().tokenize(open('p.txt').read())
>>> filter(lambda s: s <> 'whitespace', tokens)
[Text, with, *, bold, *, ,, and, -, itals, phrase, -, ,, and, [,
module, ], --, this, should, be, a, good, ', practice, run, ', .]

It is worth noting that although methods such as .t_alphanums() are recognized by Spark's introspection based on their "t_" prefix, they are otherwise ordinary methods. Any additional code in such a method is executed whenever the corresponding token is encountered. The .t_alphanums() method contains a small example of this, in the form of a print statement.
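As a minimal sketch of that idea -- not part of the article's code, and assuming the same Token class and imports as wordscanner.py -- a further subclass could keep a running count as a side effect:

class CountingScanner(WordPlusScanner):
  "Count alphanumeric words as a side effect of tokenizing"
  def tokenize(self, input):
    self.word_count = 0
    return WordPlusScanner.tokenize(self, input)
  def t_alphanums(self, s):
    r" [a-zA-Z0-9]+ "
    self.word_count += 1              # any extra code runs per matched token
    self.rv.append(Token('alphanums', s))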

Generating an abstract syntax tree

Finding the tokens is somewhat interesting, but what is really interesting is how a grammar is applied to the list of tokens. The parsing phase creates arbitrary tree structures from the token list; all that is required is to specify an expression grammar.

Spark has several ways to create an AST. The "manual" way is to specialize the GenericParser class. In this case, the concrete child parser provides a number of methods of the form p_foobar(self, args). The docstring of each such method contains one or more assignments of patterns to names. Each method can contain any code to execute whenever its grammar expression is matched.
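A minimal sketch of that "manual" style (the class name and the single rule below are invented for illustration; they are not from the article) might look like this:

from spark import GenericParser

class ManualParser(GenericParser):
  "Illustrative 'manual' parser: docstrings hold rules, bodies hold code"
  def __init__(self, start='para'):
    GenericParser.__init__(self, start)
  def p_para_plain(self, args):
    ' para ::= para plain '
    # args holds the already-reduced right-hand-side items; any code can run
    # here when the rule matches, and the return value propagates upward.
    return args[0]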

However, Spark also provides a way to generate the AST "automatically": you inherit from the GenericASTBuilder class. All the grammar expressions are listed in a single top-level method, and the .terminal() and .nonterminal() methods can be specialized to manipulate subtrees during generation (or to perform any other action you like). The result is still an AST, but the parent class does most of the work for you. My grammar class looks something like the following:

Listing 6. The truncated markupbuilder.py Spark script


class MarkupBuilder(GenericASTBuilder):
  "Write out HTML markup based on matched markup"
  def p_para(self, args):
    '''
    para  ::= plain
    para  ::= markup
    para  ::= para plain
    para  ::= para emph
    para  ::= para strong
    para  ::= para module
    para  ::= para code
    para  ::= para title
    plain ::= whitespace
    plain ::= alphanums
    plain ::= contraction
    plain ::= safepunct
    plain ::= mdash
    plain ::= wordpunct
    plain ::= plain plain
    emph  ::= dash plain dash
    strong ::= asterisk plain asterisk
    module ::= bracket plain bracket
    code  ::= apostrophe plain apostrophe
    title ::= underscore plain underscore
    '''
  def nonterminal(self, type_, args):
    # Flatten AST a bit by not making nodes if only one child.
    if len(args)==1: return args[0]
    if type_=='para': return GenericASTBuilder.nonterminal(self, type_, args)
    if type_=='plain':
      args[0].attr = foldtree(args[0])+foldtree(args[1])
      args[0].type = type_
      return GenericASTBuilder.nonterminal(self, type_, args[:1])
    phrase_node = AST(type_)
    phrase_node.attr = foldtree(args[1])
    return phrase_node

My .p_para() should contain only a set of grammar rules in its docstring (and no code). I decided to use .nonterminal() to flatten the AST slightly. "plain" nodes made up of a series of "plain" subtrees are compressed into a single longer string. Similarly, markup subtrees (i.e., "emph", "strong", "module", "code", and "title") collapse into a single node of the correct type containing a composite string.
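The foldtree() helper is not shown in the truncated listing. A plausible sketch of what it does, assuming nodes expose .attr and ._kids the way emitHTML below does (the real markupbuilder.py may differ), is:

def foldtree(node):
  # Collapse a subtree into the text it spans: leaves (and already-folded
  # nodes) carry their text in .attr; otherwise concatenate the children.
  if hasattr(node, 'attr'):
    return node.attr
  return ''.join([foldtree(kid) for kid in node._kids])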

As mentioned above, one thing is clearly missing from Spark grammars: there are no quantifiers. With a rule like the following,

plain ::= plain plain

we can aggregate "plain" subtrees in pairs. However, I would prefer that Spark allowed a more EBNF-style grammar expression, such as:

plain ::= plain+

Then we could more easily create n-ary subtrees meaning "as many plains as possible." With that, our tree would start out flatter, even without the extra hints we pass in .nonterminal().

Using the tree

The Spark module provides several classes for working with the AST. These offer more capability than I need for my purposes, but if you want them, GenericASTTraversal and GenericASTMatcher provide ways to traverse the tree, using inheritance protocols similar to those provided for the scanner and parser.

But traversing the tree with nothing more than recursive functions is not at all difficult. I created a few such examples in the article's zip file, prettyprint.py (see Resources). One of them is showtree(), which displays an AST following a few conventions (a rough sketch of such a function follows the list below):

• Each row shows the descent depth.
• Only nodes that have children (and no content of their own) begin with dashes.
• Node types are enclosed in double angle brackets.
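A rough sketch of a recursive display function along these lines (it is only an approximation; the real showtree() in prettyprint.py may differ in its details):

def showtree(node, depth=0):
  if hasattr(node, 'attr'):
    # Leaf-like node: indent with spaces and show its (folded) text.
    print ' %d %s<<%s>> %s' % (depth, ' '*depth, node.type, node.attr)
  else:
    # Interior node: one dash per level of depth, then recurse into children.
    print ' %d %s<<%s>>' % (depth, '-'*depth, node.type)
    for kid in node._kids:
      showtree(kid, depth+1)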

Let's take a look at the AST generated in the above example:

Listing 7. The AST produced from the "smart ASCII" sample


 >>> from wordscanner import tokensFromFname
>>> from markupbuilder import treeFromTokens
>>> from prettyprint import showtree
>>> showtree(treeFromTokens(tokensFromFname('p.txt')))
 0 <<para>>
 1 - <<para>>
 2 -- <<para>>
 3 --- <<para>>
 4 ---- <<para>>
 5 ----- <<para>>
 6 ------ <<para>>
 7 ------- <<para>>
 8 -------- <<plain>>
 9      <<plain>> Text with
 8     <<strong>> bold
 7 ------- <<plain>>
 8     <<plain>> , and
 6    <<emph>> itals phrase
 5 ----- <<plain>>
 6    <<plain>> , and
 4   <<module>> module
 3 --- <<plain>>
 4   <<plain>> --this should be a good
 2  <<code>> practice run
 1 - <<plain>>
 2  <<plain>> .

Understanding the tree structure is straightforward, but what about the converted markup we are really after? Fortunately, it takes only a few lines of code to traverse the tree and generate it:

Listing 8. Outputting markup from the AST (prettyprint.py)


import sys

def emitHTML(node):
  from typo_html import codes
  if hasattr(node, 'attr'):
    beg, end = codes[node.type]
    sys.stdout.write(beg+node.attr+end)
  else: map(emitHTML, node._kids)

The typo_html.py file is the same as the one in the SimpleParse article -- it just contains a dictionary mapping names to begin-tag/end-tag pairs. Obviously, we could use the same approach for markup other than HTML. In case you are wondering, here is what our example produces:

Listing 9. HTML output of the entire process

Text with <strong>bold</strong>, and <em>itals phrase</em>,
and <em><code>module</code></em>--this should be a good
<code>practice run</code>.
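For reference, the codes dictionary in typo_html.py presumably looks something like the following sketch; the pairs for emph, strong, module, and code are inferred from the output above, while plain and title are guesses:

# typo_html.py (sketch): node types mapped to begin/end tag pairs.
codes = {
  'plain':  ('', ''),
  'emph':   ('<em>', '</em>'),
  'strong': ('<strong>', '</strong>'),
  'module': ('<em><code>', '</code></em>'),
  'code':   ('<code>', '</code>'),
  'title':  ('<cite>', '</cite>'),   # guess: title markup never appears in p.txt
}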

Conclusion

Many Python programmers have recommended Spark to me. Although Spark uses an unusual protocol that takes some getting used to, and although the documentation can be confusing in places, its power is still surprising. The programming style Spark implements lets the end programmer insert blocks of code anywhere in the scanning/parsing process -- a process that is usually a "black box" to the end user.

For all its advantages, I find that Spark's real drawback is its speed. Spark is the first Python program I have used in which the speed penalty of an interpreted language turned out to be the major issue. Spark really is slow: not merely "I wish it were a bit faster" slow, but "I took a long lunch and wish it had finished by now" slow. In my experiments the tokenizer was quite fast, but the parsing was slow even with very small test cases. To be fair, John Aycock has pointed out to me that the Earley parsing algorithm Spark uses is far more comprehensive than the simpler LR algorithm, and that this is the main reason it is slow. It is also possible that, through inexperience, I designed an inefficient grammar; but even if so, most users would likely do the same.

