An introductory tutorial on the use of some of the natural language tools in Python


NLTK is an excellent tool for teaching and practicing computational linguistics using Python. Moreover, computational linguistics is closely related to fields such as artificial intelligence, language and speech recognition, translation, and grammar checking.
What does NLTK include?

It is natural to think of NLTK as a series of layers stacked on top of one another. Readers familiar with the grammar and parsing of artificial languages, such as Python, will not have much difficulty understanding the similar, but more esoteric, layers of a natural language model.
Glossary

Corpus (plural corpora): a collection of related texts. For example, Shakespeare's works might collectively be called a corpus; the works of several authors, corpora.

Histogram: the statistical distribution of the frequency of different words, letters, or other items in a data set.

Syntagmatic: the study of syntagma; that is, the statistical patterns of contiguous occurrences of letters, words, or phrases in a corpus.

Context-free grammar: Type 2 in Noam Chomsky's hierarchy of the four types of formal grammars. See Resources for a detailed description.

Although NLTK comes with a number of corpora that have been preprocessed (often manually) to varying degrees, conceptually each layer relies on the processing done by the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, such as noun phrases or sentences (according to one of several techniques, each with its own advantages and drawbacks); and finally sentences or other grammatical units are classified. Along the way, NLTK lets you generate statistics about the occurrences of the various elements and draw graphs that represent either the processing itself or statistical aggregates of the results.

In this article, you'll see some relatively complete examples of lower-level capabilities, while most of the higher-level capabilities will be described in the abstract. Now let's take a closer look at the first steps of text processing.

Tokenization

Much of what you can do with NLTK, particularly at the lower levels, is not very different from what you can do with Python's basic data structures. However, NLTK provides a set of systematic interfaces that the higher layers depend on and use, rather than simply providing its own convenience classes for handling tokenized or tagged text.

Specifically, the nltk.tokenizer.Token class is widely used to store annotated fragments of text; these annotations can mark many different features, including parts-of-speech, subtoken structures, the offset of a token within a larger text, morphological stems, grammatical sentence constituents, and so on. In fact, a Token is a special kind of dictionary -- and is accessed like one -- so it can hold whatever keys you like. A number of specialized keys are used throughout NLTK, with different keys used by different subpackages.

Let's take a brief look at how to create a token and break it into subtokens:
Listing 1. A first look at the nltk.tokenizer.Token class


>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>

Probability

One fairly simple thing you might want to do with a linguistic corpus is analyze the frequency distributions of the various events in it, and make probability predictions based on those known frequency distributions. NLTK supports a variety of methods for making probability predictions based on raw frequency data. I won't cover those methods here (see the probability tutorial listed in Resources), but suffice it to say that what you would fairly expect is related somewhat fuzzily to what you already know (beyond the obvious scaling/normalization).
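As a purely illustrative sketch -- plain Python, not NLTK's own estimator classes -- here is the difference between a raw relative-frequency estimate and one simple smoothed estimate (add-one, or Laplace, smoothing); the word counts below are invented:

# Illustration only -- not NLTK's API.  Compare a maximum-likelihood
# (relative-frequency) estimate with an add-one (Laplace) smoothed estimate.
counts = {'the': 53, 'cat': 2, 'mat': 1}      # invented word counts
total = sum(counts.values())
vocab = len(counts)

def mle(word):
    # the "obvious scaling": probability is just the normalized count,
    # so an unseen word gets probability 0
    return float(counts.get(word, 0)) / total

def laplace(word):
    # add 1 to every count, so unseen words get a small nonzero probability
    return float(counts.get(word, 0) + 1) / (total + vocab)

print mle('dog'), laplace('dog')    # 0.0 versus a small positive number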

Basically, NLTK supports two kinds of frequency distribution: histograms and conditional frequency distributions. The nltk.probability.FreqDist class is used to create histograms; for example, you can create a word histogram like this:
Listing 2. Create a basic histogram using nltk.probability.FreqDist


>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...   freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12

The probability tutorial discusses the creation of histograms over more complex features, such as "the length of words following a word ending in a vowel sound." The nltk.draw.plot.Plot class can be used to display histograms visually. Of course, you can also use this approach to analyze the frequency distribution of higher-level grammatical features, or even of data sets unrelated to NLTK.
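A histogram of that sort can be built with the same FreqDist calls used in Listing 2; here is a rough sketch that reuses the article token from that listing, with written vowels standing in (imperfectly) for vowel sounds:

# A sketch continuing the session from Listing 2: a histogram of the length
# of each word that follows a word ending in a (written) vowel.  Vowel
# letters are only a rough stand-in for vowel sounds.
vowel_final = FreqDist()
words = [tok['TEXT'] for tok in article['SUBTOKENS']]
for prev, nxt in zip(words, words[1:]):
    if prev[-1].lower() in 'aeiou':
        vowel_final.inc(len(nxt))
print vowel_final.count(4)     # how many such following words had length 4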

Conditional frequency distributions can be more interesting than plain histograms. A conditional frequency distribution is a kind of two-dimensional histogram: it shows you one histogram for each initial condition, or "context." For example, the tutorial raises the question of the distribution of word lengths for each initial letter. That is what we analyze here:
Listing 3. Conditional frequency distribution: word length for each initial letter


>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...   cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...   print "Init %s:" % c,
...   for length in range(1,6):
...     print "len %d/%.2f," % (length,cf[c].freq(n)),
...   print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,

One nice application of conditional frequency distributions, in a linguistic context, is analyzing syntagmatic distributions across a corpus -- for example, given the occurrence of a particular word, which word is most likely to come next. Grammar imposes some constraints here, of course; but the study of selection among syntactic options falls under semantics, pragmatics, and terminology.
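As a rough sketch of that kind of syntagmatic question -- reusing the ConditionalFreqDist calls from Listing 3 and the tokenized article from Listing 2 -- you might condition on the previous word and ask which word most often follows it ('the' here is just an arbitrary example):

# Condition on the preceding word; count what comes next.
bigrams = ConditionalFreqDist()
words = [tok['TEXT'].lower() for tok in article['SUBTOKENS']]
for prev, nxt in zip(words, words[1:]):
    bigrams[prev].inc(nxt)

# Find the most frequent follower of 'the' using only calls shown earlier.
best, best_count = None, 0
for w in bigrams['the'].samples():
    if bigrams['the'].count(w) > best_count:
        best, best_count = w, bigrams['the'].count(w)
print best, bigrams['the'].freq(best)    # likeliest next word, and its probability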

Stemming

The nltk.stemmer.porter.PorterStemmer class is an extremely handy tool for deriving grammatical stems from English words. This capability strikes a particular chord with me, because I previously created a general-purpose, full-text indexing search tool/library in Python (described in Developing a full-text indexer in Python, and used in a fair number of other projects).

While the ability to search a large collection of documents for a set of exact words is very useful (the work done by gnosis.indexer), for many search purposes a little fuzziness helps. Perhaps you are not quite sure whether the email you are looking for uses the word "complicated," "complications," "complicating," or "complicates," but you do remember it was about something along those lines (probably together with some other words, to make a worthwhile search).

NLTK includes an excellent algorithm for stemming words, and lets you customize stemming algorithms to your liking:
Listing 4. Stemming words for their morphological roots


>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'

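To illustrate the fuzzy-matching point above: the Porter algorithm reduces all of the variant forms mentioned earlier to that same stem (continuing the session from Listing 4):


>>> from nltk.stemmer.porter import PorterStemmer
>>> for w in ('complicated','complications','complicating','complicates'):
...   print PorterStemmer().stem_word(w),
...
complic complic complic complic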
In fact, how you can take advantage of stemming in gnosis.indexer, its derivatives, or an entirely different indexing tool depends on your usage context. Fortunately, gnosis.indexer has an open interface that is easy to customize. Do you need an index composed entirely of stems? Or do you want both the full words and the stems in the index? Do you need to separate stem matches from exact matches in the results? In a future version of gnosis.indexer I will introduce some kind of stemming capability, but end users may still want to customize it in different ways.

In any case, adding stemming is generally quite simple: first, derive stems from a document by specializing gnosis.indexer.TextSplitter; then, of course, (optionally) also stem the search criteria before performing the index lookup, perhaps by customizing your MyIndexer.find() method.
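Here is a minimal sketch of that first step, assuming only the TextSplitter interface shown in Listing 6 (an instance method text_splitter() that returns a list of words); the subclass name is mine, and this is not gnosis.indexer's actual stemming support, just one way you might specialize it:

from gnosis.indexer import TextSplitter
from nltk.stemmer.porter import PorterStemmer

class StemmingSplitter(TextSplitter):
    # Hypothetical specialization: split as usual, then store stems
    # rather than the surface words, so the index is built on stems.
    def text_splitter(self, text):
        stemmer = PorterStemmer()
        words = TextSplitter.text_splitter(self, text)
        return [stemmer.stem_word(w.lower()) for w in words]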

While working with PorterStemmer, I found that the nltk.tokenizer.WSTokenizer class really is as bad as the tutorial cautions. It is fine for a conceptual role, but for real-world texts you can do much better at identifying what a "word" is. Fortunately, gnosis.indexer.TextSplitter is a robust tokenizer. For example:
Listing 5. Stemming based on NLTK's crude tokenizer


>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...   stemmer.stem(word)
...   stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
'"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
'"state', '"understand', '"weightless', '"whatev', '#', '#-----',
'#----------', '#-------------', '#---------------', '#b17:']

Look at a few of the stems. Not all the stems in this collection look usable for indexing. Many are not real words at all, and others are composites with dashes or have extraneous punctuation attached. Let's try again with a better tokenizer:
Listing 6. Stemming using clever heuristics in the tokenizer


>>> from gnosis.indexer import TextSplitter as TS
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...   stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
'certain', 'certainli', 'chang', 'charm']

Here you can see that some words have multiple possible expansions, and all of them look like words or morphemes. Tokenization matters a great deal for arbitrary text collections. To be fair, the corpora bundled with NLTK have been packaged so that WSTokenizer() is an easy and accurate tokenizer. To build a robust indexer that works in practice, you need a robust tokenizer.

Tagging, chunking, and parsing

The largest part of NLTK consists of various parsers with varying levels of complexity. For the most part, this introduction will not explain them in detail, but I would like to outline what they are trying to accomplish.

Remember that tokens are specialized dictionaries -- in this case, ones that can contain a TAG key to indicate the grammatical role of a word. NLTK corpus documents usually come pre-tagged for parts of speech, but you can certainly add your own tags to untagged documents.
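Since a token is just a specialized dictionary (as Listing 1 showed), adding your own tag amounts to setting a key; here TAG gets the Penn Treebank noun label 'NN', chosen purely as an example:


>>> from nltk.tokenizer import *
>>> word = Token(TEXT='cat')
>>> word['TAG'] = 'NN'    # any key works; TAG is the one used for grammatical roles
>>> print word['TAG']
NN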

Chunking is something like "parsing lightly." That is, chunking works either from existing markup of grammatical elements, or from markup you add manually -- or semi-automatically, using regular expressions and program logic. But it is not really parsing, properly speaking (there are no production rules as such). For example:
Listing 7. Chunk parsing/tagging: words and bigger units


>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> print chunk_structure
(S:
 (NP: <the/DT> <little/JJ> <cat/NN>)
 <sat/VBD>
 <on/IN>
 (NP: <the/DT> <mat/NN>))

The chunking just mentioned can be done with the nltk.tokenizer.RegexpChunkParser class, which uses pseudo-regular expressions to describe the series of tags that make up a grammatical element. Here is an example from the probability tutorial:
Listing 8. Chunking using regular expressions over tags


>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
...        'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
 <sat/VBD> <on/IN>
 (NP: <the/DT> <mat/NN>))

True parsing leads us into many theoretical areas. For example, a top-down parser is guaranteed to find every possible production, but can be very slow because of frequent (exponential) backtracking. Shift-reduce parsing is much more efficient, but can miss some productions. In either case, a grammar is declared in a way similar to the grammar declarations used for parsing artificial languages. This column has covered several of those: SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity (see Resources).

Beyond top-down and shift-reduce parsers, NLTK also offers "chart parsers," which build partial hypotheses that a given sequence can later complete into a rule. This approach can be both efficient and complete. A quick (toy-level) example:
Listing 9. Defining basic productions for a context-free grammar


>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...  S -> NP VP
...  VP -> V NP | VP PP
...  V -> "saw" | "ate"
...  NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...  Det -> "a" | "an" | "the" | "my"
...  N -> "dog" | "cat" | "cookie"
...  PP -> P NP
...  P -> "on" | "by" | "with"
...  ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
 (NP: <John>)
 (VP:
  (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
  (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
 (NP: <John>)
 (VP:
  (V: <saw>)
  (NP:
   (NP: (Det: <a>) (N: <cat>))
   (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))

A probabilistic context-free grammar (PCFG) is a context-free grammar that associates a probability with each of its productions. Again, parsers for probabilistic parsing are bundled with NLTK.
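As a notation-only illustration (plain Python data, not NLTK's PCFG classes), a PCFG is just the grammar of Listing 9 with a probability attached to each production; the numbers below are invented, and the probabilities for any one left-hand side must sum to 1:

# Invented probabilities, for illustration only -- not NLTK's API.
pcfg = {
    'VP': [(('V', 'NP'), 0.7), (('VP', 'PP'), 0.3)],
    'PP': [(('P', 'NP'), 1.0)],
}
for lhs in pcfg:
    # each left-hand side's production probabilities must sum to 1
    total = sum([p for rhs, p in pcfg[lhs]])
    assert abs(total - 1.0) < 1e-9, lhs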

What are you waiting for?

NLTK has other important capabilities not covered in this brief introduction. For example, NLTK has a complete framework for text classification using statistical techniques such as "naive Bayes" and "maximum entropy" models. Even if I had the space, I could not yet explain the nature of those techniques adequately here. But I believe that even NLTK's lower layers make a practical framework for both teaching and real applications.

