Text processing in Python in detail

  • 2020-05-09 18:46:50
  • OfStack

String - an immutable sequence

As in most high-level programming languages, variable-length strings are a basic type in Python. Python allocates memory "behind the scenes" to hold strings (or other values), so programmers don't have to worry about it. Python also has some string-handling capabilities that are not available in other high-level languages.

In Python, strings are "immutable sequences." Although you can't modify strings "in place" (as you might with a byte array), a program can refer to elements or subsequences of strings, just as with any sequence. Python uses a flexible "slicing" operation to refer to a subsequence, and the format of a slice is similar to that of a range of rows or columns in a spreadsheet. The following interactive session illustrates the use of strings and slices:
Strings and slices


>>> s = "mary had a little lamb"
>>> s[0]  # index is zero-based
'm'
>>> s[3] = 'x'  # changing element in-place fails
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s[11:18]  # 'slice' a subsequence
'little '
>>> s[:4]  # empty slice-begin assumes zero
'mary'
>>> s[4]  # index 4 is not included in slice [:4]
' '
>>> s[5:-5]  # can use "from end" index with negatives
'had a little'
>>> s[:5]+s[5:]  # slice-begin & slice-end are complementary
'mary had a little lamb'
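
Because strings are immutable, the usual way to "change" one is to build a new string out of slices. The following is only a minimal sketch, written in Python 3 syntax (an assumption, since the session above dates from the Python 1.5 era); the variable names are illustrative.

```python
s = "mary had a little lamb"

# In-place assignment is rejected because strings are immutable.
try:
    s[3] = "x"
    assignment_failed = False
except TypeError:
    assignment_failed = True

# Build a new string from two slices plus the replacement character.
modified = s[:3] + "x" + s[4:]

print(assignment_failed)  # True
print(modified)           # marx had a little lamb
print(s)                  # mary had a little lamb (the original is untouched)
```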

Another powerful string operation is the simple in keyword. It provides two intuitive and efficient constructs:
The in keyword


>>> s = "mary had a little lamb"
>>> for c in s[11:18]: print c,  # print each char in slice
...
l i t t l e
>>> if 'x' in s: print 'got x'  # test for char occurrence
...
>>> if 'y' in s: print 'got y'  # test for char occurrence
...
got y
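
The same two constructs can be sketched in Python 3 syntax (an assumption; the session above uses the older print statement). Note that in current Python versions the in keyword also tests for whole substrings, not just single characters. The variable names are illustrative.

```python
s = "mary had a little lamb"

# Iterating over a slice visits one character at a time.
chars = list(s[11:18])

# Membership tests evaluate to a boolean.
has_x = "x" in s
has_y = "y" in s

# In current Python versions, `in` also tests for whole substrings.
has_little = "little" in s

print(chars)
print(has_x, has_y, has_little)  # False True True
```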

In Python, there are several ways to write string literals. You can use single or double quotation marks, as long as the opening and closing marks match, and there are other useful quoting variants. If a string contains newline characters or embedded quotation marks, triple quotes make it easy to define, as shown in the following example:
The use of triple quotes


>>> s2 = """Mary had a little lamb
... its fleece was white as snow
... and everywhere that Mary went
... the lamb was sure to go"""
>>> print s2
Mary had a little lamb
its fleece was white as snow
and everywhere that Mary went
the lamb was sure to go
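
The session above shows embedded newlines; embedded quotation marks work the same way. A minimal sketch in Python 3 syntax (an assumption), using an invented sample sentence:

```python
# A triple-quoted string may contain both kinds of quotation mark
# and literal line breaks without any escaping.
quote = """She said, "It's Mary's lamb."
Then she left."""

lines = quote.split("\n")

print(len(lines))  # 2
print(lines[0])    # She said, "It's Mary's lamb."
```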

Single-quoted and triple-quoted strings can both be preceded by the letter "r" to indicate that Python should not interpret backslash escape sequences, which keeps regular expression special characters intact. For example:
The use of "r-strings"


>>> s3 = "this \n and \n that"
>>> print s3
this
 and
 that
>>> s4 = r"this \n and \n that"
>>> print s4
this \n and \n that

In "r-strings," the backslash that might otherwise form part of an escape sequence is treated as a regular backslash. This topic will be explained further in a future discussion of regular expressions.
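
One way to see the difference is to compare lengths: an escaped newline is a single character, while in an r-string the backslash and the "n" are two separate characters. A minimal sketch in Python 3 syntax (an assumption), with illustrative variable names:

```python
escaped = "this \n and \n that"
raw = r"this \n and \n that"

print(len(escaped))         # 17: each \n is one newline character
print(len(raw))             # 19: each \n is a backslash followed by 'n'
print(escaped.count("\n"))  # 2
print(raw.count("\\"))      # 2 literal backslashes
```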

File and string variables

When we talk about "text processing," we usually mean processing the contents of files. Python makes it easy to read the contents of a text file into a string variable that you can manipulate. File objects provide three "read" methods: .read(), .readline(), and .readlines(). Each method can accept an argument that limits the amount of data read at one time, but they are usually called without one. .read() reads the entire file at once and is typically used to put the contents of the file into a single string variable. Although .read() produces the most direct string representation of the file contents, it is awkward for sequential line-oriented processing and is not possible if the file is larger than available memory.

.readline() and .readlines() are very similar. They are used in constructs such as the following:
Python .readlines() example


fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print line

The difference between .readline() and .readlines() is that the latter, like .read(), reads the entire file at once. .readlines() automatically parses the contents of the file into a list of lines, which can then be processed by Python's for ... in ... construct. .readline(), on the other hand, reads only one line at a time, and is usually much slower than .readlines(). You should use .readline() only if you don't have enough memory to read the entire file at once.
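
The three read methods can be compared side by side. The following is a minimal sketch in Python 3 syntax (an assumption); it writes a small temporary file as a stand-in for a real text file, and all names and the sample text are illustrative. In current Python versions, iterating over the file object itself also yields one line at a time without loading the whole file.

```python
import os
import tempfile

text = "line one\nline two\nline three\n"

# Create a throwaway file to read from (a stand-in for a real text file).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(text)
    path = tmp.name

try:
    with open(path) as fh:
        whole = fh.read()        # the entire file as one string

    with open(path) as fh:
        lines = fh.readlines()   # the entire file as a list of lines

    with open(path) as fh:
        first = fh.readline()    # one line per call
        second = fh.readline()

    with open(path) as fh:
        # Iterating over the file object reads line by line.
        stripped = [line.rstrip("\n") for line in fh]
finally:
    os.remove(path)

print(repr(whole))
print(lines)
print(repr(first), repr(second))
print(stripped)
```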

If you are using a standard module that works with files, you can use the cStringIO module to turn a string into a "virtual file" (if you need to subclass the module, you can use the StringIO module instead, which beginners are unlikely to need). For example:
cStringIO module


>>> import cStringIO
>>> fh = cStringIO.StringIO()
>>> fh.write("mary had a little lamb")
>>> fh.getvalue()
'mary had a little lamb'
>>> fh.seek(5)
>>> fh.write('ATE')
>>> fh.getvalue()
'mary ATE a little lamb'

However, keep in mind that cStringIO "virtual files" are not permanent, which is different from real files. If you do not save it (such as writing it to a real file, or using the shelve module or database), it will disappear when the program ends.
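
cStringIO exists only in Python 2. In Python 3 the comparable class is io.StringIO, so the session above could be sketched as follows (the variable names are illustrative):

```python
import io

fh = io.StringIO()
fh.write("mary had a little lamb")
before = fh.getvalue()

# Move to offset 5 and overwrite three characters in place.
fh.seek(5)
fh.write("ATE")
after = fh.getvalue()

# The virtual file can also be read back like a real file.
fh.seek(0)
words = fh.read().split()

print(before)  # mary had a little lamb
print(after)   # mary ATE a little lamb
print(words)
```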

Standard module: string

The string module is perhaps the most commonly used module in the Python 1.5.* standard distribution. In fact, in Python 1.6 or later, the functionality of the string module will be available as built-in string methods (details had not been published at the time of this writing). Certainly, any program that performs text processing tasks should probably start with the following line:
Start using the string module

import string

A general rule of thumb is that if you can use the string module to do the job, that is the right way to do it. string functions are generally faster than re (regular expressions), and in most cases they are easier to understand and maintain. Third-party Python modules, including some fast modules written in C, are suitable for specialized tasks, but portability and familiarity suggest using string whenever possible. There are exceptions, but not as many as you might think if you are used to other languages.

The string module contains several kinds of things, such as functions, methods, and classes; it also contains strings of common constants. For example:
string usage example 1


>>> import string
>>> string.whitespace
'\011\012\013\014\015 '
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

Although these constants can be written out by hand, the string version more or less ensures that the constants will be correct for the national language and platform where the Python script is running.
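
In Python 3 the string module still provides these constants, although string.uppercase no longer exists; string.ascii_uppercase is the locale-independent counterpart. A minimal sketch in Python 3 syntax (an assumption), with an illustrative helper variable:

```python
import string

# ascii_uppercase is fixed and does not depend on the locale.
uppercase = string.ascii_uppercase

# Check that some common blank characters are all classed as whitespace.
is_all_space = all(ch in string.whitespace for ch in " \t\n")

print(repr(string.whitespace))
print(uppercase)     # ABCDEFGHIJKLMNOPQRSTUVWXYZ
print(is_all_space)  # True
```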

string also includes functions that transform strings in common ways, and these can be combined to form a number of less common transformations. For example:
string usage example 2


>>> import string
>>> s = "mary had a little lamb"
>>> string.capwords(s)
'Mary Had A Little Lamb'
>>> string.replace(s, 'little', 'ferocious')
'mary had a ferocious lamb'

There are many other transformations that are not covered here; you can find the details in the Python manual.
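
In Python 3 most of these functions are string methods, as the article anticipates for Python 1.6 and later; string.capwords() is still a module-level function. The following minimal sketch (Python 3 syntax is an assumption, and the variable names are illustrative) also shows two transformations chained together:

```python
import string

s = "mary had a little lamb"

capitalized = string.capwords(s)             # still a function in the string module
replaced = s.replace("little", "ferocious")  # now a method on the string itself

# Transformations can be chained to build less common ones.
shouted = s.replace("little", "ferocious").upper()

print(capitalized)  # Mary Had A Little Lamb
print(replaced)     # mary had a ferocious lamb
print(shouted)      # MARY HAD A FEROCIOUS LAMB
```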

You can also use string functions to report string properties, such as the length of a string or the position of a substring. For example:
string usage example 3


>>> import string
>>> s = "mary had a little lamb"
>>> string.find(s, 'had')
5
>>> string.count(s, 'a')
4
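
The equivalent queries in Python 3 are string methods and the built-in len(). A minimal sketch (Python 3 syntax is an assumption; the variable names are illustrative):

```python
s = "mary had a little lamb"

length = len(s)            # total number of characters
position = s.find("had")   # index of the first occurrence
missing = s.find("zebra")  # find() returns -1 when nothing is found
count = s.count("a")       # number of non-overlapping occurrences

print(length)    # 22
print(position)  # 5
print(missing)   # -1
print(count)     # 4
```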

Finally, string offers something that is a very Pythonic oddity. The .split() and .join() pair provides a quick way to convert between strings and lists, and you will find the pair very useful. The usage is simple:
string usage example 4


>>> import string
>>> s = "mary had a little lamb"
>>> L = string.split(s)
>>> L
['mary', 'had', 'a', 'little', 'lamb']
>>> string.join(L, "-")
'mary-had-a-little-lamb'

Of course, lists are useful for much more than feeding .join() (for example, in the familiar for ... in ... construct).
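
In Python 3, split() is a method of the string and join() is a method of the separator. A minimal sketch (Python 3 syntax is an assumption; the variable names are illustrative) that also processes the intermediate list before joining it again:

```python
s = "mary had a little lamb"

words = s.split()             # split on runs of whitespace
hyphenated = "-".join(words)  # join() is called on the separator

# The intermediate list can be processed like any other list.
reversed_words = " ".join(reversed(words))

# Splitting and re-joining on single spaces restores this string.
roundtrip = " ".join(s.split()) == s

print(words)
print(hyphenated)      # mary-had-a-little-lamb
print(reversed_words)  # lamb little a had mary
print(roundtrip)       # True
```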

Standard module: re

The re module supersedes the regex and regsub modules found in older Python code. Although regex still has a few limited advantages, they are minor, and regex is not worth using in new code. The obsolete modules may be removed from future Python releases, and version 1.6 may have an improved, interface-compatible re module. So stick with the re module for regular expressions.

Regular expressions are complex. One could write a book on the subject, and in fact, many people already have! This article tries to convey the overall shape of regular expressions so that the reader can build on it.

Regular expressions are a very concise way to describe patterns that might appear in text. Do certain characters appear? Do they appear in a particular order? Is a subpattern repeated a given number of times? Do other subpatterns rule out a match? Conceptually, this is not very different from how you would intuitively describe a pattern in natural language. The trick is to encode this description using the concise syntax of regular expressions.

When dealing with a regular expression, treat it as its own programming problem, even if it involves only one or two lines of code. These lines effectively make up a small program.

Start small. At its most basic, any regular expression involves matching a particular "character class." The simplest character class is a single character, which is just a literal in the pattern. Typically, you want to match a whole class of characters. You can indicate that something is a class by enclosing it in square brackets; inside the brackets, you can have a set of characters or a range of characters specified with a dash. You can also use a number of named character classes that will be correct for your platform and national language.
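
To illustrate the character classes described above, here is a minimal sketch in Python 3 syntax (an assumption). The patterns and variable names are chosen for illustration only; re.search() returns a match object on success and None otherwise.

```python
import re

s = "mary had a little lamb"

literal = re.search("m", s)           # a single literal character
upper_or_at = re.search("[@A-Z]", s)  # a set: at-sign or any capital letter
a_to_e = re.search("[a-e]", s)        # a range given with a dash
digit = re.search(r"\d", s)           # the named class for digits

print(literal is not None)  # True
print(upper_or_at)          # None: no at-sign or capital letter in s
print(a_to_e.start())       # 1: the 'a' in "mary"
print(digit)                # None: s contains no digits
```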

Character classes can be thought of as the "atoms" of regular expressions, and you will often want to combine those atoms into "molecules." You can do this using a combination of grouping and looping. Grouping is indicated by parentheses: any subexpression contained in parentheses is treated as an atom for the purposes of further grouping or looping. Looping is indicated by one of the following operators: "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, consider the following expression:
Sample regular expression

ABC([d-w]*\d\d?)+XYZ

For a string to match this expression, it must begin with "ABC" and end with "XYZ", but what must it have in the middle? The middle subexpression is ([d-w]*\d\d?), followed by the "one or more" operator. Therefore, the middle of the string must contain one (or two, or one thousand) things that match the subexpression in parentheses. The string "ABCXYZ" does not match because it does not have the required content in the middle.

But what is this internal subexpression? It begins with zero or more letters in the d-w range. Note that zero letters is a valid match, although it may feel awkward to describe that case with the English word "some". Next, the string must have exactly one digit, and then zero or one additional digit. (The first digit character class has no looping operator, so it must appear exactly once. The second digit character class is followed by the "?" operator.) All in all, this translates into "one or two digits".
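
To check the reading of the sample expression given above, here is a minimal sketch in Python 3 syntax (an assumption). The candidate strings are invented for this sketch, and re.fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

pattern = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

# Hypothetical strings that should match the sample expression.
candidates = ["ABC1XYZ", "ABCd12e1f37g3XYZ", "ABC1234567890XYZ"]

results = {text: pattern.fullmatch(text) is not None for text in candidates}

# A repeated group only remembers its final repetition.
last_group = pattern.fullmatch("ABCd12e1f37g3XYZ").group(1)

print(results)     # every value is True
print(last_group)  # g3
```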

There are also strings that do not match the expression; for each one, it is worth working out exactly why it fails.
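
Apart from "ABCXYZ", which is discussed above, the non-matching strings below are invented for this sketch, and each is paired with the reason it fails. This is a minimal sketch in Python 3 syntax (an assumption), again using re.fullmatch() so that the whole string has to fit the pattern.

```python
import re

pattern = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

# Hypothetical non-matching strings, each with the reason it fails.
failures = [
    ("ABCXYZ", "nothing in the middle, but at least one digit is required"),
    ("ABCdefXYZ", "letters in the middle, but no digit follows them"),
    ("ABC12dXYZ", "the trailing 'd' is not followed by a digit"),
    ("ABC12%34XYZ", "'%' is neither a d-w letter nor a digit"),
    ("abc1xyz", "the literals ABC and XYZ are case-sensitive"),
]

non_matches = [text for text, _ in failures if pattern.fullmatch(text) is None]

for text, reason in failures:
    print(text, "->", reason)

print(len(non_matches))  # 5: none of the strings match
```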

It takes some practice to get used to creating and understanding regular expressions. Once you have mastered them, however, you have great expressive power at your disposal. That said, it is often easy to jump into using a regular expression to solve a problem that could actually be solved with simpler (and faster) tools, such as string.

