Extracting numbers from a string in Python 3

  • 2020-05-19 05:13:28
  • OfStack

I came across an interesting blog post about the error ValueError: invalid literal for int() with base 10. The blogger's solution was to use the re.sub method:


 import re
 totalCount = '100abc'
 totalCount = re.sub(r"\D", "", totalCount)  # raw string, so \D reaches the regex engine unchanged

However, the post did not explain how this works, so I looked it up and took some notes:

The official Python 3.5.2 documentation defines the sub function of the re module as follows:


re.sub(pattern, repl, string, count=0, flags=0)

Every non-overlapping substring of string that matches the regular expression pattern is replaced with repl. If the pattern is not found, string is returned unchanged. repl can be either a string or a function.

From this we can work out what the statement above does: it finds every non-digit character in the string '100abc' (\D in a regular expression matches any character that is not a digit), replaces it with the empty string "", and returns a string containing only the digits.
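To make the two forms of repl concrete, here is a small sketch. The sample sentence and the helper name double_number are made up for illustration and are not from the original post:

```python
import re

def double_number(match):
    # match.group(0) is the run of digits that was matched
    return str(int(match.group(0)) * 2)

text = "3 apples and 12 pears"

# repl as a string: every run of digits becomes "#"
masked = re.sub(r"\d+", "#", text)

# repl as a function: it is called once per match and returns the replacement
doubled = re.sub(r"\d+", double_number, text)

# count=1 replaces only the first match
first_only = re.sub(r"\d+", "#", text, count=1)

print(masked)      # # apples and # pears
print(doubled)     # 6 apples and 24 pears
print(first_only)  # # apples and 12 pears
```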


>>> import re
>>> totalCount = '100abc'
>>> totalCount = re.sub(r"\D", "", totalCount)
>>> print(totalCount)
100
>>> type(totalCount)
<class 'str'>
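Since the result is still a str, it has to be passed to int() afterwards, and int('') raises the same ValueError if the input contained no digits at all. A minimal sketch of a guard for that case follows. The function name digits_to_int and its default parameter are my own, not part of the original post:

```python
import re

def digits_to_int(text, default=None):
    """Strip every non-digit character and convert what remains to int."""
    digits = re.sub(r"\D", "", text)
    if not digits:
        # Nothing left to convert; int('') would raise ValueError
        return default
    return int(digits)

print(digits_to_int("100abc"))  # 100
print(digits_to_int("abc"))     # None
print(digits_to_int("a1b2"))    # 12
```

Note that this approach glues all digits together, as the last call shows, and it discards signs and decimal points, so it only suits strings that contain a single non-negative integer.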

That covers re.sub, but what I actually had in mind was a similar problem I ran into while crawling the questions I follow on Zhihu:


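 answer_num_get = soup.find('h3', {'id': 'zh-question-answer-num'})  # element text looks like "32 answers"
 if answer_num_get is not None:
   # soup.find returns a Tag, so take its text before calling split()
   answer_num = int(answer_num_get.get_text().split()[0])
   n = answer_num // 10  # kept inside the if, so answer_num is always defined here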

The int() call works because split()[0] extracts the number 32 from the element's text, "32 answers".

Definition of split(): str.split(sep=None, maxsplit=-1)


>>> a = "32 answers "
>>> b = a.split()[0]
>>> print(b)
32
>>> type(b)
<class 'str'>
>>> c = '1,2,3'
>>> c.split(',')
['1', '2', '3']
>>> c.split(',')[0]
'1'
>>> c.split(',')[1]
'2'

As the examples show, the first parameter of split() is the separator; if it is omitted, the string is split on runs of whitespace by default.

The first method requires a regular expression, and the second requires a delimiter (I suspect that is why there is a space after the answer count on the original page). Both methods are somewhat limited, and I do not know whether there is a better way to pull the numbers out of a string.
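The difference between omitting the separator and passing one explicitly is easy to miss, so here is a short sketch of it. The sample strings are made up for illustration:

```python
text = "32  answers "

# No separator: split on runs of whitespace, ignoring leading/trailing whitespace
default_parts = text.split()

# Explicit single space: every space is a boundary, so empty strings appear
space_parts = text.split(" ")

# maxsplit limits how many splits are performed
limited = "1,2,3".split(",", maxsplit=1)

print(default_parts)  # ['32', 'answers']
print(space_parts)    # ['32', '', 'answers', '']
print(limited)        # ['1', '2,3']
```

For pulling out a leading number, the argument-free form is usually the safer choice, because it also copes with tabs, newlines, and repeated spaces.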
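One possible alternative, offered as a sketch rather than a definitive answer, is to match the digits directly with re.search or re.findall instead of deleting everything else. The helper names first_int and all_ints are hypothetical, and the pattern \d+ only covers non-negative integers:

```python
import re

def first_int(text):
    """Return the first run of digits in text as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None

def all_ints(text):
    """Return every run of digits in text as a list of ints."""
    return [int(n) for n in re.findall(r"\d+", text)]

print(first_int("Number of answers: 32 answers"))  # 32
print(first_int("no digits here"))                 # None
print(all_ints("a1b22c333"))                       # [1, 22, 333]
```

Unlike re.sub(r"\D", "", ...), this keeps separate numbers separate, and unlike split() it does not depend on a delimiter appearing next to the number.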

