Detecting rare Chinese characters with Python

  • 2020-05-12 02:52:47
  • OfStack

The problem

My first thought was to use Python's regular expressions to match illegal characters and then flag the offending records. However, the ideal is always full and the reality is cruel: in the process of implementing it, I discovered how little I knew about character encodings and Python's internal string representation. I stepped into many pits along the way, and although a few points remain fuzzy, I now have a clear understanding of the whole. I am taking notes here to avoid falling into the same holes later.

The test environment below is the Python 2.7.8 installation that ships with ArcGIS 10.3; there is no guarantee that other Python environments behave the same.

Python regular expressions

Regular expression functionality in Python is provided by the built-in re library, of which three functions are used here. re.compile() produces a reusable compiled pattern, while match() and search() return the match result. The difference between the two: match() matches only at the specified position, whereas search() scans forward from the specified position until a matching substring is found. In the code below, match_result attempts a match starting at the first character 'f' and is None because the match fails; search_result scans forward from 'f' until it finds the first matching character 'a', which group() then returns.


import re

pattern = re.compile('[abc]')

# match() anchors at the start: 'f' is not in [abc], so this is None
match_result = pattern.match('fabc')
if match_result:
    print match_result.group()

# search() scans forward and finds the 'a' at index 1
search_result = pattern.search('fabc')
if search_result:
    print search_result.group()

The example above compiles a pattern once and then matches with it. We could instead call re.match(pattern, string) directly to achieve the same result. However, the direct form is less flexible than compiling first and matching afterwards. First, the regular expression cannot be reused: matching a large amount of data against one pattern means it has to be compiled internally every time, which costs performance. Second, re.match() is not as capable as pattern.match(); the latter accepts a pos argument specifying where the match should start.
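That last difference can be seen in a small sketch (runnable in Python 2 or 3): re.match() always anchors at the start of the string, while a compiled pattern's match() accepts a pos argument.

```python
import re

pattern = re.compile('[abc]')

# re.match() anchors at the start of the string, so this fails on 'fabc'.
direct = re.match('[abc]', 'fabc')   # None: 'f' is not in [abc]

# pattern.match() takes an optional pos, so matching can start at index 1.
shifted = pattern.match('fabc', 1)   # matches 'a'
```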

The encoding problem

Once the basics of Python regular expressions are understood, all that remains is to find a suitable expression to match rare and illegal characters. Illegal characters are easy to match with the following pattern:


pattern = re.compile(r'[~!@#$%^&* ]')

Matching rare characters, however, really stumped me. First there is the question of definition: what counts as a rare character? After consulting with the project manager, it was stipulated that any character outside GB2312 counts as rare. The next question: how do I match GB2312 characters?

Upon inquiry, the byte range of GB2312 is [\xA1-\xF7][\xA1-\xFE], within which the Chinese character area is [\xB0-\xF7][\xA1-\xFE]. Therefore, the expression with the rare-character match added is:


pattern = re.compile(r'[~!@#$%^&* ]|[^\xA1-\xF7][^\xA1-\xFE]')

The problem seemed properly solved, but I was still too simple, too naive. Since the strings to check are all read from a layer file, arcpy helpfully decodes everything it reads into unicode. I therefore needed to find the encoding ranges of the GB2312 character set within Unicode. The reality, however, is that GB2312 characters are not distributed contiguously in Unicode, so a regular expression describing those ranges would be hopelessly complex. The idea of matching rare characters with a regular expression had hit a dead end.

The solution

Since the supplied string is in unicode, could I convert it to GB2312 and then match? Not reliably. The Unicode character set is much larger than the GB2312 character set, so the conversion GB2312 => unicode always succeeds, whereas the reverse, unicode => GB2312, cannot be guaranteed.
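The asymmetry is easy to demonstrate (a sketch; the snowman character U+2603 is just an arbitrary example of a character outside GB2312):

```python
# GB2312 -> unicode always succeeds: every valid GB2312 byte pair decodes.
decoded = b'\xd6\xd0\xce\xc4'.decode('gb2312')  # the two characters u'\u4e2d\u6587'

# unicode -> GB2312 can fail: U+2603 (a snowman) has no GB2312 code point.
try:
    u'\u2603'.encode('gb2312')
    encodable = True
except UnicodeEncodeError:
    encodable = False
```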

This suddenly suggested another line of thought: if the unicode => GB2312 conversion of a string fails, doesn't that precisely mean the string contains characters outside GB2312? So I use unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to recognize a rare character.

The final code is as follows:


import re

def is_rare_name(string):
    # flag illegal punctuation and spaces first
    pattern = re.compile(u"[~!@#$%^&* ]")
    match = pattern.search(string)
    if match:
        return True

    # a failed GB2312 conversion means the string contains a rare character
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True

    return False
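A quick sanity check of the function (the sample inputs are mine; in Python 3 every str is already unicode, while under Python 2 the argument should be a unicode string, as arcpy supplies):

```python
import re

def is_rare_name(string):
    # flag illegal punctuation and spaces first
    pattern = re.compile(u"[~!@#$%^&* ]")
    if pattern.search(string):
        return True
    # a failed GB2312 conversion means the string contains a rare character
    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        return True
    return False

plain = is_rare_name(u'Zhangsan')       # False: ASCII encodes to GB2312
common = is_rare_name(u'\u4e2d\u6587')  # False: common characters are in GB2312
rare = is_rare_name(u'\u2603')          # True: not representable in GB2312
illegal = is_rare_name(u'bad@name')     # True: contains an illegal character
```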

Conclusion

That is the whole content of this article. I hope it brings some help to your study or work; if you have any questions, feel free to leave a message.

