Splitting Chinese and English text in Python with re regular expressions

2021-10-11 19:12:02
OfStack

Without further ado, let's go straight to the code.


import re
s = 'alibaba阿里巴巴'  # String to be segmented
# Upper- and lower-case English letters. No '|' inside the brackets:
# in a character class it would match a literal pipe character.
en_letter = '[\u0041-\u005a\u0061-\u007a]+'
zh_char = '[\u4e00-\u9fa5]+'  # Chinese characters

print(re.findall(zh_char, s) + re.findall(en_letter, s))

# Output: ['阿里巴巴', 'alibaba']
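Note that concatenating two findall results lists every Chinese run before every English run, regardless of where each one appears in the string. If the original order matters, one option is a single pattern with two alternatives. This is only a minimal sketch; the helper name `runs_in_order` and the sample string `mixed` are made up for illustration and are not part of the original code:

```python
import re

# Hypothetical helper: one pattern with two alternatives, so the matches
# come back in the order in which they occur in the text.
RUN = re.compile(r'[\u4e00-\u9fa5]+|[A-Za-z]+')

def runs_in_order(text):
    return RUN.findall(text)

mixed = 'alibaba阿里巴巴taobao淘宝'  # illustrative input, not from the article
print(runs_in_order(mixed))
# Output: ['alibaba', '阿里巴巴', 'taobao', '淘宝']
```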

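Range	Description
\u4e00-\u9fa5	Unicode range of Chinese characters (basic CJK Unified Ideographs)
\u0030-\u0039	Unicode range of digits (0-9)
\u0041-\u005a	Unicode range of uppercase letters (A-Z)
\u0061-\u007a	Unicode range of lowercase letters (a-z)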
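The four ranges in the table can also be combined to label each run by its kind. The following is a rough sketch rather than a definitive tokenizer; the group names (`zh`, `digit`, `upper`, `lower`) and the function `tag_runs` are assumptions introduced here for illustration:

```python
import re

# Hypothetical tagger built from the ranges in the table above.
# Each alternative is a named group, so we can tell which one matched.
TOKEN = re.compile(
    r'(?P<zh>[\u4e00-\u9fa5]+)'
    r'|(?P<digit>[\u0030-\u0039]+)'
    r'|(?P<upper>[\u0041-\u005a]+)'
    r'|(?P<lower>[\u0061-\u007a]+)'
)

def tag_runs(text):
    # m.lastgroup is the name of the named group that matched
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tag_runs('ES30阿里baba'))
# Output: [('upper', 'ES'), ('digit', '30'), ('zh', '阿里'), ('lower', 'baba')]
```

Characters outside all four ranges (spaces, punctuation, other scripts) are simply skipped.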

Supplement: segmenting a mixed Chinese-English string in Python (Chinese is split character by character, English word by word, and digits are split on spaces and other special symbols)

Sentence to be segmented:

s = "12、China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday.该集团总裁朱利安周二表示，haha中国联想控股将分拆其多个业务部门在股市上市。"

Segmentation result:

['12', 'china', 's', 'legend', 'holdings', 'will', 'split', 'its', 'several', 'business', 'arms', 'to', 'go', 'public', 'on', 'stock', 'markets', 'the', 'group', 's', 'president', 'zhu', 'linan', 'said', 'on', 'tuesday', '该', '集', '团', '总', '裁', '朱', '利', '安', '周', '二', '表', '示', 'haha', '中', '国', '联', '想', '控', '股', '将', '分', '拆', '其', '多', '个', '业', '务', '部', '门', '在', '股', '市', '上', '市']

Code:


import re

def get_word_list(s1):
  # Split a sentence into tokens: Chinese by character, English by word, digits by run
  # First split on runs of non-word characters (anything except letters, digits and '_').
  # Use '+' rather than '*': a pattern that can match the empty string makes
  # split() break between every character on Python 3.7+.
  regEx = re.compile(r'\W+')
  # [\u4e00-\u9fa5] is the Chinese range; the capturing group keeps each
  # Chinese character as its own item in the split result
  res = re.compile(r"([\u4e00-\u9fa5])")
  p1 = regEx.split(s1.lower())
  str1_list = []
  for chunk in p1:
    # split() always returns a list (never None), so no None check is needed
    str1_list.extend(res.split(chunk))
  list_word1 = [w for w in str1_list if len(w.strip()) > 0]  # Remove empty strings
  return list_word1

if __name__ == '__main__':
  s = "12、China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday.该集团总裁朱利安周二表示，haha中国联想控股将分拆其多个业务部门在股市上市。"
  list_word1 = get_word_list(s)
  print(list_word1)
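As an alternative to splitting twice, the same kind of token list can be collected in one pass with `re.findall`. This is a sketch under stated assumptions, not the article's method: the name `get_word_list_findall` is hypothetical, and the pattern only recognises ASCII letters, digits and the \u4e00-\u9fa5 range, so underscores and other scripts are dropped rather than kept as they would be by a `\W`-based split.

```python
import re

# Hypothetical one-pass variant: a token is either a single Chinese
# character or a run of lowercase ASCII letters and digits.
TOKEN = re.compile(r'[\u4e00-\u9fa5]|[a-z0-9]+')

def get_word_list_findall(text):
    # Lowercase first so that [a-z] also covers uppercase input
    return TOKEN.findall(text.lower())

print(get_word_list_findall("ES30 president 朱利安 said, haha中国!"))
# Output: ['es30', 'president', '朱', '利', '安', 'said', 'haha', '中', '国']
```

Because findall only returns what matched, there are no empty strings to filter out afterwards.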