Splitting Chinese and English text in Python with re regular expressions
- 2021-10-11 19:12:02
- OfStack
Without further ado, let's go straight to the code.
import re
s = 'alibaba阿里巴巴'  # String to be segmented
en_letter = '[\u0041-\u005a\u0061-\u007a]+'  # Upper- and lower-case English letters (no '|' inside the class: there it would match a literal pipe)
zh_char = '[\u4e00-\u9fa5]+'  # Chinese characters
print(re.findall(zh_char, s) + re.findall(en_letter, s))
# Output: ['阿里巴巴', 'alibaba']
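The snippet above collects all Chinese runs first and all English runs second, so the original order of the text is lost. If the order matters, one option is a single pattern with alternation. The following is only a minimal sketch; `split_runs` is a hypothetical helper name and is not part of the original code.

```python
import re

# Either a run of Chinese characters or a run of ASCII letters
# (A-Z and a-z are the same code points as \u0041-\u005a and \u0061-\u007a).
RUN = re.compile(r'[\u4e00-\u9fa5]+|[A-Za-z]+')

def split_runs(text):
    """Return Chinese and English runs in the order they appear in text."""
    return RUN.findall(text)

print(split_runs('阿里alibaba巴巴'))
# ['阿里', 'alibaba', '巴巴']
```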
Range | Description |
---|---|
\u4e00-\u9fa5 | Unicode range of Chinese characters |
\u0030-\u0039 | Unicode range of digits |
\u0041-\u005a | Unicode range of uppercase letters |
\u0061-\u007a | Unicode range of lowercase letters |
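The four ranges in the table can be combined into one pattern that also reports which range each run came from. This is only a sketch: the group names (`zh`, `digit`, `upper`, `lower`) and the helper `label_runs` are made up for illustration.

```python
import re

# One named group per row of the table above.
TOKEN = re.compile(
    r'(?P<zh>[\u4e00-\u9fa5]+)'
    r'|(?P<digit>[\u0030-\u0039]+)'
    r'|(?P<upper>[\u0041-\u005a]+)'
    r'|(?P<lower>[\u0061-\u007a]+)'
)

def label_runs(text):
    """Return (range_name, run) pairs in order of appearance."""
    # m.lastgroup is the name of the group that matched.
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(label_runs('Python3正则re'))
# [('upper', 'P'), ('lower', 'ython'), ('digit', '3'), ('zh', '正则'), ('lower', 're')]
```

Note that \u4e00-\u9fa5 covers the commonly used Chinese characters but not every CJK ideograph in Unicode; characters in the extension blocks fall outside it.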
Supplement: segmenting a mixed Chinese-English string in Python (Chinese is split by character, English by word, and numbers on spaces and other special symbols)
Sentence to be segmented:
s = "12, China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday. 该集团总裁朱利安周二表示，haha中国联想控股将分拆其多个业务部门在股市上市。"
Segmentation results:
['12', 'china', 's', 'legend', 'holdings', 'will', 'split', 'its', 'several', 'business', 'arms', 'to', 'go', 'public', 'on', 'stock', 'markets', 'the', 'group', 's', 'president', 'zhu', 'linan', 'said', 'on', 'tuesday', '该', '集', '团', '总', '裁', '朱', '利', '安', '周', '二', '表', '示', 'haha', '中', '国', '联', '想', '控', '股', '将', '分', '拆', '其', '多', '个', '业', '务', '部', '门', '在', '股', '市', '上', '市']
Code:
import re

def get_word_list(s1):
    # Split a sentence into tokens: Chinese by character, English by word, numbers on spaces and other symbols
    # Split on runs of non-word characters. Use '+' rather than '*': a pattern that can match the
    # empty string makes re.split() break between every character on Python 3.7+.
    regEx = re.compile(r'\W+')
    res = re.compile(r"([\u4e00-\u9fa5])")  # [\u4e00-\u9fa5] is the Chinese range; the capturing group keeps each character in the split result
    p1 = regEx.split(s1.lower())
    str1_list = []
    for chunk in p1:
        # split() always returns a list (never None), so every chunk is handled the same way
        str1_list.extend(res.split(chunk))
    list_word1 = [w for w in str1_list if len(w.strip()) > 0]  # Remove empty strings
    return list_word1
if __name__ == '__main__':
    s = "12, China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday. 该集团总裁朱利安周二表示，haha中国联想控股将分拆其多个业务部门在股市上市。"
    list_word1 = get_word_list(s)
    print(list_word1)
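The same rules can also be written as a single re.findall call instead of two rounds of splitting. This is a minimal sketch, assuming the same [\u4e00-\u9fa5] range; `tokenize` is a hypothetical name and not part of the code above.

```python
import re

# One Chinese character, or a run of word characters outside the Chinese range.
# [^\W...] reads as "a word character, except ...".
TOKEN = re.compile(r'[\u4e00-\u9fa5]|[^\W\u4e00-\u9fa5]+')

def tokenize(text):
    """Lower-case text, then split Chinese by character and everything else by word."""
    return TOKEN.findall(text.lower())

print(tokenize("haha中国 Legend's 12"))
# ['haha', '中', '国', 'legend', 's', '12']
```

Because findall only returns what matched, there are no empty strings to filter out afterwards.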