Separating Chinese and English text with Python
- 2021-01-19 22:19:41
- OfStack
In text analysis and keyword extraction, source material such as news comments often mixes Chinese, English, and other languages. Analyzing such text directly, without separating the languages first, often gives unsatisfactory results.
The following summarizes two ways to separate Chinese and English text:
1. For very short text: filter by character code (ASCII recognition).
s = "China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday.该集团总裁朱利安周二表示，中国联想控股将分拆其多个业务部门在股市上市。"
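# Keep characters whose code point is below 256 (ASCII plus Latin-1); Chinese characters are dropped.
result = "".join(i for i in s if ord(i) < 256)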
print(result)
out:
China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday.
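The same idea can return both halves of the text instead of discarding the non-English part. The following is a minimal sketch, not the article's own code: the function name split_by_ascii and the sample string are made up for illustration, and it uses the strict ASCII cutoff of 128 rather than 256, which is an assumption about what "ASCII recognition" is meant to cover.

```python
def split_by_ascii(text):
    """Split text into (ascii_part, non_ascii_part) by code point.

    Hypothetical helper for illustration only.
    """
    ascii_chars = []
    other_chars = []
    for ch in text:
        # ASCII proper ends at code point 127; a cutoff of 256 would also keep Latin-1.
        if ord(ch) < 128:
            ascii_chars.append(ch)
        else:
            other_chars.append(ch)
    return "".join(ascii_chars), "".join(other_chars)


sample = "Hello, world! 你好，世界！"
english, rest = split_by_ascii(sample)
print(repr(english))
print(repr(rest))
```

Full-width punctuation such as "，" lies outside the ASCII range, so it ends up with the Chinese part. This approach cannot tell Chinese apart from other non-ASCII scripts, which is why it suits only short, simple inputs.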
2. Unicode code point recognition.
import re
s = "China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday.该集团总裁朱利安周二表示，中国联想控股将分拆其多个业务部门在股市上市。"
# Match lowercase a-z, the comma, and the space (\u0020); everything else is dropped.
uncn = re.compile(r'[\u0061-\u007a,\u0020]')
en = "".join(uncn.findall(s.lower()))
print(en)
out:
chinas legend holdings will split its several business arms to go public on stock markets, the groups president zhu linan said on tuesday
The Chinese code range is \u4e00-\u9fa5, and the corresponding negated class [^\u4e00-\u9fa5] matches non-Chinese characters.
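As a rough sketch of how the \u4e00-\u9fa5 range can be used in both directions, the snippet below pulls out the Chinese runs and also strips them to leave the rest. The pattern names and the sample string are invented for this illustration. Note that this range covers common Chinese characters only; full-width punctuation falls outside it.

```python
import re

# Commonly quoted range for Chinese characters: U+4E00 to U+9FA5.
CHINESE = re.compile(r'[\u4e00-\u9fa5]+')
# The negated class matches runs of everything that is not in that range.
NON_CHINESE = re.compile(r'[^\u4e00-\u9fa5]+')

sample = "Legend Holdings 联想控股 will go public 上市"

# Each consecutive run of Chinese characters becomes one list item.
chinese_runs = CHINESE.findall(sample)

# Deleting the non-Chinese runs leaves only the Chinese characters.
chinese_only = NON_CHINESE.sub("", sample)

# Deleting the Chinese runs leaves the rest; split/join collapses leftover spaces.
english_only = " ".join(CHINESE.sub("", sample).split())

print(chinese_runs)
print(chinese_only)
print(english_only)
```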
When matching English, include the space character \u0020 in the character class; otherwise the words run together with no spaces between them.
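The effect of the space is easy to see by running the same input through two patterns, one with \u0020 and one without. This is only an illustrative sketch; the variable names and the short sample string are not from the original article.

```python
import re

sample = "Zhu Linan said 周二"

# Lowercase letters only: spaces are not matched, so they are lost.
without_space = re.compile(r'[\u0061-\u007a]')
# Lowercase letters plus the space character \u0020.
with_space = re.compile(r'[\u0061-\u007a\u0020]')

lowered = sample.lower()
joined_no_space = "".join(without_space.findall(lowered))
joined_with_space = "".join(with_space.findall(lowered))

print(repr(joined_no_space))
print(repr(joined_with_space))
```

The second result keeps a trailing space where the Chinese text was removed, so calling strip() on it may be worthwhile before further processing.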