Separating Chinese and English text with Python

  • 2021-01-19 22:19:41
  • OfStack

In text analysis and keyword extraction, news comments and similar texts are often a mixture of Chinese, English and other languages. Analyzing such text directly, without separating the languages first, tends to give poor results.

Two simple ways to separate Chinese and English text are summarized below:

1. Very short text: filter by character code (ASCII / Latin-1 range).


# Mixed sample: an English sentence followed by its Chinese translation
s = "China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday. 集团总裁朱立南周二表示，中国联想控股将分拆旗下多个业务部门在股市上市。"
# Keep only characters with code points below 256 (ASCII plus Latin-1); Chinese characters and full-width punctuation are dropped
result = "".join(i for i in s if ord(i) < 256)
print(result)

out:
China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday.

2. Unicode range matching with regular expressions.


import re

# Mixed sample: an English sentence followed by its Chinese translation
s = "China's Legend Holdings will split its several business arms to go public on stock markets, the group's president Zhu Linan said on Tuesday. 集团总裁朱立南周二表示，中国联想控股将分拆旗下多个业务部门在股市上市。"
# Match lowercase ASCII letters (\u0061-\u007a), the comma, and the space (\u0020)
uncn = re.compile(r'[\u0061-\u007a,\u0020]')
# Lowercase the text first so that capital letters are kept as well
en = "".join(uncn.findall(s.lower()))
print(en)

out:
chinas legend holdings will split its several business arms to go public on stock markets, the groups president zhu linan said on tuesday

The Chinese character range is \u4e00-\u9fa5, so the corresponding negated class [^\u4e00-\u9fa5] matches non-Chinese characters.
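
As a minimal sketch of how this range can be used, the snippet below pulls the Chinese characters out of a mixed string and also strips them from it. The sample string and the names cn_pattern and non_cn_pattern are illustrative assumptions, not part of the original examples. Note that \u4e00-\u9fa5 covers the commonly used CJK ideographs only; full-width punctuation and the rarer extension blocks fall outside it.

```python
import re

# Illustrative mixed sample (an assumption, not taken from the examples above)
text = "Python 很好用, regex 也是"

cn_pattern = re.compile(r'[\u4e00-\u9fa5]')       # one common Chinese character
non_cn_pattern = re.compile(r'[^\u4e00-\u9fa5]')  # any other character

# Collect only the Chinese characters
chinese_only = "".join(cn_pattern.findall(text))
# Remove the Chinese characters and keep everything else
without_chinese = cn_pattern.sub("", text)
# The negated class gives the same result as removing the Chinese characters
non_chinese = "".join(non_cn_pattern.findall(text))

print(chinese_only)
print(without_chinese)
```

Removing the Chinese characters leaves the surrounding spaces and punctuation in place, so the remainder may need further cleanup depending on the analysis that follows.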

When matching English, include the space character (\u0020) in the character class; otherwise the extracted words run together with no spaces between them.
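
The effect of the space can be seen in a small comparison. This is only a sketch: the sample string and the pattern names no_space and with_space are assumptions made for illustration.

```python
import re

# Illustrative mixed sample (an assumption, not taken from the examples above)
sample = "Zhu Linan 表示 go public"
lowered = sample.lower()

no_space = re.compile(r'[a-z]')     # letters only
with_space = re.compile(r'[a-z ]')  # letters plus the space (\u0020)

# Without the space in the class, all words are glued together
glued = "".join(no_space.findall(lowered))
# With the space, word boundaries survive
spaced = "".join(with_space.findall(lowered))
# Collapse the double space left behind where the Chinese text was removed
tidy = " ".join(spaced.split())

print(glued)
print(tidy)
```

The final split-and-join step is optional; it only normalizes the runs of spaces that remain where Chinese text used to be.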

