An Example of Chinese Sentence Segmentation in Python

  • 2021-07-18 08:20:51
  • OfStack

Splitting English text into sentences is relatively simple: in most cases it is enough to split on the terminator ".". Splitting Chinese text looks just as simple, but the implementation runs into many problems, especially with social media data, where the text format is often irregular.
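
To make the difference concrete, here is a minimal sketch. The sample strings are made up for illustration and do not come from the article's data. Chinese text ends sentences with full-width marks such as 。, ？ and ！, so splitting on the ASCII period finds nothing to split on.

```python
# Hypothetical sample strings, chosen only for illustration.
english = "It is sunny today. Let's go to the park."
chinese = "今天天气很好。我们去公园吧！"

# Splitting on the ASCII period works (roughly) for English...
english_parts = [s.strip() for s in english.split(".") if s.strip()]

# ...but the Chinese string contains no ASCII period, so it comes back whole.
chinese_parts = [s.strip() for s in chinese.split(".") if s.strip()]

print(english_parts)
print(chinese_parts)
```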

The following code splits a document of short texts into sentences, processing it one paragraph (one line) at a time.


def cut_sent(infile, outfile):
  # Sentence terminators used in this article; modify as needed.
  cutLineFlag = ["？", "！", "。", "…"]
  sentenceList = []
  with open(infile, "r", encoding="UTF-8") as file:
    oneSentence = ""
    for line in file:
      words = line.strip()
      # A new line started: flush any unterminated text left from the previous line.
      if len(oneSentence) != 0:
        sentenceList.append(oneSentence.strip() + "\n")
        oneSentence = ""
      for word in words:  # iterates character by character
        oneSentence = oneSentence + word
        if word in cutLineFlag:
          # Skip very short fragments, e.g. the second half of a "……" ellipsis.
          if len(oneSentence) > 4:
            sentenceList.append(oneSentence.strip() + "\n")
          oneSentence = ""
    # Flush whatever remains after the last line.
    if len(oneSentence) != 0:
      sentenceList.append(oneSentence.strip() + "\n")
  with open(outfile, "w", encoding="UTF-8") as resultFile:
    print(len(sentenceList))  # number of sentences found
    resultFile.writelines(sentenceList)

If a paragraph ends with a line break but has no terminator, its last sentence may be lost, so the function includes the following check, which runs at the start of each new line:


   if len(oneSentence) != 0:
     sentenceList.append(oneSentence.strip() + "\n")
     oneSentence = ""

This check gives better results.
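
The effect of the check can be seen in isolation in the minimal in-memory sketch below. It is not the article's function: `split_lines`, `min_len` and the sample strings are assumptions made for illustration. It also flushes at the end of each line rather than at the start of the next one, which has the same effect and covers the final line too.

```python
TERMINATORS = {"？", "！", "。", "…"}  # full-width Chinese sentence terminators


def split_lines(lines, min_len=4):
    """Split an iterable of text lines into sentences (in-memory sketch)."""
    sentences = []
    for line in lines:
        current = ""
        for ch in line.strip():
            current += ch
            if ch in TERMINATORS:
                # Keep only fragments longer than min_len characters.
                if len(current) > min_len:
                    sentences.append(current.strip())
                current = ""
        # Flush: the line ended without a terminator, so keep the remainder.
        if current.strip():
            sentences.append(current.strip())
    return sentences


# Hypothetical sample: the first line ends without a terminator.
sample = [
    "今天把屏蔽的代购都放出来了。快过年了，热闹一下",
    "自己动手有风险，操作需谨慎！",
]
result = split_lines(sample)
for sentence in result:
    print(sentence)
```

Without the flush, the unterminated fragment at the end of the first line would either be dropped or glued onto the start of the next line.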

Text to process (shown here in English translation):


Since WeChat introduced the three-day visibility feature, my Moments feed has been getting quieter and quieter, and there is nothing to see when I open it. Today I unblocked, one by one, all the purchasing agents I had blocked; it's almost New Year, so let's liven things up
A woman wanted to DIY a fix for the gap in her front teeth and made a mold at home, but it ended badly: because she used plaster she could not get it out, so she came to our hospital for help, and it took the doctor a tremendous effort to get it done… DIY has its risks, so be careful!
My daughter's classmate keeps a parrot, two pearl birds, a cat, and two hamsters. The parrot is the boss: the pearl birds are afraid of it, and the cat ranks behind and is afraid of the parrot too. The hamsters often slip out of their cages, and the cat is said to catch them and put them back in their cages.

Text after processing:


Since WeChat introduced the three-day visibility feature, my Moments feed has been getting quieter and quieter, and there is nothing to see when I open it.
Today I unblocked, one by one, all the purchasing agents I had blocked; it's almost New Year, so let's liven things up
A woman wanted to DIY a fix for the gap in her front teeth and made a mold at home, but it ended badly: because she used plaster she could not get it out, so she came to our hospital for help, and it took the doctor a tremendous effort to get it done…
DIY has its risks, so be careful!
My daughter's classmate keeps a parrot, two pearl birds, a cat, and two hamsters.
The parrot is the boss: the pearl birds are afraid of it, and the cat ranks behind and is afraid of the parrot too.
The hamsters often slip out of their cages, and the cat is said to catch them and put them back in their cages.

The text is split into sentences cleanly, and no content is lost.
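
For comparison, a similar split can be sketched with a regular expression instead of a character loop. This is a different technique from the one used in this article, and the names `SENTENCE_RE` and `split_with_regex` as well as the sample string are assumptions made for illustration. The pattern treats a run of terminators (such as the two-character ellipsis "……") as a single sentence ending.

```python
import re

# One or more non-terminator characters, followed by any run of terminators.
SENTENCE_RE = re.compile(r"[^？！。…]+[？！。…]*")


def split_with_regex(text):
    """Return the sentences in text, keeping terminators attached."""
    return [m.strip() for m in SENTENCE_RE.findall(text) if m.strip()]


# Hypothetical sample string, loosely modeled on the example above.
sample = "医生费了九牛二虎之力才搞定……自己动手有风险，操作需谨慎！"
parts = split_with_regex(sample)
for part in parts:
    print(part)
```

Because the pattern also matches across line breaks, it should be applied line by line if a line break is meant to end a sentence as well.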

