Write a simple Python program to determine the language of the text

2020-05-05 11:28:35
OfStack

1. Description of the problem

When processing text in Python, the input sometimes mixes several languages, such as Chinese, English, and Japanese, and the same processing steps cannot be applied to all of them. In that case, you first need to identify which language each piece of text is written in. The langid toolkit for Python provides this functionality and currently supports the detection of 97 languages, which makes it very useful.
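As a rough illustration of the idea, the sketch below groups lines of mixed-language text by the language code detected for each line. It assumes langid is installed (pip install langid); the helper name group_by_language and its classify parameter are hypothetical names introduced here, not part of langid. The classify parameter only exists so that a stand-in classifier can be passed in when trying the helper out.

```python
from collections import defaultdict


def group_by_language(lines, classify=None):
    """Group text lines by the language code returned for each line.

    `classify` is a hypothetical hook: any callable that takes a string and
    returns a (language_code, score) pair. It defaults to langid.classify.
    """
    if classify is None:
        # Imported lazily so the helper can also be used with a stand-in classifier.
        import langid  # third-party package: pip install langid
        classify = langid.classify

    groups = defaultdict(list)
    for line in lines:
        text = line.strip()
        if not text:                    # Skip blank lines; there is nothing to classify
            continue
        lang, _score = classify(text)   # Keep only the language code here
        groups[lang].append(text)
    return dict(groups)
```

Calling group_by_language on a list of English, Chinese, and Japanese sentences should return a dictionary keyed by codes such as en, zh, and ja, although short or ambiguous lines may be assigned to an unexpected language.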

2. Program code

The following Python code calls the langid toolkit to detect the language of each line of text:


import langid                                # Import the langid module (this listing targets Python 3)

def translate(inputFile, outputFile):
  # Open both files as UTF-8 text; bytes in the input that cannot be decoded are ignored
  with open(inputFile, 'r', encoding='utf-8', errors='ignore') as fin, \
       open(outputFile, 'w', encoding='utf-8') as fout:
    for eachLine in fin:                     # Read each line in turn
      line = eachLine.strip()                # Remove leading and trailing whitespace
      lineTuple = langid.classify(line)      # Call langid to detect the language of the line
      if lineTuple[0] == "zh":               # If the line is detected as Chinese, skip it
        continue

      fout.write(line + '\n')                # Write non-Chinese lines to the output file

if __name__ == '__main__':                   # Entry point when run as a script
  translate("myInputFile.txt", "myOutputFile.txt")

The code above processes a text file and writes the lines that are not in Chinese to a new file.

3. Notes

On lines 9 and 10 of the code, langid.classify(line) returns a two-element tuple. The first element is the code of the detected language, such as zh for Chinese or en for English. The second element is a confidence score for that prediction, not the proportion of the text written in that language. In recent versions of langid the default score is an unnormalized log-probability, so it is typically a negative number.
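If a score between 0 and 1 is easier to work with, the langid documentation describes building a LanguageIdentifier with norm_probs=True. The sketch below wraps that in a small factory and adds a check on the returned tuple. The names make_normalized_classifier and is_confident, and the 0.9 threshold, are assumptions made for illustration and do not come from langid.

```python
def make_normalized_classifier():
    """Return a classify function whose score is a probability in [0, 1].

    Assumes langid is installed; the model is loaded once and reused.
    """
    from langid.langid import LanguageIdentifier, model  # third-party package
    identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
    return identifier.classify


def is_confident(result, lang, threshold=0.9):
    """Check a (language_code, probability) tuple against an expected language.

    The 0.9 threshold is an arbitrary illustrative value, not a langid default.
    """
    code, prob = result                 # First element: language code; second: score
    return code == lang and prob >= threshold
```

With these helpers, a line could be skipped only when is_confident(classify(line), "zh") is true, where classify is the function returned by make_normalized_classifier(). This is one possible design: it keeps uncertain lines in the output rather than dropping them on a weak guess.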

I hope that's helpful.