Write a simple Python program to determine the language of the text
- 2020-05-05 11:28:35
- OfStack
1. Description of the problem
When processing text with Python, the input sometimes mixes several languages, such as Chinese, English and Japanese, and these cannot all be handled in the same way. In that case you first need to identify which language each piece of text is written in. The langid toolkit for Python provides this functionality and currently supports detection of 97 languages, which makes it very useful.
2. Program code
The following Python program calls the langid toolkit to detect the language of each line of a text file:
import langid  # Import the langid module


def translate(inputFile, outputFile):
    # Open the input file for reading and the output file for writing, both as UTF-8;
    # bytes in the input that cannot be decoded are ignored
    with open(inputFile, 'r', encoding='utf-8', errors='ignore') as fin, \
            open(outputFile, 'w', encoding='utf-8') as fout:
        for eachLine in fin:  # Read each line in turn
            line = eachLine.strip()  # Remove leading and trailing whitespace
            lineTuple = langid.classify(line)  # Call langid to detect the language of the line
            if lineTuple[0] == "zh":  # If the line is detected as Chinese, skip it
                continue
            fout.write(line + '\n')  # Write the non-Chinese line to the output file


if __name__ == '__main__':  # Script entry point, the equivalent of a main function
    translate("myInputFile.txt", "myOutputFile.txt")
The code above reads a text file and writes every line that is not detected as Chinese to a new file.
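The filtering step can also be separated from the file handling, which makes it easier to try out. The following is only a minimal sketch under stated assumptions: keep_non_chinese and stub_classify are hypothetical names, and stub_classify is a crude stand-in (it just counts CJK characters) so that the snippet runs without langid installed. In real use you would pass langid.classify as the classify argument instead.

```python
def is_cjk(ch):
    """Rough check: is this character in the CJK Unified Ideographs block?"""
    return '\u4e00' <= ch <= '\u9fff'


def stub_classify(text):
    """Hypothetical stand-in for langid.classify; returns (language, score)."""
    cjk = sum(1 for ch in text if is_cjk(ch))
    # Call the text Chinese when more than half of its characters are CJK.
    if text and cjk / len(text) > 0.5:
        return ("zh", 1.0)
    return ("en", 1.0)


def keep_non_chinese(lines, classify=stub_classify):
    """Return the stripped lines whose detected language is not 'zh'."""
    kept = []
    for raw in lines:
        line = raw.strip()
        lang, _score = classify(line)  # same (language, score) shape as langid
        if lang != "zh":
            kept.append(line)
    return kept


sample = ["Hello world\n", "你好世界\n", "Language detection\n"]
print(keep_non_chinese(sample))  # ['Hello world', 'Language detection']
```

Because the classifier is passed in as a parameter, the same function works unchanged with keep_non_chinese(lines, classify=langid.classify).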
3. Notes
The return value of langid.classify(line) is a two-element tuple. The first item is the code of the language the text is detected as, such as zh for Chinese, en for English, and so on. The second item is a confidence score for that prediction. Depending on the langid version and configuration, it is either a normalized probability or an unnormalized log-probability, so it should not be read as the share of the text written in that language.
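To illustrate working with that tuple, here is a small sketch that unpacks results of the same (language, score) shape and counts how many lines fall into each language. The helper name count_languages is hypothetical, and the result list is hand-written with placeholder scores rather than real langid output, so the snippet runs without langid installed.

```python
from collections import Counter


def count_languages(results):
    """Count how often each language code appears in (language, score) tuples."""
    counts = Counter()
    for lang, _score in results:  # unpack the two-element tuple
        counts[lang] += 1
    return counts


# Placeholder results in the shape returned by langid.classify;
# the scores are dummy values, not real classifier output.
results = [("en", 0.0), ("zh", 0.0), ("en", 0.0)]

counts = count_languages(results)
print(counts["en"], counts["zh"])  # 2 1
```

With langid installed, the list could instead be built as [langid.classify(line) for line in lines].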
I hope that's helpful.