A brief discussion of the encoding problem when Python 2 reads Chinese file names

  • 2020-06-23 00:58:05
  • OfStack

Question:

In Python 2, file names that contain Chinese characters come out garbled if they are not decoded first.

Assume the folder under test is named test and contains five files with Chinese names:

Python Performance analysis and Optimization.pdf

Python Data analysis and Mining practice.pdf

Python Programming practice: Use design patterns, concurrency, and libraries to create high-quality programs.pdf

Smooth Python.pdf

59 Effective ways to Write quality Python code.pdf

First, we print the file names directly, without decoding them. The code is as follows:


import os
for name in os.listdir('./test'):
    print(name)

The output is garbled:


Python���ܷ������Ż�.pdf
Python���ݷ������ھ�ʵս.pdf
Python���ʵս���������ģʽ�������ͳ���ⴴ������������.pdf
������Python.pdf
�� д ������Python�����59���� Ч ����.pdf

Solution:

First, check which encoding the file names use. Here we use the chardet module, which is installed with:


pip install chardet

The chardet.detect function is then used to detect the encoding of each file name:


{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.99, 'encoding': 'GB2312'}
{'confidence': 0.73, 'encoding': 'windows-1252'}
{'confidence': 0.99, 'encoding': 'GB2312'}
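
The article shows only the detection results, not the code that produced them. A minimal sketch of what that step could look like follows. The function name detect_name_encodings and its optional detect parameter are illustrative and not from the original; chardet is imported lazily so that the function can also be exercised with a stand-in detector.

```python
# -*- coding: utf-8 -*-
import os
import sys


def detect_name_encodings(dir_path, detect=None):
    """Return one chardet-style result dict per entry in dir_path."""
    if detect is None:
        import chardet  # third-party; installed with: pip install chardet
        detect = chardet.detect
    # chardet.detect works on raw bytes. Python 2 already returns byte
    # strings for a str path; a bytes path gives the same on Python 3.
    if not isinstance(dir_path, bytes):
        dir_path = dir_path.encode(sys.getfilesystemencoding())
    return [detect(name) for name in os.listdir(dir_path)]

# Usage (assumes the ./test folder from the article exists):
# for result in detect_name_encodings('./test'):
#     print(result)
```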

GB2312 is detected with the highest confidence (0.99 for four of the five names). Below, we decode the file names as GB2312. The code is as follows:


import os
for name in os.listdir('./test'):
    r = name.decode('GB2312')  # byte string -> unicode
    print(r)

Output:

Python Performance analysis and Optimization.pdf

Python Data analysis and Mining practice.pdf

Python Programming practice: Use design patterns, concurrency, and libraries to create high-quality programs.pdf

Smooth Python.pdf

59 Effective ways to Write quality Python code.pdf

After decoding, the file names are printed correctly.
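
What the decode step does can be seen in memory, without touching the file system. In the sketch below the sample name is made up for illustration (it is not one of the five files), and the choice of UTF-8 as the mismatched encoding is an assumption about the environment that displays the bytes.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample name: two Chinese characters (the word for "data")
# written as escapes, followed by ".pdf".
text = u'\u6570\u636e.pdf'

# The kind of byte string Python 2's os.listdir returns on a GB2312 system.
raw = text.encode('gb2312')

# Interpreting those bytes under a different encoding yields replacement
# characters, which is what garbled output looks like.
garbled = raw.decode('utf-8', 'replace')

# Decoding with the encoding the bytes were written in restores the name.
decoded = raw.decode('gb2312')

print(u'\ufffd' in garbled)  # True
print(decoded == text)       # True
```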

PS: chardet's detection is more accurate for longer strings and less accurate for shorter ones, which is likely why one of the names above was detected as windows-1252 with a confidence of only 0.73.

Another problem is that the code above was tested on Windows, whereas file names on Linux are encoded as UTF-8. To be compatible with both Windows and Linux, the code needs to be modified. Here we wrap it in a function:


# -*- coding: utf-8 -*-
import os

def get_filename_from_dir(dir_path):
    """Return the names in dir_path decoded to unicode (Python 2)."""
    file_list = []
    if not os.path.exists(dir_path):
        return file_list
    for item in os.listdir(dir_path):
        basename = os.path.basename(item)
        # print(chardet.detect(basename))  # check the encoding of names containing Chinese (needs import chardet)
        # File names are encoded as GB2312 on Windows and as UTF-8 on Linux
        try:
            decode_str = basename.decode("GB2312")
        except UnicodeDecodeError:
            decode_str = basename.decode("utf-8")
        file_list.append(decode_str)
    return file_list

# Test code
r = get_filename_from_dir('./test')
for i in r:
    print(i)

Each name is first decoded as GB2312 and, if that raises an error, decoded as UTF-8, which makes the function compatible with both Windows and Linux (tested on Win7 and Ubuntu 16.04).
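
One caveat with trying GB2312 first: some UTF-8 byte sequences also happen to be valid GB2312, so for certain names the first attempt can succeed with the wrong text instead of raising an error. UTF-8 decoding is stricter, so a variant that tries UTF-8 first is sketched below. This is not the article's code; the name decode_filename, its encodings parameter, and the replacement-character fallback are all illustrative choices.

```python
# -*- coding: utf-8 -*-

def decode_filename(raw, encodings=('utf-8', 'gb2312')):
    """Decode a byte-string file name, trying each encoding in order."""
    if not isinstance(raw, bytes):
        return raw  # already text (for example a unicode object)
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # No encoding matched: keep the name readable instead of raising.
    return raw.decode(encodings[0], 'replace')
```

In get_filename_from_dir, the try/except block could then be replaced by a single call such as decode_str = decode_filename(basename).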

