Everything you didn't want to know about Unicode in Python 3

2020-04-02 14:28:06
OfStack

Regular readers know me as someone who rants a lot about Unicode in Python 3. This time is no exception. I'll tell you how painful it is to use Unicode there and why I can't shut up about it. I've been working with Python 3 for two weeks and I need to vent my frustration. There is still useful information in all this ranting, because it teaches us how to deal with Python 3. So if I'm not annoying you yet, read on.

This time the content will be different. It has nothing to do with WSGI, HTTP, or anything related to them. I've been told that I should stop complaining about the Python 3 Unicode system because I write code that other people don't often write (HTTP libraries and the like), so this time I'm going to write something else: a command-line application. I wrote a handy library called click to make such applications easier to write.

Note that I'm doing what every novice Python programmer does: writing a command-line application, the Hello World of programs. But unlike before, I wanted to make sure that the application is stable, supports Unicode on both Python 2 and Python 3, and can be unit tested. So here's how to do it.

What do we want to do

In Python 3 we as developers need to use Unicode properly. As I understand it, this means that all text data is Unicode and all non-text data is bytes. In a world where everything is that black and white, the Hello World example is pretty straightforward. So let's write some shell tools.

Here is a simple cat implemented in Python 2 form:


import sys
import shutil
 
for filename in sys.argv[1:]:
  f = sys.stdin
  if filename != '-':
    try:
      f = open(filename, 'rb')
    except IOError as err:
      print >> sys.stderr, 'cat.py: %s: %s' % (filename, err)
      continue
  with f:
    shutil.copyfileobj(f, sys.stdout)

Obviously this command isn't particularly great, since it doesn't handle any command-line options, but at least it roughly works. So that's our starting point.

Unicode on UNIX

The code above is dead simple in Python 2 because you're implicitly dealing with bytes everywhere. Command-line arguments are bytes, file names are bytes, and file contents are bytes. The language lawyers will point out that this is not correct and that it can cause problems, but if you think about it some more, you'll see that this is not a problem with a clean fix.

UNIX is bytes, was defined that way, and always will be. To understand why, you need to look at the different places data passes through:

- the terminal
- command-line arguments
- the operating system's input/output layer
- the file system driver

By the way, these aren't the only things data can pass through, but let's go with them for now. In how many of these situations do we know the encoding? The answer is: none. The closest we get to knowing an encoding is that the terminal exports locale information. This information can be used to show translations, but also to understand what encoding text information has.

For example, an LC_CTYPE value of en_US.utf-8 tells the application that the system is using US English and that most text data is encoded in utf-8. There are actually many other variables, but let's say that's the only one we need to look at. Note that LC_CTYPE does not mean that all data is utf-8 encoded. Instead it tells the application how to classify text characters and which case conversion rules to apply.
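To make the "hint" idea concrete, here is a minimal sketch of how a tool might read the encoding named by the locale variables. The helper name locale_encoding_hint is hypothetical and not part of any library, and the sketch assumes the common language_TERRITORY.encoding form of the value.

```python
import codecs
import os


def locale_encoding_hint(environ=None):
    """Return the encoding named by the locale variables, or None.

    Hypothetical helper: the result is only a hint for display purposes.
    It says nothing about how file names or file contents are encoded.
    """
    environ = os.environ if environ is None else environ
    # POSIX lookup order: LC_ALL overrides LC_CTYPE, which overrides LANG.
    # An empty value counts as unset.
    for key in ('LC_ALL', 'LC_CTYPE', 'LANG'):
        value = environ.get(key)
        if value:
            break
    else:
        return None
    if '.' not in value:
        # For example "C" or "POSIX": no explicit encoding is named.
        return None
    # "de_DE.ISO-8859-1@euro" -> "ISO-8859-1"
    encoding = value.split('.', 1)[1].split('@', 1)[0]
    try:
        # Normalize to the name Python uses for this codec.
        return codecs.lookup(encoding).name
    except LookupError:
        return None


print(locale_encoding_hint({'LC_CTYPE': 'en_US.utf-8'}))
```

In real programs the standard library's locale.getpreferredencoding() is the usual way to ask this question. Either way, the answer is a hint for display, not a promise about the data.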

This is important because of the C locale. The C locale is the only locale that POSIX actually specifies, and it says that the encoding is ASCII and that all responses from command-line tools are as defined in the POSIX spec.

For our cat tool above, there is no way to treat this data other than as bytes. The reason is that nothing on the shell indicates what the data is. For example, if you call cat hello.txt, the terminal passes hello.txt to your application encoded in the terminal's encoding.

But now think about this example: echo *. The shell passes all the file names of the current directory to your application. So what encoding are they in? File names have no encoding!
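Python 3 still lets you see file names the way the operating system stores them, as plain bytes. The following is a small sketch using only the standard library; list_raw_names is a hypothetical helper name chosen for illustration.

```python
import os
import tempfile


def list_raw_names(directory):
    """Return the directory entries as bytes, with no decoding applied."""
    # Passing a bytes path makes os.listdir return bytes names.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    # Create one empty file so there is something to list.
    open(os.path.join(tmp, 'hello.txt'), 'wb').close()
    names = list_raw_names(tmp)
    print(names, all(isinstance(name, bytes) for name in names))
```

Nothing in the returned names says which encoding, if any, produced them. That is exactly the situation a tool like cat is in.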

Unicode madness?

Now a Windows person will look at this and say: what on earth are the UNIX people doing? But it is not that dire. The reason this all works is that some smart people designed the system to be backward compatible. Unlike Windows, which defines each API twice, on POSIX the best approach is to assume that everything is bytes and, for display purposes only, to decode it using an encoding hint.

Take the cat command above as an example. It prints an error message for files that cannot be opened, whether because they do not exist, because they are protected, or whatever. Let's assume the file name is encoded in latin1 because it came from an external drive from 1995. The terminal will take our output and try to decode it as utf-8, because that's what it thinks it is working with. Because the string is latin1, it will not decode properly. But don't worry, nothing crashes, because your terminal simply ignores what it can't handle.

What about graphical interfaces? They keep two versions of each name. When a GUI like Nautilus lists all the files, it associates the raw bytes of the file name with the icon for double-clicking, and separately tries to decode the file name into something it can display. For example, it will attempt to decode it as utf-8 and replace decoding errors with question marks. Your file name may not be completely readable, but you can still open the file.
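The "two versions of each name" idea can be sketched in a few lines. This is only an illustration of the approach, not how Nautilus is implemented, and display_name is a hypothetical helper. Note that Python's 'replace' error handler inserts U+FFFD (the replacement character) rather than a literal question mark.

```python
def display_name(raw_name, encoding_hint='utf-8'):
    """Decode a bytes file name for display only. Never raises."""
    return raw_name.decode(encoding_hint, 'replace')


# A latin1-encoded name, as it might arrive from an old drive.
raw = b'caf\xe9.txt'

# Keep the raw bytes for opening the file, and a label for showing it.
entry = {'raw': raw, 'label': display_name(raw)}
print(ascii(entry['label']))
```

The label may be imperfect, but the raw bytes are untouched, so open(entry['raw'], 'rb') would still address the right file.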

Unicode on UNIX is only crazy when you force everything to use it. But that's not the way unicode works on UNIX. UNIX has no API that distinguishes unicode from bytes. They're the same, which makes it easier to handle.

C Locale

The C locale comes up a lot here. The C locale is an escape hatch in the POSIX specification that forces everyone to behave the same way. A POSIX-compliant operating system needs to support setting LC_CTYPE to C, which makes everything ASCII.

This locale is picked in a number of different situations. You will mostly find it in effect for programs launched from cron, from your init system, and for child processes started with an empty environment. The C locale restores a sane ASCII land in environments where you otherwise couldn't trust anything.

But ASCII is a 7-bit encoding. This is not a problem, because the operating system deals in bytes! Anything 8-bit based will work fine, but if you follow the conventions of the operating system, character processing will be limited to the first 7 bits. Any message your tool generates will be encoded in ASCII and written in English.

Note that the POSIX specification does not say that your application should die in flames.

Python 3 dies in flames

Python 3 takes a different position on Unicode than UNIX does. Python 3 says: everything is Unicode (by default, except in some cases, and except if we send it re-encoded data, but even then it is sometimes still Unicode, albeit wrong Unicode). File names are Unicode, the terminal is Unicode, stdin and stdout are Unicode, there is so much Unicode. Since UNIX is not Unicode, Python 3 now takes the stance that it is right and UNIX is wrong, and that people should change the POSIX specification to add Unicode. Then file names would be Unicode and terminals would be Unicode, and you would never see bytes again.

I'm not alone. These are the bugs caused by Python's brain-dead idea of Unicode:

(link: http://bugs.python.org/issue13643#msg149941)
(link: http://bugs.python.org/issue19977)
(link: http://bugs.python.org/issue19846)
(link: http://bugs.python.org/issue21398)

If you Google around, you'll find many more rants. See how many people fail to install pip packages because of some characters in a changelog, or because of their home folder's name, or because their SSH session is ASCII, or because they are connected through PuTTY.
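The failure mode behind many of these reports can be reproduced without touching the environment. In this sketch an in-memory stream that only accepts ASCII stands in for stdout under an ASCII locale. That stand-in is an assumption for illustration, as is the helper name write_to_ascii_stream.

```python
import io


def write_to_ascii_stream(text, errors='strict'):
    """Write text to an ASCII-only text stream.

    Returns the bytes that were written, or None if encoding failed.
    """
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding='ascii', errors=errors)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        # This is the crash users see when a changelog contains an umlaut.
        return None
    return raw.getvalue()


print(write_to_ascii_stream('Changelog'))
print(write_to_ascii_stream('\xc4nderungen'))
print(write_to_ascii_stream('\xc4nderungen', errors='backslashreplace'))
```

A more forgiving error handler such as 'backslashreplace' keeps the program alive, at the cost of mangled output.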

Python3 cat

Now let's fix cat for Python 3. How do we do that? First, we need to deal with bytes, because someone might echo something that doesn't match the encoding the shell claims. So at the very least the file contents need to be bytes. But we also need to open standard output in a way that supports bytes, which it does not by default. We also need to handle separately the case where the Unicode APIs fail because the locale is C. So this is cat for Python 3:


import sys
import shutil
 
def _is_binary_reader(stream, default=False):
  try:
    return isinstance(stream.read(0), bytes)
  except Exception:
    return default
 
def _is_binary_writer(stream, default=False):
  try:
    stream.write(b'')
  except Exception:
    try:
      stream.write('')
      return False
    except Exception:
      pass
    return default
  return True
 
def get_binary_stdin():
  # sys.stdin might or might not be binary in some extra cases. By
  # default it's obviously non binary which is the core of the
  # problem but the docs recommend changing it to binary for such
  # cases so we need to deal with it. Also someone might put
  # StringIO there for testing.
  is_binary = _is_binary_reader(sys.stdin, False)
  if is_binary:
    return sys.stdin
  buf = getattr(sys.stdin, 'buffer', None)
  if buf is not None and _is_binary_reader(buf, True):
    return buf
  raise RuntimeError('Did not manage to get binary stdin')
 
def get_binary_stdout():
  if _is_binary_writer(sys.stdout, False):
    return sys.stdout
  buf = getattr(sys.stdout, 'buffer', None)
  if buf is not None and _is_binary_writer(buf, True):
    return buf
  raise RuntimeError('Did not manage to get binary stdout')
 
def filename_to_ui(value):
  # The bytes branch is unnecessary for *this* script but otherwise
  # necessary as python 3 still supports addressing files by bytes
  # through separate APIs.
  if isinstance(value, bytes):
    value = value.decode(sys.getfilesystemencoding(), 'replace')
  else:
    value = (value.encode('utf-8', 'surrogateescape')
                  .decode('utf-8', 'replace'))
  return value
 
binary_stdout = get_binary_stdout()
for filename in sys.argv[1:]:
  if filename != '-':
    try:
      f = open(filename, 'rb')
    except IOError as err:
      print('cat.py: %s: %s' % (
        filename_to_ui(filename),
        err
      ), file=sys.stderr)
      continue
  else:
    f = get_binary_stdin()
 
  with f:
    shutil.copyfileobj(f, binary_stdout)

And this is not even the worst version. It's not that I want to make things more complicated, it simply is that complicated now. For example, what this example does not do is forcibly flush the text stdout before fetching the binary one. It's not necessary in this case because the print calls here go to stderr instead of stdout, but if you wanted to print to stdout, you would have to flush. Why is that? Because stdout is a buffer on top of another buffer, and if you don't flush it forcibly, your output may come out in the wrong order.
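A minimal sketch of that flush-before-bytes rule, using an in-memory stream in place of the real stdout so the result can be inspected. The helper name write_text_then_bytes is hypothetical; the same call pattern would apply to sys.stdout and sys.stdout.buffer.

```python
import io


def write_text_then_bytes(text_stream, text, payload):
    """Write text, then raw bytes, to the same stream in that order."""
    text_stream.write(text)
    # Push anything still pending in the text layer down to the binary
    # layer before writing bytes underneath it.
    text_stream.flush()
    text_stream.buffer.write(payload)
    text_stream.buffer.flush()


raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='utf-8')
write_text_then_bytes(out, 'header\n', b'payload\n')
print(raw.getvalue())
```

Without the flush() on the text layer, the text can still be sitting in its buffer when the bytes are written, which is how the output ends up reordered.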

It's not just me. For instance, look at Twisted's compat module (link: https://github.com/twisted/twisted/blob/log-booyah-6750-4/twisted/python/compat.py) and you will find the same trouble.

The encoding dance

To understand the life of a file name argument passed from the shell, this is what happens on Python 3 in the worst case:

1. The shell passes the file name as bytes to the script.
2. The bytes are decoded by Python using the expected encoding before they ever hit your code. Because this is a lossy process, Python 3 uses a special error handler that turns decoding errors into surrogates in the string.
3. The Python code runs into a file that cannot be opened and needs to format an error message. Because we are writing to a text stream, we cannot write the surrogates out, since they are not valid Unicode.
4. So we encode the Unicode string containing the surrogates as utf-8 and tell it to handle the surrogate escapes.
5. Then we decode from utf-8 and tell it to replace whatever fails to decode.
6. The resulting string goes back out to the text-only stream.
7. Then the terminal decodes our string to display it.
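The dance can be traced in a few lines. This sketch assumes utf-8 as the expected encoding and uses a latin1-encoded name as the worst case; encoding_dance is a hypothetical helper name that mirrors what filename_to_ui does in the script above.

```python
def encoding_dance(raw_bytes, encoding='utf-8'):
    """Return (name_as_seen_by_code, name_safe_for_a_text_stream)."""
    # Step 2: undecodable bytes become lone surrogates in the string.
    name = raw_bytes.decode(encoding, 'surrogateescape')
    # Steps 4 and 5: turn the surrogates back into the original bytes,
    # then decode again, replacing whatever is still invalid.
    shown = name.encode(encoding, 'surrogateescape').decode(encoding, 'replace')
    return name, shown


name, shown = encoding_dance(b'caf\xe9.txt')
print(ascii(name), ascii(shown))

# Step 3: a strict text stream cannot encode the surrogate form.
try:
    name.encode('utf-8')
except UnicodeEncodeError as err:
    print('cannot write as-is:', err.reason)
```

The string with surrogates is still useful for opening the file, because encoding it with 'surrogateescape' gives back the original bytes. It is only unusable for display.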

Here's what it looks like on Python 2:

1. The shell passes the file name as bytes to the script.
2. The shell decodes our string for display.

And because no string handling happens anywhere in between, the Python 2 version is just as correct, if not more so, because the shell can then do a better job of displaying the file name.

Note that this does not make the script any less correct. If you need to do actual string manipulation on the input data, you switch to Unicode handling on both 2.x and 3.x. But in that case you also want your script to support a --charset parameter, so it is about as much work on 2.x as on 3.x. It's just worse on 3.x, because there you first need to build the binary standard output, which you don't need to do on 2.x.

But you're wrong

Obviously I'm wrong. I have been told that:

- I only suffer because I don't think like a beginner, and the new Unicode system is much friendlier to beginners.
- I don't think about Windows users and how much the new text model improves things for them.
- The problem is not Python, the problem is the POSIX specification.
- Linux distributions need to start supporting C.UTF-8, because they are stuck in the past.
- The problem is that SSH is passing the wrong encoding. SSH needs to fix this.
- The real problem behind the large number of Unicode errors in Python 3 is that people don't pass explicit encodings and instead assume that Python 3 makes the right decision.
- I work with boundary code, which is obviously harder in Python 3.
- I should be working on fixing Python 3 instead of complaining on Twitter and my blog.
- I create problems where there are none. Just let everyone fix their environment and their encodings and everything is fine. This is a user problem.
- Java has had this problem for years, and developers got along just fine.

You know what? I did stop complaining while I was working on HTTP, because I accepted the idea that a lot of the problems with HTTP/WSGI are ones that ordinary people don't run into. But what do you know? The same problems show up in Hello World style situations. Maybe I should give up on achieving high-quality Unicode support in my libraries and leave it at that.

I can argue against each of these points, but in the end it doesn't matter. If Python 3 were the only Python I could use, I would swallow all the problems and develop with it. But it isn't. There's another perfectly good language called Python 2, which has a much larger user base, and that user base is holding firm. It's very frustrating.

Python 3 might be powerful enough to push UNIX down the Windows path, with Unicode used in many places, but I doubt it.

The more likely outcome is that people stick with Python 2 or build broken things on Python 3. Or they use Go. That language uses a model very similar to Python 2: everything is a byte string, and the assumed encoding is utf-8. That's all.