Python Chinese string processing code

  • 2020-04-02 09:31:24
  • OfStack

> > > Teststr = 'my eclipse cannot decode GBK code correctly! '
> > > teststr
'\ xe6 \ x88 \ x91 \ xe7 \ x9a \ x84eclipse \ xe4 \ xb8 \ x8d \ xe8 \ x83 \ XBD \ xe6 \ xad \ xa3 \ xe7 \ xa1 \ xae \ xe7 \ x9a \ x84 \ xe8 \ xa7 \ xa3 \ xe7 \ xa0 \ x81gbk \ xe7 \ xa0 \ x81 \ xef \ XBC \ x81'
> > > Tests2 = u' my eclipse cannot decode GBK code correctly! '
> > > Test3 = tests2. Encode (' gb2312)
> > > test3
'\ xce \ xd2 \ xb5 \ xc4eclipse \ sets \ XBB \ xc4 \ XDC \ xd5 \ XFD \ xc8 \ xb7 \ xb5 \ xc4 \ XBD \ xe2 \ xc2 \ xebgbk \ xc2 \ xeb \ xa3 \ xa1'
> > > test3
'\ xce \ xd2 \ xb5 \ xc4eclipse \ sets \ XBB \ xc4 \ XDC \ xd5 \ XFD \ xc8 \ xb7 \ xb5 \ xc4 \ XBD \ xe2 \ xc2 \ xebgbk \ xc2 \ xeb \ xa3 \ xa1'
> > > teststr
'\ xe6 \ x88 \ x91 \ xe7 \ x9a \ x84eclipse \ xe4 \ xb8 \ x8d \ xe8 \ x83 \ XBD \ xe6 \ xad \ xa3 \ xe7 \ xa1 \ xae \ xe7 \ x9a \ x84 \ xe8 \ xa7 \ xa3 \ xe7 \ xa0 \ x81gbk \ xe7 \ xa0 \ x81 \ xef \ XBC \ x81'
> > > Test3. Decode (' gb2312) encode (' utf-8)
'\ xe6 \ x88 \ x91 \ xe7 \ x9a \ x84eclipse \ xe4 \ xb8 \ x8d \ xe8 \ x83 \ XBD \ xe6 \ xad \ xa3 \ xe7 \ xa1 \ xae \ xe7 \ x9a \ x84 \ xe8 \ xa7 \ xa3 \ xe7 \ xa0 \ x81gbk \ xe7 \ xa0 \ x81 \ xef \ XBC \ x81'
> > > Test3. Decode (' gb2312) encode (' utf-8) = = teststr
True,
As shown above, decoding test3 (a GB2312-encoded byte string) yields a unicode string, and encoding that unicode string as UTF-8 produces exactly the same bytes as teststr.

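The example also shows that unicode strings act as a bridge between GB2312 byte strings (the encoding family used by Chinese-locale Windows) and UTF-8 byte strings (the encoding of teststr in this session): to convert from one encoding to the other, first decode to unicode, then encode to the target.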
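The "bridge" idea can be wrapped in a small helper. This is a minimal Python 3 sketch; the function name transcode is hypothetical and does not come from the article or from any library:

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes from one encoding to another via a Unicode str.

    Hypothetical helper for illustration; it simply chains the
    standard bytes.decode and str.encode methods.
    """
    return data.decode(source).encode(target)


gb_bytes = "解码".encode("gb2312")              # b'\xbd\xe2\xc2\xeb'
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")

# Going back the other way recovers the original GB2312 bytes.
print(transcode(utf8_bytes, "utf-8", "gb2312") == gb_bytes)  # True
```

If the input bytes are not valid in the source encoding, decode raises UnicodeDecodeError. If a character has no representation in the target encoding, encode raises UnicodeEncodeError.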
