PHP, UTF-8, Unicode, and BOM issues

  • 2020-03-31 20:50:16
  • OfStack

One, introduction

UTF-8 is a Unicode character encoding commonly used in web applications. Its advantage is that it is a variable-length encoding: characters in the ASCII range are encoded in a single byte.
The UTF-8 signature, also known as the BOM (Byte Order Mark), is the standard mark used in UTF encoding schemes to identify the encoding. In UTF-16 it is the byte pair FF FE (little-endian) or FE FF (big-endian); in UTF-8 the same character is encoded as the three bytes EF BB BF. In UTF-8 the mark is optional because UTF-8 bytes have no byte order, so its only use is to signal that a byte stream is UTF-8 encoded. Microsoft software does this, but some other software does not and treats the mark as an ordinary character. Microsoft writes the three bytes EF BB BF at the start of its UTF-8 text files, and Notepad and other Windows programs use these bytes to decide whether a text file is ASCII or UTF-8. This is a Microsoft convention rather than a requirement of the encoding, so a UTF-8 file may or may not have a BOM.
A single BOM at the start of a stream causes no problem. If several signed files are concatenated, however, the resulting binary stream contains several UTF-8 signatures, which is the cause of the "root element must be well-formed" failure in XML conversion.
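As a minimal sketch of the point above, the following PHP builds two in-memory "files" that each carry a signature and concatenates them. The constant UTF8_BOM and the helper hasUtf8Bom() are hypothetical names chosen for this illustration, not part of PHP.

```php
<?php
// The UTF-8 signature: the three bytes EF BB BF.
const UTF8_BOM = "\xEF\xBB\xBF";

// Hypothetical helper: true when $bytes starts with the UTF-8 signature.
function hasUtf8Bom(string $bytes): bool
{
    return strncmp($bytes, UTF8_BOM, 3) === 0;
}

// Two "files", each saved with its own signature.
$first  = UTF8_BOM . "<root>";
$second = UTF8_BOM . "</root>";

// Concatenating them leaves a second signature in the middle of the stream.
$merged = $first . $second;

var_dump(hasUtf8Bom($merged));             // is there a leading signature?
var_dump(substr_count($merged, UTF8_BOM)); // how many signatures in total?
```

The second signature sits between the two tags, where a parser is likely to treat it as stray content rather than as an encoding mark.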

Two, viewing and converting

Since a UTF-8 file may or may not have a BOM, how do you tell the difference?
Open the file in a program with a hexadecimal editing mode, such as UltraEdit-32, switch to hex mode, and check whether the file begins with EF BB BF. If it does, the file has a BOM.
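The same check can be done without a hex editor. The sketch below reads the first three bytes of a file and prints them in hexadecimal; fileHeaderHex() is a hypothetical helper, and the temporary file only exists to keep the example self-contained.

```php
<?php
// Hypothetical helper: return the first three bytes of a file as hex pairs,
// for example "ef bb bf" for a file saved with a UTF-8 signature.
function fileHeaderHex(string $path): string
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $header = (string) fread($handle, 3);
    fclose($handle);

    // bin2hex() yields one run of hex digits; split it into byte pairs.
    return implode(' ', str_split(bin2hex($header), 2));
}

// Demo on a temporary file that starts with the signature.
$path = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($path, "\xEF\xBB\xBFhello");
echo fileHeaderHex($path), PHP_EOL;
unlink($path);
```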
Notepad, which ships with Windows, adds a BOM by default when saving as UTF-8.
There are many ways to convert a file; UltraEdit-32 and Notepad++ are both common choices. Taking UltraEdit-32 as an example: after opening the file, select "Save As". The "Format" column offers the following choices:

[Screenshot: UltraEdit-32 "Save As" format options, http://files.jb51.net/upload/201005/20100518234951351.gif]

In addition, Dreamweaver CS3 has a similar option. In Preferences, if you select Unicode (UTF-8) as the default encoding, you can tick "Include Unicode Signature (BOM)" to write the byte order mark into the document; otherwise no BOM is written:
[Screenshot: Dreamweaver CS3 encoding preferences, http://files.jb51.net/upload/201005/20100518235004897.gif]
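The conversion that these editors perform can also be sketched in PHP. This is an illustration under simple assumptions (the file fits in memory and has at most one leading signature); stripUtf8Bom() and removeBomFromFile() are hypothetical helper names.

```php
<?php
// Hypothetical helper: remove one leading UTF-8 signature, if present.
function stripUtf8Bom(string $bytes): string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($bytes, 3);
    }
    return $bytes;
}

// Hypothetical helper: rewrite a file without its signature.
// Returns true when a signature was found and removed.
function removeBomFromFile(string $path): bool
{
    $original = file_get_contents($path);
    if ($original === false) {
        throw new RuntimeException("Cannot read $path");
    }
    $stripped = stripUtf8Bom($original);
    if ($stripped === $original) {
        return false; // no signature, leave the file untouched
    }
    return file_put_contents($path, $stripped) !== false;
}
```

Calling removeBomFromFile() a second time on the same file returns false, so the helper is safe to run repeatedly.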
Other knowledge
According to (link: http://blog.csdn.net/thimin/archive/2007/08/03/1724393.aspx):
What Windows calls a file saved as "Unicode" is actually UTF-16, whose values happen to coincide with the Unicode code values. Unicode and UTF are conceptually different, though: Unicode is the in-memory representation of characters, while a UTF is a scheme for storing and transmitting Unicode. UTF-16 comes in two byte orders: little-endian (LE) and big-endian (BE). The official UTFs also include UTF-32, likewise in LE and BE forms. UTF-7 is an unofficial UTF used mainly for mail transmission. The single-byte portion of UTF-8 is compatible with ASCII, and therefore with the lower half of ISO-8859-1. UTF-8 came about largely because older systems and library functions could not handle UTF-16 correctly; it also saves file space for English text, at the expense of non-English characters. ASCII characters take one byte in both UTF-8 and ISO-8859-1, while UTF-8 uses two to four bytes for all other characters.
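The variable length is easy to observe in PHP, because strlen() counts bytes rather than characters. The sketch below uses the "\u{...}" escape available since PHP 7 so that it does not depend on how the script file itself is saved; the sample characters are arbitrary choices, and UTF-8 as defined today uses up to four bytes per character.

```php
<?php
// strlen() counts bytes, so it exposes the UTF-8 length of each character.
$samples = [
    'U+0041 LATIN CAPITAL LETTER A'     => "\u{0041}",
    'U+00E9 LATIN SMALL LETTER E ACUTE' => "\u{00E9}",
    'U+4E2D CJK ideograph'              => "\u{4E2D}",
    'U+1F600 emoji'                     => "\u{1F600}",
];

foreach ($samples as $label => $char) {
    printf("%-36s %d byte(s): %s\n", $label, strlen($char), bin2hex($char));
}
```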

A more detailed description of the BOM, from (link: #):
UCS defines a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE is not a character in UCS, so it should never appear in real data. The UCS specification recommends transmitting "ZERO WIDTH NO-BREAK SPACE" before a byte stream: if the recipient reads FE FF, the stream is big-endian; if it reads FF FE, the stream is little-endian. For this reason the character is also called the BOM.
UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF, so a receiver that sees a byte stream beginning with EF BB BF knows it is UTF-8.
Windows uses the BOM to mark the encoding of text files.
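The three byte sequences mentioned above can be reproduced in PHP without any extension, and the receiver's side can be sketched as a small sniffing function. sniffBom() is a hypothetical helper for this illustration; it deliberately ignores UTF-32, whose little-endian mark also begins with FF FE.

```php
<?php
// U+FEFF serialized three ways.
$utf16be = pack('n', 0xFEFF); // 16-bit big-endian:    FE FF
$utf16le = pack('v', 0xFEFF); // 16-bit little-endian: FF FE
$utf8    = "\u{FEFF}";        // UTF-8 bytes of the code point: EF BB BF

// Hypothetical helper: guess the encoding from the leading bytes.
// Returns null when no known mark is found.
function sniffBom(string $bytes): ?string
{
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return 'UTF-8';
    }
    if (strncmp($bytes, "\xFE\xFF", 2) === 0) {
        return 'UTF-16BE';
    }
    if (strncmp($bytes, "\xFF\xFE", 2) === 0) {
        return 'UTF-16LE'; // assumption: UTF-32LE is not considered here
    }
    return null;
}

foreach ([$utf16be, $utf16le, $utf8, 'plain'] as $stream) {
    printf("%-8s => %s\n", bin2hex(substr($stream, 0, 3)), sniffBom($stream) ?? 'no BOM');
}
```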

PHP does not support the BOM either.
PHP was not designed with the BOM in mind, which means it does not skip the three BOM bytes at the start of a UTF-8 encoded file. Only the code that follows <? or <?php is executed as PHP, so the three bytes in front of it are output directly. If a plug-in file has this problem, the admin page shows a white screen after the plug-in is activated or deactivated; if a template file has it, the three bytes are output as they are and produce a small blank line at the top of the page. Plug-ins and templates written in English generally use ASCII and have no BOM; only domestic plug-ins and templates cause the problem, when their authors are unaware of it. Likewise, because the output page uses UTF-8, a template that you edit to add Chinese characters must be saved as UTF-8 to display correctly. If the editor adds a BOM automatically at that point, the three bytes are output on the page. The visible effect depends on the browser; usually it is a blank line or a garbled character.
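The behaviour described above can be observed with a small experiment. This is a sketch that assumes a default PHP configuration in which the engine does no BOM detection of its own; the temporary script stands in for a plug-in or template file that an editor has saved with a signature.

```php
<?php
// Write a tiny script that an editor has "signed" with a UTF-8 BOM.
$script = tempnam(sys_get_temp_dir(), 'bom');
file_put_contents($script, "\xEF\xBB\xBF<?php echo 'hello';");

// Capture everything the included file sends to the output.
ob_start();
include $script;
$output = ob_get_clean();
unlink($script);

// Dump the captured output as hex to make invisible bytes visible.
echo bin2hex($output), PHP_EOL;
```

Under that assumption the hex dump starts with efbbbf: the bytes in front of the opening tag are treated like any other text outside PHP tags and are emitted before the script's own output.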
One more point: these three bytes are especially likely to cause display problems in the browser when PHP is used to include templates. (link: http://www.linuxfly.org/attachment.php? Fid = 574).
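When a site includes many template files, it can help to locate the signed ones in bulk. The following is a minimal sketch; findFilesWithBom() is a hypothetical helper, and the default extension list is an assumption to adjust to the project.

```php
<?php
// Hypothetical helper: list files under $dir (recursively) whose extension is
// in $extensions and whose first three bytes are EF BB BF.
function findFilesWithBom(string $dir, array $extensions = ['php']): array
{
    $found = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if (!$file->isFile()) {
            continue;
        }
        if (!in_array(strtolower($file->getExtension()), $extensions, true)) {
            continue;
        }
        // Only the first three bytes are needed for the check.
        $header = file_get_contents($file->getPathname(), false, null, 0, 3);
        if ($header === "\xEF\xBB\xBF") {
            $found[] = $file->getPathname();
        }
    }
    sort($found);
    return $found;
}
```

A call such as findFilesWithBom('/path/to/templates', ['php', 'html']) returns the paths that need to be re-saved without a signature; the directory shown is a placeholder.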
