Resolving Chinese encoding problems in PHP development

  • 2020-08-22 21:55:36
  • OfStack

Chinese encoding problems in PHP programming used to trouble many people. The cause is actually simple. Every country (or region) defines a character encoding set for computer information interchange, such as extended ASCII in the United States, GB2312-80 in China, and JIS in Japan. As the basis of information processing in that country or region, the character encoding set plays an important role in unifying encoding. By length, character encoding sets are divided into SBCS (single-byte character sets) and DBCS (double-byte character sets). To process local character information, early software (especially operating systems) shipped in various localized versions (L10N), and concepts such as LANG and Codepage were introduced to tell them apart. However, because the code ranges of the local character sets overlap, exchanging information between them is difficult, and maintaining each localized version independently is costly. It therefore became necessary to extract what localization work has in common and handle it uniformly, keeping locale-specific processing to a minimum. This is known as internationalization (I18N). The various language settings were further standardized as Locale information, and the underlying character set became Unicode, which covers almost all glyphs.

Today, most software with internationalization support does its core character processing in Unicode, determines the local character encoding from the Locale/LANG/Codepage settings in effect at run time, and processes local characters accordingly. This requires converting between Unicode and the local character set, or even between two different local character sets with Unicode as the intermediate form. The same approach extends to networked environments: the character data at each end of the connection also has to be converted into an acceptable form according to the configured character set.

Character set encoding problems in the database
Popular relational database systems support database character set encoding: a database can be created with its own character set, and its data is stored in that encoding. When an application accesses the data, a character encoding conversion takes place on the way in and on the way out. For Chinese data, the database character encoding should be set so that data integrity is preserved. GB2312, GBK, and UTF-8 are all possible choices. We could also choose ISO8859-1 (8-bit), but then the application has to split each 16-bit Chinese or Unicode character into two 8-bit characters before writing data, and after reading data it has to recombine the two bytes and also pick out the SBCS characters among them. We therefore do not recommend ISO8859-1 as the database character set encoding: it makes no use of the database's own character set support and makes the program more complex. While programming, you can use the management tools that come with the database management system to check whether the Chinese data is stored correctly.

Have the PHP program execute mysql_query("SET NAMES xxxx") before querying the database, where xxxx is the encoding of your web page (charset=xxxx): if the page uses charset=utf-8, then xxxx=utf8; if the page uses charset=gb2312, then xxxx=gb2312. Almost every web application keeps its shared database connection code in one file; add mysql_query("SET NAMES xxxx") to that file.

SET NAMES indicates which character set the client uses for the SQL statements it sends. The statement SET NAMES 'utf8' therefore tells the server that future messages from this client are in the utf8 character set. It also specifies the character set the server uses when sending results back to the client (for example, the character set of the column values returned by a SELECT statement).
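The mapping from page charset to SET NAMES can be sketched as follows. This is only an illustration: the helper names mysql_charset_for_page() and set_names_sql() are hypothetical, the lookup table covers only the charsets named in this article, and the connection parameters in the comments are placeholders. Note that the mysql_* functions were removed in PHP 7.0, so the commented usage assumes the mysqli extension instead; on current MySQL versions utf8mb4 is usually preferred over utf8.

```php
<?php
// Hypothetical helper: translate a page charset (as written in HTML/HTTP)
// into the name MySQL expects. Only the charsets from this article are listed.
function mysql_charset_for_page(string $pageCharset): string
{
    $map = [
        'utf-8'  => 'utf8',
        'utf8'   => 'utf8',
        'gb2312' => 'gb2312',
        'gbk'    => 'gbk',
    ];
    $key = strtolower(trim($pageCharset));
    if (!isset($map[$key])) {
        throw new InvalidArgumentException("Unsupported charset: $pageCharset");
    }
    return $map[$key];
}

// Hypothetical helper: build the SET NAMES statement for a page charset.
function set_names_sql(string $pageCharset): string
{
    return "SET NAMES '" . mysql_charset_for_page($pageCharset) . "'";
}

echo set_names_sql('UTF-8'), "\n";  // SET NAMES 'utf8'
echo set_names_sql('gb2312'), "\n"; // SET NAMES 'gb2312'

// In the shared connection file (placeholder credentials, assumed mysqli):
// $db = new mysqli('localhost', 'user', 'password', 'dbname');
// $db->query(set_names_sql('UTF-8'));
// mysqli also offers $db->set_charset(mysql_charset_for_page('UTF-8'));
```

Keeping the mapping in one place means the page charset and the connection charset cannot drift apart silently.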

A common technique for locating problems
The clumsiest yet most effective way to locate a Chinese encoding problem is to print the internal code (the byte values) of a string after it has passed through the code you suspect. From the printed codes you can tell when Chinese characters are converted to Unicode, when Unicode is converted to a Chinese encoding, when one Chinese character becomes two Unicode characters, when a Chinese string is turned into a string of question marks, and when the high bits of a Chinese string are cut off.

Choosing a suitable sample string also helps to distinguish the type of problem. A good sample mixes Chinese and English characters and includes characters that are characteristic of both GB and GBK. In general, English characters are not distorted no matter how they are converted or processed (if they are, try a longer run of consecutive English letters).
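Printing the internal code of a string can be done along these lines. This is a minimal sketch: dump_bytes() is a hypothetical helper name, the sample string is written with byte escapes so that the encoding of the source file does not matter, and the GBK conversion assumes the iconv extension is available with GBK support.

```php
<?php
// Hypothetical helper: show a string as space-separated hex bytes.
function dump_bytes(string $s): string
{
    // bin2hex() yields two hex digits per byte; split them for readability.
    return implode(' ', str_split(bin2hex($s), 2));
}

// The Chinese character U+4E2D followed by "a", encoded as UTF-8.
$utf8 = "\xE4\xB8\xADa";
echo dump_bytes($utf8), "\n"; // e4 b8 ad 61

// Assumption: iconv is installed and knows GBK; skip the step otherwise.
if (function_exists('iconv')) {
    $gbk = iconv('UTF-8', 'GBK', $utf8);
    if ($gbk !== false) {
        // In GBK the same character should take two bytes instead of three.
        echo dump_bytes($gbk), "\n";
    }
}
```

Calling dump_bytes() before and after each suspect step shows exactly where the bytes change, for example where a three-byte UTF-8 sequence turns into a single 3f (a question mark).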

Solving garbled-text problems in various applications

1) Use the <meta http-equiv="content-type" content="text/html;charset=xxx"> tag to set the page encoding
The purpose of this tag is to declare which character set the client's browser should use to display the page. xxx can be GB2312, GBK, UTF-8 (note that MySQL writes it as utf8), and so on. Most pages can use this method to tell the browser which encoding to use, which avoids encoding errors and the garbled text they cause. Sometimes, however, it has no effect: whatever xxx is, the browser always uses the same encoding. The reason is discussed later.

Please note that <meta> is part of the HTML content and is only a declaration; by the time it takes effect, the server has already sent the HTML to the browser.

2) header("content-type:text/html; charset=xxx");
The header() function sends the string in parentheses as an HTTP header. If the string is the one shown above, the function does essentially the same job as the tag in 1); compare the two and you will see that the content is the same. The difference is that with this function the browser always uses the xxx encoding you asked for, which makes it very useful. Why? The answer lies in the difference between HTTP headers and HTML content:

HTTP headers are strings that the server sends before the HTML content when talking to the browser over HTTP. The <meta> tag is part of the HTML content, so what header() sends reaches the browser first; put loosely, header() has a higher priority than <meta>. If a PHP page contains both header("content-type:text/html; charset=xxx") and a <meta> tag, the browser honors the former, the HTTP header, and ignores the <meta> tag. Of course, this function can only be used in PHP pages.
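A page that sends the charset both ways might look like the sketch below. The helper name content_type_header() is hypothetical and UTF-8 is only an example value; the one firm requirement is that header() runs before any output is sent.

```php
<?php
// Hypothetical helper: build the Content-Type header line for a charset.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() must be called before any output, so do it at the top of the page.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}

// The <meta> declaration is part of the HTML body and arrives afterwards.
echo '<meta http-equiv="content-type" content="text/html; charset=UTF-8">', "\n";
```

Keeping the two values identical avoids confusion when the page is later saved to disk and opened without any HTTP headers.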

This still leaves one question: why does header() always work while <meta> sometimes does not? The answer lies in Apache, which is discussed next.

3) AddDefaultCharset
The conf folder under the Apache root directory contains httpd.conf, Apache's main configuration file.

Open httpd.conf in a text editor. Around line 708 (this varies between versions) there is AddDefaultCharset xxx, where xxx is an encoding name. This line sets xxx as the default character set in the HTTP headers of the web files served by the whole server. Having it is equivalent to adding header("content-type:text/html; charset=xxx") to every file. This explains why the browser keeps using gb2312 even though <meta> specifies utf-8.

If a PHP page contains header("content-type:text/html; charset=xxx"), it replaces the default character set with the one you specify, which is why this function always works. If you put a "#" in front of AddDefaultCharset xxx to comment it out, and the page does not call header("content-type..."), then the meta tag takes effect.

From highest to lowest, the priority of the methods above is:
  • header("content-type:text/html; charset=xxx")
  • AddDefaultCharset xxx
  • <meta http-equiv="content-type" content="text/html;charset=xxx">
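The priority order can be expressed as a small function. This is a simplified model of the list above, not a description of how any browser or server is implemented; effective_charset() and its three parameters are hypothetical, with null meaning that the corresponding setting is absent.

```php
<?php
// Simplified model of the priority list: the first setting that is present wins.
// All parameters are hypothetical inputs; null means "not set".
function effective_charset(?string $phpHeader, ?string $apacheDefault, ?string $metaTag): ?string
{
    // header() beats AddDefaultCharset, which beats <meta>.
    return $phpHeader ?? $apacheDefault ?? $metaTag;
}

// Apache default present, no header() call: <meta> is ignored.
echo effective_charset(null, 'gb2312', 'utf-8'), "\n";    // gb2312

// header() present: it overrides the Apache default.
echo effective_charset('utf-8', 'gb2312', 'utf-8'), "\n"; // utf-8

// Neither header() nor AddDefaultCharset: <meta> finally takes effect.
echo effective_charset(null, null, 'utf-8'), "\n";        // utf-8
```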

If you are a web programmer, it is recommended to add header("content-type:text/html; charset=xxx") to each of your pages, so that they display correctly on any server and remain portable.

4) default_charset configuration in php.ini:
default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment this line out and let the browser choose automatically based on the charset given in the page header rather than forcing one value, so that pages in several languages can be served from the same server.
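To see which value is in effect without opening php.ini, the setting can be read at run time, and it can also be overridden for a single script. The following is a small sketch; GB2312 is only an example value.

```php
<?php
// Read the value currently in effect (from php.ini or PHP's built-in default).
$before = ini_get('default_charset');

// Override it for this script only; php.ini itself is left untouched.
ini_set('default_charset', 'GB2312');
$after = ini_get('default_charset');

echo 'before: "', $before, '"', "\n";
echo 'after:  "', $after, '"', "\n";
```

A per-script override like this lets pages in different languages coexist on a server whose php.ini cannot be edited.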

Conclusion
In fact, Chinese encoding in PHP is not as complicated as it might seem. Although there is no fixed procedure for locating and solving these problems, and runtime environments differ, the underlying principles are the same. Knowledge of character sets is the foundation for solving character problems. However, as Chinese character sets continue to change, Chinese information processing will remain a problem for some time, and not only in PHP programming.

