ASP.NET URL encoding and decoding

  • 2020-05-16 06:41:55
  • OfStack

For example, a URL query string uses key=value pairs to pass arguments, with pairs separated by the ampersand, as in /s?q=abc&ie=utf-8. If a value itself contains = or &, the server receiving the URL will parse it incorrectly, so the ambiguous & and = characters must be escaped, that is, encoded.

Also, a URL uses ASCII rather than Unicode, which means a URL cannot contain non-ASCII characters such as Chinese; otherwise, Chinese characters may cause problems when the client browser and the server support different character sets.

The principle of URL encoding is to use safe characters (printable characters with no special purpose or meaning) to represent unsafe characters.

Preliminary knowledge: a URI is a Uniform Resource Identifier, and a URL is just one kind of URI. A typical URI has the format shown below. The URL encodings discussed in this article actually apply to URIs in general.


foo://example.com:8042/over/there?name=ferret#nose
\_/   \______________/\_________/ \_________/ \__/
 |            |            |           |        |
scheme    authority       path       query   fragment

Which characters need to be encoded

The RFC 3986 document states that only letters (a-z, A-Z), digits (0-9), the four special characters - _ . ~, and all reserved characters may appear unencoded in a URL. RFC 3986 gives detailed advice on URL encoding and decoding, indicating which characters must be encoded to avoid changing the semantics of the URL, and explaining why those characters need encoding.

Characters with no corresponding printable character in US-ASCII: only printable characters are allowed in a URL. Bytes 0x00-0x1F and 0x7F in US-ASCII represent control characters, none of which may appear directly in a URL. Bytes 0x80-0xFF (e.g. ISO-8859-1) also cannot be placed in a URL, because they are beyond the byte range defined by US-ASCII.

Reserved characters: a URL can be divided into several components: scheme, host, path, and so on. Some characters (:/?#[]@) are used to separate these components; for example, the colon separates the scheme from the host, / separates the host from the path, and ? separates the path from the query parameters. Other characters (!$&'()*+,;=) are used to separate data within a component; for example, = represents a key-value pair in the query, and & separates multiple key-value pairs. When ordinary data in a component contains these special characters, it needs to be encoded.

RFC 3986 specifies the following characters as reserved: ! * ' ( ) ; : @ & = + $ , / ? # [ ]

Unsafe characters: some other characters, when placed directly in a URL, can cause ambiguity for parsers. These characters are considered unsafe for various reasons:

  • Whitespace: while a URL is in transit, while a user is copying or typesetting it, or while a text processor is handling it, insignificant whitespace may be introduced or meaningful whitespace removed.
  • Quotes and angle brackets: quotation marks and angle brackets are often used to delimit URLs in plain text.
  • #: usually used to mark a bookmark or anchor.
  • %: the percent sign itself is used as the special character for encoding unsafe characters, so it must itself be encoded.
  • {}|\^[]`~: some gateways or transport agents tamper with these characters.
Note that for legal characters in a URL, the encoded and unencoded forms are equivalent; but for the characters listed above, leaving them unencoded may change the semantics of the URL. Therefore only ordinary English letters and digits, the special characters $-_.+!*'(), and the reserved characters (in their reserved roles) may appear unencoded in a URL; all other characters must be encoded before they appear in a URL.

However, for historical reasons, some non-standard encoding implementations still exist. For example, although RFC 3986 states that the tilde ~ requires no URL encoding, many old gateways or transport agents still encode it.

How to encode illegal characters in a URL

URL encoding is often called percent-encoding because its scheme is so simple: a percent sign % followed by two hexadecimal characters (0123456789ABCDEF) representing one byte. The default character set for URL encoding is US-ASCII. For example, the letter a corresponds to the byte 0x61 in US-ASCII, so its URL encoding is %61; entering http://g.cn/search?q=%61%62%63 in the address bar is actually equivalent to searching Google for abc. Similarly, the @ symbol corresponds to the byte 0x40 in ASCII, and its URL encoding is %40.
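The byte-to-escape step described above can be sketched in a few lines of JavaScript (the helper name percentEncodeByte is mine, not a built-in):

```javascript
// A minimal sketch of percent-encoding a single byte:
// a "%" sign followed by two uppercase hex digits.
function percentEncodeByte(byte) {
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

console.log(percentEncodeByte(0x61)); // "%61" (the letter a)
console.log(percentEncodeByte(0x40)); // "%40" (the @ sign)

// The built-in decoder treats %61%62%63 as the bytes of "abc":
console.log(decodeURIComponent("%61%62%63")); // "abc"
```

This also confirms the claim about the address bar: %61%62%63 decodes to exactly the string abc.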

For non-ASCII characters, a character set that is a superset of ASCII is used to encode them into bytes, and then each byte is percent-encoded. For Unicode characters, the RFC recommends encoding them with UTF-8 to obtain the bytes, then percent-encoding each byte. For example, the string "中文" yields the bytes 0xE4 0xB8 0xAD 0xE6 0x96 0x87 under UTF-8, which becomes "%E4%B8%AD%E6%96%87" after URL encoding.
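The two-step process (UTF-8 bytes first, then percent-encode each byte) can be sketched with the standard TextEncoder, and checked against the built-in encodeURIComponent:

```javascript
// Sketch: take the UTF-8 bytes of a string, then percent-encode
// every byte. TextEncoder always produces UTF-8.
function percentEncodeUtf8(str) {
  const bytes = new TextEncoder().encode(str);
  return Array.from(bytes)
    .map(b => "%" + b.toString(16).toUpperCase().padStart(2, "0"))
    .join("");
}

console.log(percentEncodeUtf8("中文"));    // "%E4%B8%AD%E6%96%87"
console.log(encodeURIComponent("中文"));   // same result, built in
```

The two outputs match because encodeURIComponent follows exactly this RFC-recommended procedure for non-ASCII characters.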

If a byte corresponds to an unreserved character in ASCII, it may be represented by that character directly rather than by a percent escape. For example, "Url编码" yields the bytes 0x55 0x72 0x6C 0xE7 0xBC 0x96 0xE7 0xA0 0x81 under UTF-8. Since the first three bytes correspond to the unreserved characters "Url" in ASCII, they can be kept as those characters, and the final encoding simplifies to "Url%E7%BC%96%E7%A0%81"; of course, the fully escaped "%55%72%6C%E7%BC%96%E7%A0%81" is equally valid.

For historical reasons, there are a number of Url coding implementations that do not fully follow this principle, as described below.

The difference between escape, encodeURI and encodeURIComponent in JavaScript

JavaScript provides three pairs of functions for encoding a URL into a legal form: escape/unescape, encodeURI/decodeURI and encodeURIComponent/decodeURIComponent. Since decoding is simply the reverse of encoding, only the encoding process is explained here.

These three encoding functions (escape, encodeURI and encodeURIComponent) all convert unsafe or illegal URL characters into a legal representation, but they differ in the following ways.

Different safe characters:

The safe characters for each function (that is, the characters the function does not encode) are listed below:

  • escape (69 characters): * / @ + - . _ plus 0-9 a-z A-Z
  • encodeURI (82 characters): ! # $ & ' ( ) * + , / : ; = ? @ - . _ ~ plus 0-9 a-z A-Z
  • encodeURIComponent (71 characters): ! ' ( ) * - . _ ~ plus 0-9 a-z A-Z
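The differences in the lists above are easy to observe directly; for instance, / is safe for escape and encodeURI but not for encodeURIComponent, while ~ is safe for the latter two but encoded by escape:

```javascript
// The same character run through all three functions illustrates
// their different safe-character sets.
console.log(escape("/"));             // "/"   (safe for escape)
console.log(encodeURI("/"));          // "/"   (safe for encodeURI)
console.log(encodeURIComponent("/")); // "%2F" (not safe here)

console.log(escape("~"));             // "%7E" (escape encodes the tilde)
console.log(encodeURI("~"));          // "~"
console.log(encodeURIComponent("~")); // "~"
```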
Different compatibility: the escape function has existed since JavaScript 1.0, while the other two functions were introduced in JavaScript 1.5. Since JavaScript 1.5 is now ubiquitous, there is effectively no compatibility problem with encodeURI and encodeURIComponent.

Different encoding of Unicode characters: the three functions encode ASCII characters identically, as a percent sign plus two hexadecimal characters. For Unicode characters, however, escape produces %uxxxx, where xxxx is the character's code point as four hexadecimal digits. This form has been deprecated by the W3C, although escape's syntax is still retained in the ECMA-262 standard. encodeURI and encodeURIComponent encode non-ASCII characters with UTF-8 and then percent-encode the bytes, which is what the RFC recommends; these two functions should therefore be preferred over escape wherever possible.
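The deprecated %uxxxx form versus the RFC-recommended UTF-8 form can be seen side by side (escape is still available as a legacy global in browsers and Node.js):

```javascript
// escape emits the code point as %uxxxx; the modern functions
// emit the percent-encoded UTF-8 bytes.
console.log(escape("中"));             // "%u4E2D"  (deprecated form)
console.log(encodeURI("中"));          // "%E4%B8%AD" (UTF-8 + percent)
console.log(encodeURIComponent("中")); // "%E4%B8%AD"
```

Note that "%u4E2D" is not valid percent-encoding at all, which is one reason servers cannot reliably decode escape's output.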

Different applicable situations: encodeURI is used to encode a complete URI, while encodeURIComponent is used to encode a single URI component. As the safe-character lists above show, encodeURIComponent encodes a larger range of characters than encodeURI. As mentioned earlier, reserved characters are generally used to separate URI components (a URI can be split into several components; see the preliminary-knowledge section) or subcomponents (such as the separators between query parameters); for example, the colon separates the scheme from the host, and ? separates the path from the query. Since encodeURI operates on a full URI, these characters already have a special purpose there, so encodeURI does not encode them; otherwise the meaning of the URI would change.

Components have their own data formats, but that data must not contain the reserved characters that separate components, or the overall structure of the URI becomes ambiguous. So encodeURIComponent, which handles a single component, has more characters to encode.
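The practical consequence: use encodeURIComponent when embedding a value into a query string, and encodeURI only on a URL that is already structurally complete:

```javascript
// A parameter value containing the delimiters & and = must be
// escaped with encodeURIComponent before being embedded:
const value = "a&b=c";
console.log(encodeURIComponent(value)); // "a%26b%3Dc"
const url = "http://example.com/path?q=" + encodeURIComponent(value);

// encodeURI on a whole URL escapes only what is illegal (the space)
// and leaves the structural delimiters : / ? = intact:
console.log(encodeURI("http://example.com/path?q=a b"));
// "http://example.com/path?q=a%20b"
```

Had encodeURI been applied to the value "a&b=c", the & and = would survive unescaped and the server would see two bogus parameters instead of one.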

Form submission

When an HTML form is submitted, each form field is URL-encoded before being sent. For historical reasons, the URL encoding used by forms does not follow the latest standard: for example, a space is encoded not as %20 but as a + sign. If the form is submitted with the POST method, the HTTP headers contain a Content-Type header with the value application/x-www-form-urlencoded. Most applications can handle this non-standard URL encoding, but client-side JavaScript has no built-in function that decodes the + sign back into a space, so you have to write your own conversion. Also, for non-ASCII characters, the character set used for encoding depends on the character set of the current document. For example, suppose we add this to the HTML head:

<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />

The browser will then render the document using gb2312 (note that if no such meta tag is set, the browser automatically chooses a character set based on the current user's preferences, and the user can also force a specific character set for the current site). When the form is submitted, its URL encoding uses gb2312.
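For the + sign problem mentioned above, a small wrapper around decodeURIComponent is enough; the helper name decodeFormComponent is mine, and the built-in URLSearchParams (available in modern browsers and Node.js) handles the + convention as well:

```javascript
// application/x-www-form-urlencoded uses "+" for spaces, which
// decodeURIComponent alone does not undo.
function decodeFormComponent(s) {
  return decodeURIComponent(s.replace(/\+/g, "%20"));
}

console.log(decodeURIComponent("a+b"));             // "a+b" (unchanged)
console.log(decodeFormComponent("a+b"));            // "a b"
console.log(new URLSearchParams("q=a+b").get("q")); // "a b"
```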

One confusing problem I ran into while using Aptana was that the result of encodeURI was quite different from what I expected. Here is my sample code:

 
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
<head> 
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" /> 
</head> 
<body> 
<script type="text/javascript"> 
document.write(encodeURI("中文")); 
</script> 
</body> 
</html> 

The output is %E6%B6%93%EE%85%9F%E6%9E%83. Obviously this is not the result of URL encoding with the UTF-8 character set (searching Google for "中文" shows %E4%B8%AD%E6%96%87 in the URL).

At first I suspected that encodeURI was affected by the page encoding, but I found that using gb2312 for the URL would not normally produce this result either. Eventually I discovered that the problem was a mismatch between the character set used to store the page file and the one specified in the meta tag. The Aptana editor uses the UTF-8 character set by default, which means the file is actually stored in UTF-8. But since the meta tag specifies gb2312, the browser parses the document as gb2312, and the string "中文" is garbled: its UTF-8 bytes are 0xE4 0xB8 0xAD 0xE6 0x96 0x87, and when the browser decodes these six bytes as gb2312 it gets three different characters ("涓", a private-use character, and "枃"; one Chinese character takes two bytes in GBK). Passing those three characters into encodeURI yields %E6%B6%93%EE%85%9F%E6%9E%83. So encodeURI always uses UTF-8 and is not affected by the page character set.
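The mechanism behind this bug (UTF-8 bytes misread under another charset, then re-encoded) can be simulated in Node.js; since Node's Buffer cannot decode gb2312, this sketch substitutes latin1 as the "wrong" single-byte charset, so the garbage characters differ but the principle is identical:

```javascript
// Store "中文" as UTF-8 bytes, then misinterpret those bytes
// under a different (single-byte) charset, as the browser did.
const utf8Bytes = Buffer.from("中文", "utf8"); // E4 B8 AD E6 96 87
const misread = utf8Bytes.toString("latin1");  // six garbage characters

// Re-encoding the garbage yields a double encoding, not the
// "%E4%B8%AD%E6%96%87" the author expected:
console.log(encodeURIComponent(misread));
// "%C3%A4%C2%B8%C2%AD%C3%A6%C2%96%C2%87"
```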

Different browsers behave differently when handling URLs that contain Chinese characters. For example, in IE, if the advanced setting "Always send URLs as UTF-8" is checked, the Chinese part of the path is URL-encoded with UTF-8 before being sent to the server, while the Chinese part of the query string is URL-encoded with the system default character set. For maximum interoperability, it is recommended to explicitly URL-encode every component placed in a URL with a specified character set, rather than relying on the browser's default behavior.

In addition, many HTTP monitoring tools and browser address bars automatically decode the URL once (using the UTF-8 character set) when displaying it. This is why, when you search for Chinese in Firefox, the address bar shows Chinese characters in the URL; the URL actually sent to the server is still encoded. You can see the real URL by reading location.href in JavaScript. Don't be fooled by these illusions when studying URL encoding and decoding.
