Five ways to handle HTML escape characters with Python, explained

  • 2020-06-19 10:56:45
  • OfStack

Writing a crawler is a process of sending requests, extracting data, cleaning data, and storing data. Along the way, different data sources return data in different formats, including JSON and XML documents, but mostly HTML documents. HTML is often mixed with escape sequences that we need to unescape into the real characters.

What is an escape character

In HTML, characters such as <, >, and & have special meanings (< and > delimit tags, and & starts an escape sequence), so they cannot be used directly in HTML code. To display these symbols on a web page, you need to use an HTML escape sequence. For example, the escape sequence for < is &lt;. When the browser renders the HTML page, it automatically replaces the escape sequence with the real character.

An escape sequence consists of three parts: the first is an & symbol, the second is the entity name, and the third is a semicolon. For example, to display a less-than sign (<), you can write &lt;.
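The three-part structure is easy to see by going in the opposite direction and escaping some text. The following is a minimal sketch using the standard library's html.escape; the sample string is made up for illustration.

```python
import html

# A made-up sample containing all of the special characters discussed above.
raw = '<a href="x">Tom & Jerry</a>'

# html.escape replaces &, <, > (and, by default, quotes) with escape sequences.
escaped = html.escape(raw)
print(escaped)

# Unescaping the result gives back the original text.
restored = html.unescape(escaped)
print(restored == raw)
```

Each replacement in the printed string starts with &, continues with an entity name such as lt or amp, and ends with a semicolon.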

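Character   Description             Escape sequence
(space)     space (non-breaking)    &nbsp;
<           less than               &lt;
>           greater than            &gt;
&           ampersand               &amp;
"           double quote            &quot;
©           copyright               &copy;
®           registered trademark    &reg;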
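As a quick check of the table, the sketch below decodes each of the listed entity names with html.unescape (Python 3.4+). The list of entities is simply copied from the table; nothing else is assumed.

```python
import html

# Entity names taken from the table above.
sequences = ["&nbsp;", "&lt;", "&gt;", "&amp;", "&quot;", "&copy;", "&reg;"]

# Map each escape sequence to the character it stands for.
decoded = {seq: html.unescape(seq) for seq in sequences}

for seq, char in decoded.items():
    # repr() makes the non-breaking space visible as '\xa0'.
    print(seq, "->", repr(char))
```

Note that &nbsp; decodes to the non-breaking space U+00A0, not to an ordinary ASCII space, which matters when cleaning scraped text.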

Unescaping strings in Python

There are several ways to unescape strings in Python, and they differ between Python 2 and Python 3. In Python 2, the module for unescaping strings is HTMLParser.


# python2
>>> from HTMLParser import HTMLParser
>>> HTMLParser().unescape('a=1&amp;b=2')
'a=1&b=2'

In Python 3, the HTMLParser module was moved to html.parser.


# python3 (HTMLParser.unescape was removed in 3.9)
>>> from html.parser import HTMLParser
>>> HTMLParser().unescape('a=1&amp;b=2')
'a=1&b=2'

The unescape function was added to the html module in Python 3.4.


# python3.4+
>>> import html
>>> html.unescape('a=1&amp;b=2')
'a=1&b=2'

The last one is recommended: HTMLParser.unescape has been deprecated since Python 3.4 and was removed completely in Python 3.9.
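If one script has to run on several interpreter versions, the three variants above can be folded into a single import fallback. This is only a sketch of one possible arrangement, not something the standard library prescribes; the module-level name unescape is chosen here for illustration.

```python
try:
    # Python 3.4+: the recommended function.
    from html import unescape
except ImportError:
    try:
        # Python 3.0 to 3.3: the method lives on html.parser.HTMLParser.
        from html.parser import HTMLParser
    except ImportError:
        # Python 2: the class lives in the top-level HTMLParser module.
        from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

result = unescape('a=1&amp;b=2')
print(result)
```

The rest of the crawler can then call unescape() without caring which branch supplied it.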

In addition, the xml.sax.saxutils module also has a function that supports unescaping:


>>> from xml.sax.saxutils import unescape
>>> unescape('a=1&amp;b=2')
'a=1&b=2'
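One caveat worth knowing: xml.sax.saxutils.unescape is aimed at XML, so by default it only converts &amp;, &lt;, and &gt;. Other sequences must be supplied through its optional entities dictionary. The sketch below illustrates this; the sample string is made up.

```python
from xml.sax.saxutils import unescape

text = '&quot;a&quot; &lt; b'

# By default only &amp;, &lt; and &gt; are converted; &quot; is left alone.
default_result = unescape(text)
print(default_result)

# Extra sequences can be passed as a mapping from sequence to replacement.
extended_result = unescape(text, {'&quot;': '"'})
print(extended_result)
```

For general HTML, where entities such as &copy; or &nbsp; are common, html.unescape is usually the more convenient choice.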

Of course, you can also implement unescaping yourself; it is not complicated. Still, we prefer not to reinvent the wheel.
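For the curious, a hand-rolled version might look like the sketch below. It is an illustration only: the function name simple_unescape and the regular expression are assumptions made for this example, and it covers just the HTML 4 named entities in html.entities.name2codepoint plus decimal and hexadecimal character references, without the many edge cases html.unescape handles.

```python
import re
from html.entities import name2codepoint

# Matches &name;, &#123; and &#x1F; style sequences (semicolon required).
_ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def simple_unescape(text):
    """Replace recognised escape sequences in text with real characters."""

    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))  # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))      # decimal reference
        except (ValueError, OverflowError):
            return match.group(0)              # out-of-range code point
        codepoint = name2codepoint.get(body)
        if codepoint is None:
            return match.group(0)              # unknown name: leave as-is
        return chr(codepoint)

    return _ENTITY_RE.sub(replace, text)


print(simple_unescape('a=1&amp;b=2'))
print(simple_unescape('&lt;p&gt; &#169; &#xAE; &unknown;'))
```

Because the substitution is done in a single pass, text such as &amp;lt; becomes &lt; rather than being unescaped twice.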

