Implementing a weather forecast collector in Python: a simple web crawler

  • 2020-04-02 09:36:09
  • OfStack

The crawler consists of just two steps: fetching the web page text and filtering out the data.
1. Get the HTML text.
Python makes fetching HTML easy; a few lines of code do everything we need.
 
import urllib.request  # in Python 3, urlopen lives in urllib.request

def getHtml(url): 
    # Fetch the page and return its HTML as text
    page = urllib.request.urlopen(url) 
    html = page.read().decode('utf-8') 
    page.close() 
    return html 

Even without comments, these few lines should make the intent clear.
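As a self-contained check, the same function can be exercised with a `data:` URL (which `urllib.request` has supported since Python 3.4), so the sketch runs without any network access; the embedded content below is made up purely for illustration:

```python
import urllib.request

def getHtml(url):
    # Fetch the page and return its HTML as text
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8')
    page.close()
    return html

# A data: URL embeds the "page" directly, so no network is needed
html = getHtml('data:text/html,<b>hello</b>')
print(html)  # <b>hello</b>
```

In real use you would pass the weather site's URL instead of a `data:` URL.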

2. Get the required content according to regular expressions, etc.

When using regular expressions, you need to carefully observe the structure of the page information and write the correct regular expression.
Python regular expressions are also simple to use. My previous article, "(link: #)," introduced some basic regular-expression usage. Here is one new point:
 
import re

def getWeather(html): 
    # Capture (city, low temperature, high temperature) tuples from the page
    reg = '<a title=.*?>(.*?)</a>.*?<span>(.*?)</span>.*?<b>(.*?)</b>' 
    weatherList = re.compile(reg).findall(html) 
    return weatherList 

Here reg is the regular expression and html is the text obtained in the first step. findall returns every substring of the HTML that matches the pattern and collects the results into weatherList, which can then be enumerated for output.
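For example, on a made-up HTML fragment with the same structure as the page described above (the city names and temperatures here are invented for illustration), findall yields one tuple per city:

```python
import re

reg = '<a title=.*?>(.*?)</a>.*?<span>(.*?)</span>.*?<b>(.*?)</b>'

# Invented fragment mimicking the page structure described in the text
html = ('<a title="Beijing weather">Beijing</a><span>-2</span><b>8</b>'
        '<a title="Shanghai weather">Shanghai</a><span>5</span><b>13</b>')

weatherList = re.compile(reg).findall(html)
print(weatherList)
# [('Beijing', '-2', '8'), ('Shanghai', '5', '13')]
```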
There are two things to note about the regular expression reg here.
One is "(.*?)". Everything inside the parentheses is a capture group, and when there are multiple groups, each result returned by findall is a tuple containing all of them. There are three groups above, corresponding to the city, the lowest temperature, and the highest temperature.
The other is ".*?". Python regular matching is greedy by default, meaning it matches as long a string as possible. Adding a question mark makes it non-greedy, matching as short a string as possible. Since there are multiple cities whose information must be matched here, the non-greedy form is required; otherwise everything would collapse into a single, incorrect match.
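The difference can be seen on a minimal made-up string:

```python
import re

s = '<b>one</b><b>two</b>'

greedy = re.findall('<b>(.*)</b>', s)   # matches as much as possible
lazy = re.findall('<b>(.*?)</b>', s)    # matches as little as possible

print(greedy)  # ['one</b><b>two'] - one bloated match spanning both tags
print(lazy)    # ['one', 'two']    - one match per tag pair
```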

Python is really handy :)
