Python crawler example: scraping JD store information and downloading pictures
- 2020-12-07 04:05:48
- OfStack
This example shows how to use a Python crawler to capture store information from JD (Jingdong) and download pictures. Note that the sample URL used in the code points to a Tmall search results page. It is shared here for your reference; the details are as follows:
The first script grabs the information:
from bs4 import BeautifulSoup
import requests
url = 'https://list.tmall.com/search_product.htm?q=%CB%AE%BA%F8+%C9%D5%CB%AE&type=p&vmarket=&spm=875.7931836%2FA.a2227oh.d100&from=mallfp..pc_1_searchbutton'
response = requests.get(url)  # Fetch the web page
soup = BeautifulSoup(response.text, 'lxml')  # response.text is the decoded HTML; parse it with the lxml parser
storenames = soup.select('#J_ItemList > div > div > p.productTitle > a')  # Select the store information (the product title links)
prices = soup.select('#J_ItemList > div > div > p.productPrice > em')  # Select the price information
sales = soup.select('#J_ItemList > div > div > p.productStatus > span > em')  # Select the sales information
for storename, price, sale in zip(storenames, prices, sales):
    # get_text() extracts the text inside the tag; the result contains newline characters (\n), so strip() removes them
    storename = storename.get_text().strip()
    price = price.get_text()
    sale = sale.get_text()
    print('Shop name: %-40s Price: %-40s Sales: %s' % (storename, price, sale))  # Format the printed information
    print('----------------------------------------------------------------------------------------------')
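The selection logic above can be tried offline. The following is a minimal sketch, assuming a small hand-written HTML fragment (SAMPLE_HTML, hypothetical and not real Tmall markup) that imitates the structure the selectors expect. The helper name extract_items is also an assumption, and the built-in html.parser is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure targeted by the selectors
# in the script above; the real page may differ or change over time.
SAMPLE_HTML = """
<div id="J_ItemList">
  <div class="product"><div class="product-iWrap">
    <p class="productPrice"><em>59.00</em></p>
    <p class="productTitle">
      <a href="#">
        Electric kettle A
      </a>
    </p>
    <p class="productStatus"><span>Sold <em>120</em></span></p>
  </div></div>
  <div class="product"><div class="product-iWrap">
    <p class="productPrice"><em>129.00</em></p>
    <p class="productTitle">
      <a href="#">
        Electric kettle B
      </a>
    </p>
    <p class="productStatus"><span>Sold <em>45</em></span></p>
  </div></div>
</div>
"""


def extract_items(html):
    """Return a list of (title, price, sales) tuples found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    titles = soup.select('#J_ItemList > div > div > p.productTitle > a')
    prices = soup.select('#J_ItemList > div > div > p.productPrice > em')
    sales = soup.select('#J_ItemList > div > div > p.productStatus > span > em')
    # get_text() returns the text inside each tag; strip() drops the
    # surrounding newlines and indentation from the title.
    return [
        (title.get_text().strip(), price.get_text(), sale.get_text())
        for title, price, sale in zip(titles, prices, sales)
    ]


for title, price, sale in extract_items(SAMPLE_HTML):
    print('Shop name: %-40s Price: %-40s Sales: %s' % (title, price, sale))
```

Because zip pairs the three lists by position, a product that lacks one of the fields would shift the columns for every product after it; selecting the fields inside each product container is one way to avoid that.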
The second script downloads the pictures:
from bs4 import BeautifulSoup
import requests
import urllib.request
url = 'https://list.tmall.com/search_product.htm?q=%CB%AE%BA%F8+%C9%D5%CB%AE&type=p&vmarket=&spm=875.7931836%2FA.a2227oh.d100&from=mallfp..pc_1_searchbutton'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
imgs = soup.select('#J_ItemList > div > div > div.productImg-wrap > a > img')
a = 1  # Counter used to name the downloaded files
for i in imgs:
    if i.get('src') is None:  # Stop at the first image without a src attribute
        break
    img = 'http:' + i.get('src')  # The src value on the original site lacks the scheme, so 'http:' has to be prepended
    # print(img)
    urllib.request.urlretrieve(img, '%s.jpg' % a)  # Save as 1.jpg, 2.jpg, ...
    a = a + 1
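The loop above stops at the first image without a src attribute and assumes that every src value lacks its scheme. As a minimal sketch of a more forgiving variant, the helper below (absolute_image_urls, a hypothetical name) skips missing values and lets urllib.parse.urljoin resolve each src against the page URL. download_all (also hypothetical) then saves the files under numbered names. The example.com addresses are placeholders, not real image locations.

```python
import urllib.request
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve src values against page_url, skipping missing or empty ones."""
    return [urljoin(page_url, src) for src in srcs if src]


def download_all(urls, start=1):
    """Save each URL as 1.jpg, 2.jpg, ... in the current directory (needs network access)."""
    for number, url in enumerate(urls, start):
        urllib.request.urlretrieve(url, '%s.jpg' % number)


# Placeholder values for illustration only.
page = 'https://list.example.com/search_product.htm'
srcs = ['//img.example.com/a.jpg', None, 'https://img.example.com/b.jpg', '/static/c.jpg']
print(absolute_image_urls(page, srcs))
```

With the script above, the list of src values could be built as [i.get('src') for i in imgs] and the result passed to download_all.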
Notes:
1. Use CSS selectors when selecting information.
2. Use the get_text() method to extract the text inside a tag.
3. Usage of strip, lstrip, and rstrip:
In Python, strip removes characters from both ends of a string; in the same way, lstrip removes characters from the left end and rstrip removes characters from the right end.
Each of the three methods accepts an optional argument that specifies the characters to be removed.
Note that the argument is treated as a set of characters, not as a substring: the method keeps removing characters from the relevant end or ends until it reaches a character that is not in the set. For example:
theString = 'saaaay yes no yaaaass'
print(theString.strip('say'))
The characters in the set ['s', 'a', 'y'] are removed from the head and tail of theString until a character outside the set is reached. The space is not in the set, so stripping stops there and the result is ' yes no ', with one space kept on each side.
lstrip and rstrip work on the same principle.
Note: when no argument is passed in, leading and trailing whitespace (spaces, tabs, and newline characters) is removed by default.
theString = 'saaaay yes no yaaaass'
print(theString.strip('say'))
print(theString.strip('say '))  # note the space after 'say'
print(theString.lstrip('say'))
print(theString.rstrip('say'))
Output (shown in quotes here so that the remaining spaces are visible):
' yes no '
'es no'
' yes no yaaaass'
'saaaay yes no '
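Because the argument is a set of characters rather than a prefix or suffix, strip and its variants can remove more than intended. The following is a minimal sketch of that pitfall and of the no-argument default, using repr so that leftover whitespace is visible. The file name and the padded string are invented examples, and str.removesuffix requires Python 3.9 or later.

```python
name = 'report.txt'

# rstrip treats '.txt' as the set {'.', 't', 'x'}, so the final 't' of
# 'report' is removed as well.
stripped = name.rstrip('.txt')

# removesuffix (Python 3.9+) removes the exact suffix only.
exact = name.removesuffix('.txt')

print(repr(stripped))
print(repr(exact))

# With no argument, whitespace (spaces, tabs, newlines) is removed.
padded = '\n  kettle \t\n'
print(repr(padded.strip()))
print(repr(padded.lstrip()))
print(repr(padded.rstrip()))
```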
I hope this article is helpful for your Python programming.