Python crawler example: scraping JD store information and downloading images

  • 2020-12-07 04:05:48
  • OfStack

This article shows how to use a Python crawler to scrape store information from JD (Jingdong) and download product images. Note that the URL used in the code below points to a Tmall (list.tmall.com) search results page. The details are shared here for reference:

The first script scrapes the listing information:


from bs4 import BeautifulSoup
import requests

url = 'https://list.tmall.com/search_product.htm?q=%CB%AE%BA%F8+%C9%D5%CB%AE&type=p&vmarket=&spm=875.7931836%2FA.a2227oh.d100&from=mallfp..pc_1_searchbutton'
response = requests.get(url)                  # Fetch the web page
soup = BeautifulSoup(response.text, 'lxml')   # response.text is the decoded HTML; parse it with the lxml parser
storenames = soup.select('#J_ItemList > div > div > p.productTitle > a')      # Select the name links
prices = soup.select('#J_ItemList > div > div > p.productPrice > em')         # Select the price elements
sales = soup.select('#J_ItemList > div > div > p.productStatus > span > em')  # Select the sales elements
for storename, price, sale in zip(storenames, prices, sales):
  # get_text() returns only the text inside the tag; the result contains
  # newline characters (\n), so strip() is used to remove them
  storename = storename.get_text().strip()
  price = price.get_text()
  sale = sale.get_text()
  print('Shop name: %-40s Price: %-40s Sales: %s' % (storename, price, sale))  # Print in aligned columns
  print('----------------------------------------------------------------------------------------------')
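The selector and get_text() steps above can be tried without a network connection. The following is a minimal sketch, not the author's code: SAMPLE_HTML is a hand-written, hypothetical fragment that only imitates the structure the selectors expect (the real page's markup may differ), extract_items is an illustrative helper name, and the built-in 'html.parser' is used instead of 'lxml' so that no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that imitates the structure the selectors expect.
# The markup of the real page may differ.
SAMPLE_HTML = """
<div id="J_ItemList">
  <div class="product">
    <div class="product-iWrap">
      <p class="productPrice"><em>59.00</em></p>
      <p class="productTitle">
        <a href="#">
          Electric kettle 1.7L
        </a>
      </p>
      <p class="productStatus"><span>Sold <em>1200</em></span></p>
    </div>
  </div>
  <div class="product">
    <div class="product-iWrap">
      <p class="productPrice"><em>89.00</em></p>
      <p class="productTitle">
        <a href="#">
          Glass kettle 1.5L
        </a>
      </p>
      <p class="productStatus"><span>Sold <em>350</em></span></p>
    </div>
  </div>
</div>
"""


def extract_items(html):
    """Return a list of (name, price, sales) tuples found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    names = soup.select('#J_ItemList > div > div > p.productTitle > a')
    prices = soup.select('#J_ItemList > div > div > p.productPrice > em')
    sales = soup.select('#J_ItemList > div > div > p.productStatus > span > em')
    # strip() removes the newlines and indentation around the link text
    return [(n.get_text().strip(), p.get_text(), s.get_text())
            for n, p, s in zip(names, prices, sales)]


items = extract_items(SAMPLE_HTML)
for name, price, sale in items:
    print('Shop name: %-25s Price: %-10s Sales: %s' % (name, price, sale))
```

Because zip() pairs the three lists by position, a listing that lacks a price or sales element would shift the later rows; selecting each product container first and reading its fields from there is one way to avoid that.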

The second script downloads the product images:


from bs4 import BeautifulSoup
import requests
import urllib.request

url = 'https://list.tmall.com/search_product.htm?q=%CB%AE%BA%F8+%C9%D5%CB%AE&type=p&vmarket=&spm=875.7931836%2FA.a2227oh.d100&from=mallfp..pc_1_searchbutton'
response = requests.get(url)                 # Fetch the web page
soup = BeautifulSoup(response.text, 'lxml')  # Parse the HTML
imgs = soup.select('#J_ItemList > div > div > div.productImg-wrap > a > img')  # Select the product images
a = 1  # Counter used to name the files 1.jpg, 2.jpg, ...
for i in imgs:
  src = i.get('src')
  if src is None:  # Stop at the first image that has no src attribute
    break
  # The code assumes src has no scheme (for example, it starts with //),
  # so 'http:' is prepended to build a complete URL
  img = 'http:' + src
  # print(img)
  urllib.request.urlretrieve(img, '%s.jpg' % a)  # Save the image in the current directory
  a = a + 1
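Prepending 'http:' by hand only works when every src value is scheme-less. As a hedged variation on the loop above (not the author's method), urllib.parse.urljoin can resolve scheme-less, relative, and already absolute values against the page URL. In this sketch, image_targets, PAGE_URL, and the example.com addresses are hypothetical names and data chosen for illustration. Unlike the original loop, it skips images without a src instead of stopping, and it only computes the download targets, so it runs without network access.

```python
from urllib.parse import urljoin

# Page the image tags were taken from (query string omitted for brevity)
PAGE_URL = 'https://list.tmall.com/search_product.htm'


def image_targets(srcs, page_url=PAGE_URL):
    """Return (absolute_url, filename) pairs for every usable src value."""
    targets = []
    for src in srcs:
        if not src:  # skip tags whose src is missing or empty
            continue
        # urljoin copies the scheme from page_url when src starts with //
        absolute = urljoin(page_url, src)
        filename = '%d.jpg' % (len(targets) + 1)
        targets.append((absolute, filename))
    return targets


# Hypothetical src values: scheme-less, missing, and already absolute
sample_srcs = ['//img.example.com/a.jpg', None, 'https://img.example.com/b.jpg']
targets = image_targets(sample_srcs)
for url, filename in targets:
    print(filename, url)
```

Each pair could then be passed to urllib.request.urlretrieve(url, filename), as in the script above.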

PS:

1. Use CSS selectors to select the information.

2. Use the get_text() method to extract the text content of a tag.

3. Usage of strip, lstrip and rstrip:

In Python, strip removes characters from both ends of a string. Likewise, lstrip removes characters from the left end only, and rstrip removes characters from the right end only.

Each of the three methods accepts an optional argument that specifies the characters to remove.

Note that the argument is treated as a set of individual characters, not as a prefix or suffix: characters are removed from the end (or ends) of the string until a character is reached that is not in the set. For example:


theString = 'saaaay yes no yaaaass'
print(theString.strip('say'))

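Characters are removed from the head and tail of theString for as long as they belong to the set ['s', 'a', 'y']. Removal stops at the first character that is not in the set, which here is the space on each side, so the output keeps one leading and one trailing space:

 yes no 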

lstrip and rstrip work on the same principle, but on one end only.

Note: When no argument is passed in, leading and trailing whitespace (spaces, tabs and newline characters) is removed by default.


theString = 'saaaay yes no yaaaass'
print(theString.strip('say'))
print(theString.strip('say '))  # Note the space after 'say'
print(theString.lstrip('say'))
print(theString.rstrip('say'))

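Output:

 yes no 
es no
 yes no yaaaass
saaaay yes no 

The first, third and fourth lines keep the space next to the removed characters, because a space is not in the set 'say'. In the second line the set also contains a space, so the leading 'y' of 'yes' is removed as well.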
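Since print() does not make leading and trailing spaces visible, the results above are easier to verify with repr(). The short sketch below repeats the same calls and adds one call without an argument; the variable names and the padded sample string are chosen here for illustration.

```python
the_string = 'saaaay yes no yaaaass'

results = {
    "strip('say')": the_string.strip('say'),
    "strip('say ')": the_string.strip('say '),
    "lstrip('say')": the_string.lstrip('say'),
    "rstrip('say')": the_string.rstrip('say'),
    # With no argument, surrounding whitespace is removed
    "strip()": '  \n yes no \t\n'.strip(),
}

# %r prints the repr of each value, so the quotes show where spaces remain
for label, value in results.items():
    print('%-16s %r' % (label, value))
```

The quotes printed around each result mark exactly where the string starts and ends.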


I hope this article is helpful for your Python programming.

