Python: A detailed look at the Item Pipeline component in the Scrapy framework

  • 2020-06-19 10:55:19
  • OfStack

Item Pipeline overview

The main responsibility of the Item Pipeline is to process the Items extracted from web pages by spiders; its main tasks are cleaning, validating and storing data.
After a page has been parsed by a spider, the extracted Items are sent to the Item Pipeline and processed by its components in a specific order.
Each component of the Item Pipeline is a Python class that implements a simple method.
Each component receives an Item, runs its method on it, and decides whether the Item should continue to the next component in the pipeline or simply be dropped without further processing.

The tasks that the Item Pipeline typically performs are:

Clean up HTML data
Validate the parsed data (check that the Item contains the required fields)
Check for duplicate data (and drop duplicates)
Store the parsed data in a database

Write your own Item Pipeline

Writing your own item pipeline is actually quite easy.
Each component of the Item Pipeline is a Python class that implements a simple method:

process_item(self, item, spider)

This method is called for every Item by each pipeline component; it must either return an Item object or raise a DropItem exception.
A dropped Item is not processed by any further pipeline components.
In addition, the pipeline class can also implement the following methods:

open_spider(self, spider)

This method is called when the spider is opened.

close_spider(self, spider)

This method is called when the spider is closed.
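
To see where these three methods fit together, here is a minimal sketch of a pipeline that opens a log file when the spider starts and closes it when the spider finishes; the file name and the messages are illustrative assumptions, not part of the Scrapy API:


from scrapy.exceptions import DropItem

class SkeletonPipeline(object):

  def open_spider(self, spider):
    # Called once when the spider is opened; acquire resources here
    self.log_file = open('pipeline.log', 'w')

  def close_spider(self, spider):
    # Called once when the spider is closed; release resources here
    self.log_file.close()

  def process_item(self, item, spider):
    # Called for every item: return it to pass it on, or raise DropItem to discard it
    if not item:
      raise DropItem("Empty item")
    self.log_file.write("processed one item\n")
    return item
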
Item Pipeline example

The code is as follows:


from scrapy.exceptions import DropItem 
 
class PricePipeline(object): 
 
  vat_factor = 1.15 
 
  def process_item(self, item, spider): 
    if item['price']: 
      if item['price_excludes_vat']: 
        item['price'] = item['price'] * self.vat_factor 
      return item 
    else: 
      raise DropItem("Missing price in %s" % item) 

Note: VAT stands for Value Added Tax.

The code above drops products that have no price, and multiplies the price by the VAT factor for products whose price does not include VAT.

Save the scraped items to a file in JSON format

The items scraped by the spider are serialized to JSON and written to items.jl, one item per line.

Code:


import json 
 
class JsonWriterPipeline(object): 
 
  def __init__(self): 
    # Open the output file in text mode so JSON strings can be written directly
    self.file = open('items.jl', 'w') 
 
  def close_spider(self, spider): 
    # Close the file when the spider finishes
    self.file.close() 
 
  def process_item(self, item, spider): 
    line = json.dumps(dict(item)) + "\n" 
    self.file.write(line) 
    return item 

Note: the purpose of JsonWriterPipeline is only to show how to write an item pipeline. If you actually want to save the scraped items to a JSON file, the built-in Feed exports feature is recommended instead.
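
As a sketch of that alternative, newer Scrapy versions (2.1 and later, an assumption about your version) let you configure feed exports directly in the project settings, with no custom pipeline; the output file name is just an example:


# settings.py: export scraped items as JSON Lines via feed exports (Scrapy 2.1+)
FEEDS = {
  'items.jl': {'format': 'jsonlines'},
}
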

Delete duplicates

Assume the items extracted by the spider may contain duplicate ids; we can then filter them out in the process_item method.

For example:


from scrapy.exceptions import DropItem 
 
class DuplicatesPipeline(object): 
 
  def __init__(self): 
    self.ids_seen = set() 
 
  def process_item(self, item, spider): 
    if item['id'] in self.ids_seen: 
      raise DropItem("Duplicate item found: %s" % item) 
    else: 
      self.ids_seen.add(item['id']) 
      return item 
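
The task list at the top also mentions storing the parsed data in a database, which none of the examples above cover. Here is a minimal sketch using the standard-library sqlite3 module; the database file, table name and the 'id' and 'title' fields are assumptions made for illustration:


import sqlite3

class SQLitePipeline(object):

  def open_spider(self, spider):
    # Open (or create) the database and make sure the target table exists
    self.conn = sqlite3.connect('items.db')
    self.conn.execute(
      "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, title TEXT)")

  def close_spider(self, spider):
    # Persist everything and release the connection when the spider finishes
    self.conn.commit()
    self.conn.close()

  def process_item(self, item, spider):
    # Insert the item; a duplicate 'id' simply overwrites the earlier row
    self.conn.execute(
      "INSERT OR REPLACE INTO items (id, title) VALUES (?, ?)",
      (item['id'], item['title']))
    return item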

Activate an Item Pipeline component

To activate an Item Pipeline component, add its class path to the ITEM_PIPELINES setting in the project's settings.py file.

For example:


ITEM_PIPELINES = { 
  'myproject.pipelines.PricePipeline': 300, 
  'myproject.pipelines.JsonWriterPipeline': 800, 
} 

The integer values you assign to the classes in this setting determine the order in which they run: items pass through the pipelines from the lowest value to the highest.

These values are conventionally chosen in the 0-1000 range.
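
If a pipeline should only run for one particular spider, Scrapy also lets a spider override project settings through its custom_settings class attribute; a sketch, where the spider name, URL and module path are placeholders:


import scrapy

class MySpider(scrapy.Spider):
  name = 'myspider'
  start_urls = ['http://example.com']

  # Overrides the project-wide ITEM_PIPELINES for this spider only
  custom_settings = {
    'ITEM_PIPELINES': {
      'myproject.pipelines.DuplicatesPipeline': 100,
    },
  }

  def parse(self, response):
    # Yield items here; they will pass through DuplicatesPipeline only
    pass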

Conclusion

That is the end of this article on using the Item Pipeline component in Python's Scrapy framework. Those who are interested can also read the following articles on this site:

Python: using Scrapy to save console information to a text file

Python crawler example: crawling a site's funny jokes

Python crawler: code samples for getting all external links across a site

If there is any deficiency, please let me know. Thank you for your support!

