Python: Detailed usage of the Item Pipeline component in the Scrapy framework
- 2020-06-19 10:55:19
- OfStack
Item Pipeline overview
The Item Pipeline's main responsibility is to process the Items extracted from web pages by spiders; its typical tasks are cleaning, validating, and storing data.
Once a page has been parsed by a spider, the resulting Items are sent to the Item Pipeline and processed by several components in sequence.
Each component of the Item Pipeline is a Python class implementing one simple method.
Each component receives an Item, runs its method on it, and then decides whether the Item should continue to the next component in the pipeline or simply be dropped without further processing.
Typical uses of the Item Pipeline are:
Cleaning up HTML data
Validating the parsed data (checking that the Item contains the required fields)
Checking for duplicate data (and dropping duplicates)
Storing the parsed data in a database
Write your own Item Pipeline
Writing an item pipeline is actually quite easy.
Each component of the Item Pipeline is a Python class that implements one simple method:
process_item(item, spider)
This method is called by every item pipeline component; it must either return an Item object or raise a DropItem exception.
A dropped Item is not processed by any further pipeline components.
In addition, we can implement the following methods in the class:
open_spider(spider)
This method is called when the spider is opened
close_spider(spider)
This method is called when the spider is closed
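Putting these hooks together, a minimal pipeline skeleton might look like the sketch below. The class name and counter are illustrative, and the try/except import is only there so the sketch runs even without Scrapy installed:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

class SkeletonPipeline:
    """A hypothetical pipeline showing where each hook fits."""

    def open_spider(self, spider):
        # Called once when the spider is opened: acquire resources here.
        self.count = 0

    def process_item(self, item, spider):
        # Called for every item: return it, or raise DropItem to discard it.
        if not item:
            raise DropItem("Empty item")
        self.count += 1
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed: release resources here.
        print("Processed %d items" % self.count)
```

Scrapy calls these hooks itself during a crawl; the skeleton only shows their roles.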
Item Pipeline example
The code is as follows:
from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] = item['price'] * self.vat_factor
            return item
        else:
            raise DropItem("Missing price in %s" % item)
Note: VAT stands for Value Added Tax.
The code above drops products that have no price, and for products whose price does not include VAT it multiplies the price by the VAT factor.
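To see this filtering logic in isolation, the pipeline can be exercised by hand with plain dicts standing in for Items (a sketch: dict.get is used here so a missing key does not raise KeyError, the try/except import lets it run without Scrapy installed, and spider is passed as None since it is unused):

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

class PricePipeline:
    vat_factor = 1.15

    def process_item(self, item, spider):
        if item.get('price'):
            # Add VAT to prices that were scraped without it.
            if item.get('price_excludes_vat'):
                item['price'] = item['price'] * self.vat_factor
            return item
        # No price at all: discard the item.
        raise DropItem("Missing price in %s" % item)

pipeline = PricePipeline()

# A price that excludes VAT is scaled by the VAT factor.
item = pipeline.process_item({'price': 100, 'price_excludes_vat': True}, spider=None)
print(item['price'])  # about 115.0 (100 * 1.15)

# An item without a price is dropped.
try:
    pipeline.process_item({'name': 'no price'}, spider=None)
except DropItem as exc:
    print('dropped:', exc)
```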
Save the scraped items to a file in JSON format
The items scraped by the spider are serialized to JSON and written to items.jl, one item per line.
Code:
import json

class JsonWriterPipeline(object):
    def __init__(self):
        # Open in text mode: json.dumps returns a str, not bytes.
        self.file = open('items.jl', 'w')

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    def close_spider(self, spider):
        # Close the output file when the spider finishes.
        self.file.close()
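A quick manual run shows the one-JSON-object-per-line output. In this sketch a path parameter is added so the file can go to a temporary location; Scrapy itself instantiates pipelines without arguments:

```python
import json
import os
import tempfile

class JsonWriterPipeline:
    def __init__(self, path='items.jl'):
        # The path parameter is an addition for this sketch.
        self.file = open(path, 'w')

    def process_item(self, item, spider):
        # Serialize each item as one JSON object per line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item

    def close_spider(self, spider):
        self.file.close()

path = os.path.join(tempfile.mkdtemp(), 'items.jl')
pipeline = JsonWriterPipeline(path)
for item in ({'name': 'a', 'price': 1}, {'name': 'b', 'price': 2}):
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path) as f:
    lines = f.read().splitlines()
print(lines)  # two lines, each a JSON object
```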
Note: the purpose of JsonWriterPipeline is only to show how to write an item pipeline. If you want to save the scraped items to a JSON file, the built-in Feed exports feature is recommended.
Delete duplicates
Assuming that the items extracted by the spider may have duplicate ids, we can filter them in the process_item method, for example:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
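Driving this pipeline by hand with a few dicts shows the deduplication in action (the try/except import is only there so the sketch runs without Scrapy installed):

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch runs without Scrapy installed
    class DropItem(Exception):
        pass

class DuplicatesPipeline:
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # Drop any item whose id has already been seen.
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item

pipeline = DuplicatesPipeline()
kept = []
for item in ({'id': 1}, {'id': 2}, {'id': 1}):
    try:
        kept.append(pipeline.process_item(item, spider=None))
    except DropItem:
        pass

print([i['id'] for i in kept])  # [1, 2] -- the second item with id 1 is dropped
```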
Activate the ItemPipeline component
To activate an item pipeline component, add its class path to the ITEM_PIPELINES setting in the project's settings.py file, for example:
ITEM_PIPELINES = {
    'myproject.pipeline.PricePipeline': 300,
    'myproject.pipeline.JsonWriterPipeline': 800,
}
The integer values you assign to the classes in this setting determine the order in which they run: items pass through the pipelines from the class with the lowest value to the one with the highest.
The values are conventionally set between 0 and 1000.
Conclusion
That is the end of this article on the use of the Item Pipeline component in the Scrapy framework for Python. Those who are interested can continue with these related articles on this site:
Using Scrapy in Python to save console output to a text file
A Python crawler example that scrapes a site's funny jokes
Python crawler code samples for collecting all of a site's external links
If there is any deficiency, please let me know. Thank you for your support!