Python crawler framework Scrapy: installation and usage steps


1. Introduction to the crawler framework Scrapy
Scrapy is a fast, high-level screen scraping and web crawling framework for crawling websites and extracting structured data from their pages. It has a wide range of uses, from data mining to monitoring and automated testing. Scrapy is written entirely in Python and is fully open source, with its code hosted on GitHub; it runs on Linux, Windows, Mac, and BSD. It is built on the Twisted asynchronous networking library to handle network communication, so a user only needs to implement a few custom modules to build a crawler that scrapes page content as well as all kinds of images.

Scrapy installation guide

The installation steps assume that you have already installed: <1> Python 2.7, <2> lxml, <3> OpenSSL. We use Python's package management tool pip (or easy_install) to install Scrapy.
Installation with pip:

pip install Scrapy

Installation with easy_install:
easy_install Scrapy
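
Either command should put Scrapy on the import path. A minimal sketch to confirm the installation from the interpreter (assuming Scrapy was installed into the active Python environment):

# quick sanity check after `pip install Scrapy` or `easy_install Scrapy`
import scrapy
print(scrapy.version_info)  # e.g. (0, 22, 0); the exact value depends on the release installed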

Environment configuration on the Ubuntu platform

1. Python package management tool
The current package management tool chain is easy_install/pip + distribute/setuptools.
distutils: Python's built-in basic installation tool, suitable only for very simple scenarios;
setuptools: extends distutils in a number of ways, most notably by adding a package dependency mechanism;
distribute: because setuptools development was slow, it did not support Python 3, and its code was messy, a group of programmers forked it, refactored the code, and added features, hoping to replace setuptools and be accepted into the official standard library. In short, setuptools/distribute merely extend distutils;
easy_install: the installation script bundled with setuptools and distribute, so once either of them is installed, easy_install is available;
pip: aims squarely at replacing easy_install, which has many drawbacks: installation is not an atomic transaction, only SVN is supported, no uninstall command is provided, and installing a series of packages requires writing scripts. By solving these problems pip has become the new de facto standard, and virtualenv has become its good companion.

Installation process:
Install distribute    

$ curl -O http://python-distribute.org/distribute_setup.py  
$ python distribute_setup.py

Install pip:
$ curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py  
$ [sudo] python get-pip.py
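
A minimal sketch to confirm both tools are usable afterwards (the use of pkg_resources here is an illustration, not part of the original steps):

# check the installed versions of pip and setuptools/distribute
import pkg_resources
print(pkg_resources.get_distribution("pip").version)
print(pkg_resources.get_distribution("setuptools").version)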

2. Installing Scrapy
On the Windows platform, the dependencies can be installed either through a package management tool or by downloading the binary packages manually: pywin32, Twisted, zope.interface, lxml, pyOpenSSL. On Ubuntu 9.10 and later, the official recommendation is not to use the python-scrapy package that Ubuntu provides: it is too old and updated too slowly to match the latest Scrapy. The solution is to use the official Ubuntu Packages, which provide all the dependent libraries, are continuously updated with the latest bug fixes, offer higher stability, and are continuously built from the GitHub repository (master and stable branches). On Ubuntu 9.10 and later, Scrapy is installed as follows:
<1> Import the GPG key:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7

<2> Create the /etc/apt/sources.list.d/scrapy.list file:
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list

<3> Update the package list and install scrapy-VERSION, replacing VERSION with the actual version you want, e.g. scrapy-0.22:
sudo apt-get update && sudo apt-get install scrapy-VERSION

3. Installing Scrapy's dependent libraries
Installing the dependencies on Ubuntu 12.04: each ImportError below is fixed by installing the package that follows it.
ImportError: No module named w3lib.http

pip install w3lib

ImportError: No module named twisted
pip install twisted

ImportError: No module named lxml.html
pip install lxml

error: libxml/xmlversion.h: No such file or directory

apt-get install libxml2-dev libxslt-dev  
apt-get install python-lxml

ImportError: No module named cssselect

pip install cssselect  

ImportError: No module named OpenSSL
pip install pyOpenSSL  
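
The missing-module errors above can also be checked in one pass. A small sketch (module names taken from the errors listed above) that reports which dependencies still fail to import:

from __future__ import print_function  # for Python 2.7, which this article assumes

# try importing each dependency mentioned above and report what is missing
for mod in ("w3lib.http", "twisted", "lxml.html", "cssselect", "OpenSSL"):
    try:
        __import__(mod)
        print(mod, "OK")
    except ImportError as err:
        print(mod, "missing:", err)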

4. Developing a custom crawler
Switch to the target directory and create a new project:

scrapy startproject test 
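
startproject only generates the project skeleton (scrapy.cfg, settings.py, items.py, and an empty spiders/ package). Below is a minimal spider sketch to drop into the spiders/ directory; the file name, class name, and URL are placeholders for illustration, and the dict-yielding style assumes Scrapy 1.0 or later (older releases return Item objects instead):

# test/spiders/example_spider.py  (hypothetical file name)
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"                      # referenced as `scrapy crawl example`
    start_urls = ["http://example.com/"]  # pages the crawl starts from

    def parse(self, response):
        # pull one piece of structured data out of each downloaded page
        yield {"title": response.xpath("//title/text()").extract_first()}

Run it from inside the project directory with `scrapy crawl example -o titles.json` to write the scraped items to a JSON file.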

 

