Case Tutorial of Scrapy Environment Construction of Python Crawler

  • 2021-11-14 06:36:37
  • OfStack

Construction of Scrapy Environment of Python Crawler

How to Build Scrapy Environment

First, install the Python environment. For the construction of Python environment, see: https://blog.csdn.net/alice_tl/article/details/76793590

Next, install Scrapy

1. Install Scrapy and use pip install Scrapy at the terminal (note that it is best to use foreign environment)

The progress tips are as follows:


alicedeMacBook-Pro:~ alice$ pip install Scrapy
Collecting Scrapy
  Using cached https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl
Collecting w3lib>=1.17.0 (from Scrapy)
  Using cached https://files.pythonhosted.org/packages/37/94/40c93ad0cadac0f8cb729e1668823c71532fd4a7361b141aec535acb68e3/w3lib-1.19.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy)
 xxxxxxxxxxxxxxxxxxxxx
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/dist.py", line 380, in fetch_build_egg
        return cmd.easy_install(req)
      File "/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/setuptools/command/easy_install.py", line 632, in easy_install
        raise DistutilsError(msg)
    distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('incremental>=16.10.1')
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/

There is an error prompt for missing Twisted:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-U_6VZF/Twisted/

2. Install Twiseted. Enter sudo pip install twisted==13. 1.0 in the terminal


alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
  Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
    100% | In fact, in fact, the | 2.7MB 398kB/s 
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
  Running setup.py install for twisted ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.13-intel-2.7
    creating build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/_version.py -> build/li

3. Use sudo pip install scrapy to install again, and find that there is still an error prompt. This time, the error prompt is that lxml is not installed:

Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )

No matching distribution found for lxml (from Scrapy)


alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
    100% | In fact, in fact, the | 256kB 210kB/s 
Collecting w3lib>=1.17.0 (from Scrapy)
  xxxxxxxxxxxx
  Downloading https://files.pythonhosted.org/packages/90/50/4c315ce5d119f67189d1819629cae7908ca0b0a6c572980df5cc6942bc22/Twisted-18.7.0.tar.bz2 (3.1MB)
    100% | In fact, in fact, the | 3.1MB 59kB/s 
Collecting lxml (from Scrapy)
  Could not find a version that satisfies the requirement lxml (from Scrapy) (from versions: )
No matching distribution found for lxml (from Scrapy)

4. Install lxml and use: sudo pip install lxml


alicedeMacBook-Pro:~ alice$ sudo pip install lxml
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting lxml
  Downloading https://files.pythonhosted.org/packages/a1/2c/6b324d1447640eb1dd240e366610f092da98270c057aeb78aa596cda4dab/lxml-4.2.4-cp27-cp27m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (8.7MB)
    100% | In fact, in fact, the | 8.7MB 187kB/s 
Installing collected packages: lxml
Successfully installed lxml-4.2.4

5. Install scrapy again, using sudo pip install scrapy, and the installation is successful


alicedeMacBook-Pro:~ alice$ sudo pip install Scrapy
The directory '/Users/alice/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/alice/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting Scrapy
  Downloading https://files.pythonhosted.org/packages/5d/12/a6197eaf97385e96fd8ec56627749a6229a9b3178ad73866a0b1fb377379/Scrapy-1.5.1-py2.py3-none-any.whl (249kB)
    100% | In fact, in fact, the | 256kB 11.5MB/s 
Collecting w3lib>=1.17.0 (from Scrapy)
  xxxxxxxxx
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy) (4.2.4)
Collecting functools32; python_version < "3.0" (from parsel>=1.1->Scrapy)
  Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ReadTimeoutError("HTTPSConnectionPool(host='pypi.org', port=443): Read timed out. (read timeout=15)",)': /simple/functools32/
  Downloading https://files.pythonhosted.org/packages/4b/2a/0276479a4b3caeb8a8c1af2f8e4355746a97fab05a372e4a2c6a6b876165/idna-2.7-py2.py3-none-any.whl (58kB)
    100% | In fact, in fact, the | 61kB 66kB/s 
Installing collected packages: w3lib, cssselect, functools32, parsel, queuelib, PyDispatcher, attrs, pyasn1-modules, service-identity, zope.interface, constantly, incremental, Automat, idna, hyperlink, PyHamcrest, Twisted, Scrapy
  Running setup.py install for functools32 ... done
  Running setup.py install for PyDispatcher ... done
  Found existing installation: zope.interface 4.1.1
    Uninstalling zope.interface-4.1.1:
      Successfully uninstalled zope.interface-4.1.1
  Running setup.py install for zope.interface ... done
  Running setup.py install for Twisted ... done
Successfully installed Automat-0.7.0 PyDispatcher-2.0.5 PyHamcrest-1.9.0 Scrapy-1.5.1 Twisted-18.7.0 attrs-18.1.0 constantly-15.1.0 cssselect-1.0.3 functools32-3.2.3.post2 hyperlink-18.0.0 idna-2.7 incremental-17.5.0 parsel-1.5.0 pyasn1-modules-0.2.2 queuelib-1.5.0 service-identity-17.0.0 w3lib-1.19.0 zope.interface-4.5.0

6. Check whether scrapy is installed successfully, and enter scrapy-version

The version information of scrapy appears, such as Scrapy 1.5. 1-no active project.


alicedeMacBook-Pro:~ alice$ scrapy --version
Scrapy 1.5.1 - no active project
 
Usage:
  scrapy <command> [options] [args]
 
Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy
 
  [ more ]      More commands available when run from project directory
 
Use "scrapy <command> -h" to see more info about a command

PS: If you do not have normal access to the org network and install with sudo administrator privileges, a similar error will appear

Exception:

Traceback (most recent call last):

  File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main

    status = self.run(options, args)

  File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run

    resolver.resolve(requirement_set)


Exception:
Traceback (most recent call last):
  File "/Library/Python/2.7/site-packages/pip/_internal/basecommand.py", line 141, in main
    status = self.run(options, args)
  File "/Library/Python/2.7/site-packages/pip/_internal/commands/install.py", line 299, in run
    resolver.resolve(requirement_set)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 102, in resolve
    self._resolve_one(requirement_set, req)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 256, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/Library/Python/2.7/site-packages/pip/_internal/resolve.py", line 209, in _get_abstract_dist_for
    self.require_hashes
  File "/Library/Python/2.7/site-packages/pip/_internal/operations/prepare.py", line 283, in prepare_linked_requirement
    progress_bar=self.progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 836, in unpack_url
    progress_bar=progress_bar
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 673, in unpack_http_url
    progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 897, in _download_http_url
    _download_url(resp, link, content_file, hashes, progress_bar)
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 617, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/Library/Python/2.7/site-packages/pip/_internal/utils/hashes.py", line 48, in check_against_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 585, in written_chunks
    for chunk in chunks:
  File "/Library/Python/2.7/site-packages/pip/_internal/download.py", line 574, in resp_read
    decode_content=False):
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 465, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 430, in read
    raise IncompleteRead(self._fp_bytes_read, self.length_remaining)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Library/Python/2.7/site-packages/pip/_vendor/urllib3/response.py", line 345, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
ReadTimeoutError: HTTPSConnectionPool(host='files.pythonhosted.org', port=443): Read timed out.

According to the guide, the environment of Scrapy is built.

Common Error Reporting and Solution of Scrapy Crawler Running

Follow the first Spider code exercise and save it in the file dmoz_spider. py in the tutorial/spiders directory:


import scrapy
 
class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]
 
    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body) 

Running in terminal: scrapy crawl dmoz, trying to start the crawler

Error Tip 1:

Scrapy 1.6.0 - no active project

Unknown command: crawl


alicedeMacBook-Pro:~ alice$ scrapy crawl dmoz
Scrapy 1.6.0 - no active project
 
Unknown command: crawl
 
Use "scrapy" to see available commands

The reason: scrapy. cfg is automatically generated when using the command line startproject. When using the command line cmd to start the crawler, crawl will search the scrapy. cfg file in the current directory of cmd, which is also explained in the official document. If the scrapy. cfg file is not found, the project is not considered.

Solution: So cd goes to the root directory of the dmoz project, the directory where the scrapy. cfg file is, and executes the command scrapy crawl dmoz

Under normal circumstances, the output should be:

2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)

2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...

2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...

2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened

2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) < GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/ > (referer: None)

2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) < GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/ > (referer: None)

However, it is not

Error Tip 2:

  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load

    raise KeyError("Spider not found: {}".format(spider_name))

KeyError: 'Spider not found: dmoz'


alicedeMacBook-Pro:tutorial alice$ scrapy crawl dmoz
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: tutorial)
2019-04-19 09:28:23 [scrapy.utils.log] INFO: Versions: lxml 4.3.3.0, libxml2 2.9.9, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 16:39:00) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.3.0-x86_64-i386-64bit
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 69, in load
    return self._spiders[spider_name]
KeyError: 'dmoz'
 
During handling of the above exception, another exception occurred:
 
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/scrapy/spiderloader.py", line 71, in load
    raise KeyError("Spider not found: {}".format(spider_name))
KeyError: 'Spider not found: dmoz'

Reason: The directory located is incorrect. To enter the directory where dmoz is located

Solution: It is also relatively simple, and you can re-enter the check directory

Error Tip 3:

 File "/Library/Python/2.7/site-packages/twisted/internet/_sslverify.py", line 15, in < module >
from OpenSSL._util import lib as pyOpenSSLlib
ImportError: No module named _util


alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
  Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
    100% | In fact, in fact, the | 2.7MB 398kB/s 
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
  Running setup.py install for twisted ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.13-intel-2.7
    creating build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/_version.py -> build/li
0

I have searched the information on the Internet for a long time, but there is still no solution. Some bloggers said that there was something wrong with the installation of pyOpenSSL or Scrapy, so they reinstalled pyOpenSSL and Scrapy, but they still reported the same error, so they really don't know how to solve it.

pyOpenSSL and Scrapy are reinstalled in the back, which seems to be solved ~


alicedeMacBook-Pro:~ alice$ pip install twisted==13.1.0
Collecting twisted==13.1.0
  Downloading https://files.pythonhosted.org/packages/10/38/0d1988d53f140ec99d37ac28c04f341060c2f2d00b0a901bf199ca6ad984/Twisted-13.1.0.tar.bz2 (2.7MB)
    100% | In fact, in fact, the | 2.7MB 398kB/s 
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from twisted==13.1.0) (4.1.1)
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from zope.interface>=3.6.0->twisted==13.1.0) (18.5)
Installing collected packages: twisted
  Running setup.py install for twisted ... error
    Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-install-inJwZ2/twisted/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/v1/9x8s5v8x74v86vnpqyttqy280000gn/T/pip-record-OmuVWF/install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.13-intel-2.7
    creating build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/copyright.py -> build/lib.macosx-10.13-intel-2.7/twisted
    copying twisted/_version.py -> build/li
1

Related articles: