Distributed deployment of pyspider on CentOS 7

  • 2020-06-01 10:04:46
  • OfStack

1. Setting up environment:

System version: Linux centos-linux.shared 3.10.0-123.el7.x86_64 #1 SMP Mon Jun 30 12:09:22 UTC 2014 x86_64 x86_64 GNU/Linux

python version: Python 3.5.1

1.1. Build the python3 environment:

After trying both approaches, I chose the Anaconda integrated environment.

1.1.1. Compiling from source


#  Install the build dependencies
yum install -y ncurses-devel openssl openssl-devel zlib-devel gcc make glibc-devel libffi-devel glibc-static glibc-utils sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel
#  Download the python source
wget https://www.python.org/ftp/python/3.5.1/Python-3.5.1.tgz
#  Or use a mirror inside China
wget http://mirrors.sohu.com/python/3.5.1/Python-3.5.1.tgz
mv Python-3.5.1.tgz /usr/local/src;cd /usr/local/src
#  Unpack the archive
tar -zxf Python-3.5.1.tgz;cd Python-3.5.1
#  Compile and install
./configure --prefix=/usr/local/python3.5 --enable-shared
make && make install
#  Create a symlink and register the shared library path
ln -s /usr/local/python3.5/bin/python3 /usr/bin/python3
echo "/usr/local/python3.5/lib" > /etc/ld.so.conf.d/python3.5.conf
ldconfig
#  Verify python3
python3
# Python 3.5.1 (default, Oct 9 2016, 11:44:24)
# [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux
# Type "help", "copyright", "credits" or "license" for more information.
# >>>
#  Upgrade pip and link it
/usr/local/python3.5/bin/pip3 install --upgrade pip
ln -s /usr/local/python3.5/bin/pip /usr/bin/pip
#  If pip fails to install, reinstall it with get-pip.py
wget https://bootstrap.pypa.io/get-pip.py --no-check-certificate
python3 get-pip.py

1.1.2. Anaconda integrated environment


#  Anaconda integrated environment (recommended)
wget https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
#  Run the installer directly
bash Anaconda3-4.2.0-Linux-x86_64.sh
#  If the installer fails, extraction may have failed; install bzip2 and retry
yum install bzip2

1.2. Installing MariaDB


#  Install
yum -y install mariadb mariadb-server
#  Start the service
systemctl start mariadb
#  Start at boot
systemctl enable mariadb
#  Set the root password (empty by default)
mysql_secure_installation
#  Log in
mysql -u root -p
#  Create a user (choose your own name and password)
CREATE USER 'user_name'@'localhost' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'localhost' WITH GRANT OPTION;
CREATE USER 'user_name'@'%' IDENTIFIED BY 'user_pass';
GRANT ALL PRIVILEGES ON *.* TO 'user_name'@'%' WITH GRANT OPTION;

1.3. Installing pyspider

I use Anaconda.


#  Create a virtual environment named sbird with python 3.*
conda create -n sbird "python=3*"
#  Activate the environment
source activate sbird
#  Install pyspider
pip install pyspider
#  If you see this error:
# it does not exist. The exported locale is "en_US.UTF-8" but it is not supported
#  run the following (it can also be added to .bashrc)
export LC_ALL=en_US.utf-8
export LANG=en_US.utf-8
#  If you see this error:
# ImportError: pycurl: libcurl link-time version (7.29.0) is older than compile-time version (7.49.0)
#  reinstall pycurl through conda
conda install pycurl
#  Leave the environment
source deactivate sbird
#  If localhost:5000 is unreachable inside a virtual machine, stop the firewall
systemctl stop firewalld.service
#  ===== Alternatively, run directly from source =====
mkdir git;cd git
#  Download
git clone https://github.com/binux/pyspider.git
#  Run
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py

Other methods


#  Install virtualenv
pip install virtualenv
mkdir python;cd python
#  Create the virtual environment pyenv3
virtualenv -p /usr/bin/python3 pyenv3
#  Enter the virtual environment and activate it
cd pyenv3/
source ./bin/activate
pip install pyspider
#  If pycurl fails to build
yum install libcurl-devel
#  Then install again
pip install pyspider
#  Leave the environment
deactivate

I recommend the Anaconda installation.

If pyspider reports an error when it runs, see the fixes in the Anaconda installation section above. Once it is running, the web page is available at localhost:5000.

1.4. Installing Supervisor


#  Install
yum install supervisor -y
#  If the package cannot be found, add the Aliyun EPEL repository
vim /etc/yum.repos.d/epel.repo
#  Add the following
[epel]
name=Extra Packages for Enterprise Linux 7 - $basearch
baseurl=http://mirrors.aliyun.com/epel/7/$basearch
    http://mirrors.aliyuncs.com/epel/7/$basearch
failovermethod=priority
enabled=1
gpgcheck=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7

[epel-debuginfo]
name=Extra Packages for Enterprise Linux 7 - $basearch - Debug
baseurl=http://mirrors.aliyun.com/epel/7/$basearch/debug
    http://mirrors.aliyuncs.com/epel/7/$basearch/debug
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0

[epel-source]
name=Extra Packages for Enterprise Linux 7 - $basearch - Source
baseurl=http://mirrors.aliyun.com/epel/7/SRPMS
    http://mirrors.aliyuncs.com/epel/7/SRPMS
failovermethod=priority
enabled=0
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-EPEL-7
gpgcheck=0
#  Then install again
yum install supervisor -y
#  Check that the installation succeeded
echo_supervisord_conf

1.4.1. Supervisor usage


supervisord    #  Start the supervisor server (daemon)
supervisorctl  #  Open the supervisor command-line client
#  For example, create a process named pyspider01
vim /etc/supervisord.d/pyspider01.ini
#  Write the following
[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
#  Make sure the log directory exists
mkdir -p /pyspider/supervisor
#  Reload the configuration
supervisorctl reload
#  Start the process
supervisorctl start pyspider01
#  Supervisor can also be started with an explicit configuration file
supervisord -c /etc/supervisord.conf
#  Check the status
supervisorctl status
#  Output
pyspider01            RUNNING  pid 4026, uptime 0:02:40
#  Shut down supervisor
supervisorctl shutdown

1.5. Installing redis


#  redis is used as the message queue
mkdir download;cd download
wget http://download.redis.io/releases/redis-3.2.4.tar.gz
tar xzf redis-3.2.4.tar.gz
cd redis-3.2.4
make
#  Or install it directly with yum
yum -y install redis
#  Start
systemctl start redis.service
#  Restart
systemctl restart redis.service
#  Stop
systemctl stop redis.service
#  Check the status
systemctl status redis.service
#  Edit the file /etc/redis.conf
vim /etc/redis.conf
#  Make these changes:
#    daemonize no    ->  daemonize yes
#    bind 127.0.0.1  ->  bind 10.211.55.22  (the current server's IP)
#  Restart redis
systemctl restart redis.service
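
Since redis is now bound to the server's LAN address, the other machines need to be able to reach it over the network. The following is a minimal sketch of a TCP reachability check using only the Python standard library; the helper name port_open is hypothetical, and the ports mentioned below are the usual defaults (6379 for redis, 3306 for MariaDB), which this article does not state explicitly.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts
        return False
```

For example, calling port_open("10.211.55.22", 6379) from another server should return True once redis has been restarted with the new bind address and the firewall is not blocking the port.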

1.6. Starting services at boot


#  Start Supervisor at boot
systemctl enable supervisord.service
#  Start redis at boot
systemctl enable redis.service
#  Keep the firewall from starting at boot
systemctl disable firewalld.service

At this point, the single-server pyspider environment is set up and deployed. Open localhost:5000 to enter the web interface.

You can also write a script, run it, and check its running state in the log at /pyspider/supervisor/pyspider01.log.

2. Distributed deployment

Name the server you just configured centos01, then set up two more servers, centos02 and centos03, following the same steps.

As follows:

Server name   IP             Description
centos01      10.211.55.22   redis, mariaDB, scheduler
centos02      10.211.55.23   fetcher, processor, result_worker, phantomjs
centos03      10.211.55.24   fetcher, processor, result_worker, webui


2.1. centos01

On centos01 the basic environment is already in place after step 1. First, edit the configuration file /pyspider/config.json.
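
A minimal sketch of what such a config.json might contain, written as a small Python script that prints the JSON. The key names follow pyspider's command-line options (taskdb, projectdb, resultdb, message_queue, phantomjs-proxy, and the webui options port and scheduler-rpc); every value is an assumption: the user name and password are the placeholders from section 1.2, the hosts come from the table above, and ports 3306, 6379, 23333 and 25555 are defaults or the port used for phantomjs later in this article.

```python
import json

# Hypothetical values: replace the user, password, and hosts with your own.
DB_USER = "user_name"
DB_PASS = "user_pass"
CENTOS01 = "10.211.55.22"  # redis, MariaDB, scheduler
CENTOS02 = "10.211.55.23"  # phantomjs


def build_config():
    """Return a pyspider config dict that points every component at centos01."""
    db_url = "mysql+{kind}://{user}:{pw}@{host}:3306/{kind}"
    config = {
        kind: db_url.format(kind=kind, user=DB_USER, pw=DB_PASS, host=CENTOS01)
        for kind in ("taskdb", "projectdb", "resultdb")
    }
    # redis on centos01 serves as the message queue (database 0 assumed)
    config["message_queue"] = "redis://{}:6379/0".format(CENTOS01)
    # phantomjs runs on centos02 and listens on port 25555
    config["phantomjs-proxy"] = "{}:25555".format(CENTOS02)
    # webui talks to the scheduler's XML-RPC endpoint (default port assumed)
    config["webui"] = {
        "port": 5000,
        "scheduler-rpc": "http://{}:23333/".format(CENTOS01),
    }
    return config


if __name__ == "__main__":
    print(json.dumps(build_config(), indent=2))
```

Redirecting the output to /pyspider/config.json would produce a file that all three servers can share, since every component reads its database and queue addresses from it.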



Try running the scheduler:


#  Start the scheduler with the configuration file
/root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler

After it runs successfully, change /etc/supervisord.d/pyspider01.ini as follows:


[program:pyspider01]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json scheduler
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/pyspider01.log
#  Reload the configuration
supervisorctl reload

centos01 has been deployed.

2.2. centos02

On centos02 you need to run result_worker, processor, phantomjs, and fetcher.

Create the following files:


/etc/supervisord.d/result_worker.ini

[program:result_worker]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json result_worker
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/result_worker.log

/etc/supervisord.d/processor.ini

[program:processor]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json processor
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/processor.log

/etc/supervisord.d/phantomjs.ini

[program:phantomjs]

command   = /pyspider/phantomjs --config=/pyspider/pjsconfig.json /pyspider/phantomjs_fetcher.js 25555
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/phantomjs.log

/etc/supervisord.d/fetcher.ini

[program:fetcher]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json fetcher
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/fetcher.log
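
The result_worker, processor, and fetcher files above differ only in the component name, so they can be generated from one template. A minimal sketch, assuming the same interpreter, run.py, config, and log paths used in this article; the function name render_program is hypothetical.

```python
PYTHON = "/root/anaconda3/envs/sbird/bin/python"
RUN_PY = "/root/git/pyspider/run.py"
CONFIG = "/pyspider/config.json"
LOG_DIR = "/pyspider/supervisor"

# %(program_name)s is left untouched by str.format, which only expands {braces}
TEMPLATE = """[program:{name}]

command   = {python} {run_py} -c {config} {name}
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = {log_dir}/{name}.log
"""


def render_program(name):
    """Return the supervisor [program:x] section for one pyspider component."""
    return TEMPLATE.format(
        name=name, python=PYTHON, run_py=RUN_PY, config=CONFIG, log_dir=LOG_DIR
    )


if __name__ == "__main__":
    for component in ("result_worker", "processor", "fetcher"):
        print("# /etc/supervisord.d/{}.ini".format(component))
        print(render_program(component))
```

The phantomjs entry is left out on purpose because its command line does not go through run.py.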

Create pjsconfig.json in the /pyspider directory.
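
PhantomJS reads the file passed with --config as a JSON object whose keys are the camel-cased forms of its command-line options. A minimal sketch of one possible pjsconfig.json, printed from Python; the two options chosen here (ignoreSslErrors and outputEncoding) are assumptions for illustration, not necessarily the settings used in this deployment.

```python
import json


def build_pjsconfig():
    """Return a small PhantomJS configuration as a dict (illustrative values)."""
    return {
        # Equivalent to --ignore-ssl-errors=true
        "ignoreSslErrors": True,
        # Equivalent to --output-encoding=utf8
        "outputEncoding": "utf8",
    }


if __name__ == "__main__":
    print(json.dumps(build_pjsconfig(), indent=2))
```

Saving the printed JSON as /pyspider/pjsconfig.json matches the --config path used in phantomjs.ini above.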



Download phantomjs into the /pyspider/ folder, and copy git/pyspider/pyspider/fetcher/phantomjs_fetcher.js to /pyspider/phantomjs_fetcher.js.



centos02 has been deployed.

2.3. centos03

Deploy the three processes fetcher, processor, and result_worker the same way as on centos02. This server additionally runs webui.

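Create the file:


/etc/supervisord.d/webui.ini

[program:webui]

command   = /root/anaconda3/envs/sbird/bin/python /root/git/pyspider/run.py -c /pyspider/config.json webui
directory  = /root/git/pyspider
user     = root
process_name = %(program_name)s
autostart  = true
autorestart = true
startsecs  = 3

redirect_stderr     = true
stdout_logfile_maxbytes = 500MB
stdout_logfile_backups = 10
stdout_logfile     = /pyspider/supervisor/webui.log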

3. Summary

Visit http://10.211.55.24:5000 to open the web interface and start crawling.

