Crawling the profile information and submitted videos of Bilibili UPs (uploaders) with Python

  • 2021-11-10 09:52:00
  • OfStack

Project address:

https://github.com/cgDeepLearn/BilibiliCrawler

Project characteristics

Adopts some counter-measures against Bilibili's anti-crawling protections. Note: Bilibili has changed the API of the user page, so the user crawler's parser needs to be rewritten.

Quick start

Pull the project with git clone https://github.com/cgDeepLearn/BilibiliCrawler.git, go to the project home directory, and install the virtual environment crawlenv (see "Virtual Environment Installation" in the instructions below). Activate the environment and run the crawl scripts in the home directory; the crawl results are saved as csv files in the data directory.

source activate crawlenv
python initial.py file  # initialize in file mode
python crawl_user.py file 1 100  # file mode; 1 and 100 are the start and end Bilibili uids

Enter the data directory to view the crawled data. That's all there is to it!
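To take a quick look at a result file you can, for example, load it with pandas (a minimal sketch; data/user.csv is only an assumed file name, so check the data directory for the csv files your crawl actually produced):

import pandas as pd

# NOTE: 'data/user.csv' is a hypothetical example name; use an actual file from the data directory
df = pd.read_csv('data/user.csv')
print(df.head())        # preview the first few crawled records
print(len(df), 'rows')  # total number of crawled records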

If you want to save to a database or change other settings, see the instructions below.

Instructions for use

1. Pull the project


git clone https://github.com/cgDeepLearn/BilibiliCrawler.git

2. Enter the project home directory and install the virtual environment

If anaconda is installed

conda create -n crawlenv python=3.6
source activate crawlenv  #  Activate the virtual environment 
pip install -r requirements.txt
If virtualenv is used

virtualenv crawlenv
source crawlenv/bin/activate  # activate the virtual environment; on Windows, omit 'source'
pip install -r requirements.txt  # install project dependencies

3. Modify the configuration file

Enter the config directory and modify the config.ini configuration file. The default database is PostgreSQL; if you are using PostgreSQL, you only need to replace the parameters with your own, and the remaining database steps below can be skipped. Otherwise, pick whichever database you have installed locally in the configuration and replace the parameters with yours. If you want a more automated database setup, see my DB_ORM project.


[db_mysql]
user = test
password = test
host = localhost
port = 3306
dbname = testdb

[db_postgresql]
user = test
password = test
host = localhost
port = 5432
dbname = testdb

Then modify the function in conf.py that reads the database configuration:


def get_db_args():
    """
     Get database configuration information 
    """
    return dict(CONFIG.items('db_postgresql'))  # if you installed mysql, replace the section name with 'db_mysql'
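For context, CONFIG here is presumably a configparser object that has already read config.ini; a rough sketch of what that setup typically looks like (an assumption for illustration, not necessarily the project's exact code):

import configparser

# read config/config.ini; adjust the path to match the project layout
CONFIG = configparser.ConfigParser()
CONFIG.read('config/config.ini')

print(dict(CONFIG.items('db_postgresql')))  # e.g. {'user': 'test', 'password': 'test', ...}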

Enter the db directory and modify the database connection DSN in basic.py:


# connect_str = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(kwargs['user'], kwargs['password'], kwargs['host'], kwargs['port'], kwargs['dbname'])
# if you are using mysql, replace the connect_str above with the one below
connect_str = "mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8".format(kwargs['user'], kwargs['password'], kwargs['host'], kwargs['port'], kwargs['dbname'])
# for sqlite3 or mongo, see my DB_ORM project; support for some other databases will be added as well
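The connection strings above are SQLAlchemy-style URLs, so basic.py presumably builds an engine from connect_str. A minimal sketch of that pattern, assuming SQLAlchemy and the matching driver (psycopg2 or pymysql) are installed:

from sqlalchemy import create_engine, text

# values as returned by get_db_args() in conf.py
kwargs = dict(user='test', password='test', host='localhost', port=5432, dbname='testdb')
connect_str = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
    kwargs['user'], kwargs['password'], kwargs['host'], kwargs['port'], kwargs['dbname'])

engine = create_engine(connect_str)
with engine.connect() as conn:
    print(conn.execute(text('SELECT 1')).scalar())  # quick connectivity check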

4. Run the crawler

Activate the virtual environment in the home directory. For the first run, execute:

python initial.py db  # db mode; to use file mode, change db to file
# file mode saves the crawl results as csv files in the data directory
# db mode saves the data in the configured database
# running initial.py in db mode again will drop and re-create all tables, so use it with caution after the first run!!!
# if you have added a table and do not want to wipe the existing data, run python create_all.py instead
Examples of starting a crawl:

python crawl_user.py db 1 10000  # crawl_user crawls user data; db saves it to the database; 1 10000 are the start and end uids
python crawl_video_ajax.py db 1 100  # crawl_video_ajax crawls video info via ajax and saves it to the database
python crawl_user_video.py db 1 10000  # crawls user and video info at the same time
# in these examples, if a user in the uid range has submitted videos, the submitted video info is crawled as well
# to crawl videos one by one by video id (aid), run python crawl_video_by_aid.py db 1 1000
Crawling rate control

The program already applies a number of rate-limit settings, but the crawl rate varies with each machine's CPU and memory, so adjust as appropriate.
If it is too fast or too slow, modify the sleepsec parameter in each crawl script; the site limits access frequency per IP, and exceeding it will lead to incomplete crawl data.
A speed run parameter (high, low) will be added later, so the rate will not need to be configured manually.
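Conceptually, sleepsec just inserts a pause between consecutive requests; the sketch below illustrates this kind of throttling (it is not the project's actual crawl loop, and the value 0.5 is only an example):

import time

SLEEPSEC = 0.5  # example value; raise it if data comes back incomplete, lower it if crawling is too slow

def crawl_range(start_uid, end_uid, fetch):
    """Call fetch(uid) for every uid in the range, pausing between requests."""
    for uid in range(start_uid, end_uid + 1):
        fetch(uid)            # fetch and parse one uid
        time.sleep(SLEEPSEC)  # keep the request frequency below the site's per-IP limit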

Logs

Crawl logs are stored in the logs directory.
user and video are the crawl logs for users and videos respectively.
storage is the database log. If you need to change the log format, please modify the logger module.
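As a rough illustration of the kind of change involved (an assumed logger setup, not the project's exact logger module), adjusting the format string is usually all that is needed:

import logging

# example: a file handler writing to logs/user.log with a custom format
handler = logging.FileHandler('logs/user.log', encoding='utf-8')
handler.setFormatter(logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s'))

logger = logging.getLogger('user')
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.info('crawler started')  # entries like this end up in logs/user.log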

Running in the background

On Linux, prefix the python command with nohup, for example:


nohup python crawl_user.py db 1 10000 &

The program output is saved to a file: by default, nohup writes it to the nohup.out file in the home directory; append > filename to save the output to a file of your choice.


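For example (crawl_user.log is just an illustrative file name):

nohup python crawl_user.py db 1 10000 > crawl_user.log 2>&1 &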

The producer-consumer pattern used by the program's multithreading prints information about the program's running status while it crawls.


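For reference, a minimal sketch of the producer-consumer threading pattern described above (an illustration of the idea only, not the project's actual code):

import threading
import queue

task_q = queue.Queue()

def producer(start_uid, end_uid):
    """Put the uids to crawl onto the queue."""
    for uid in range(start_uid, end_uid + 1):
        task_q.put(uid)

def consumer():
    """Take uids off the queue, process them, and print progress."""
    while True:
        uid = task_q.get()
        # ... fetch and store the data for this uid here ...
        print('processed uid', uid)  # the status printing mentioned above
        task_q.task_done()

threading.Thread(target=producer, args=(1, 100)).start()
for _ in range(4):  # a few consumer/worker threads
    threading.Thread(target=consumer, daemon=True).start()
task_q.join()  # wait until every queued uid has been processed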

If you want the crawler to run faster, comment out the print statements once the program is configured and running correctly.



More

The project is single-machine and multithreaded. If you want distributed crawling, please refer to Crawler-Celery.

The above are the details of crawling the profile information and submitted videos of Bilibili UPs with Python. For more about crawling with Python, please pay attention to the other related articles on this site!

