Python crawls Bilibili uploaders' main information and submission videos
- 2021-11-10 09:52:00
- OfStack
Project address:
https://github.com/cgDeepLearn/BilibiliCrawler
Project characteristics
Uses a fixed anti-crawling strategy. Note: Bilibili has changed the api of the user page, so the user-crawl parser needs to be rewritten.
Quick start
Pull the project: git clone https://github.com/cgDeepLearn/BilibiliCrawler.git
Go to the project root directory and install the virtual environment crawlenv (see the virtual environment installation steps in the instructions below).
Activate the environment and run the crawl scripts in the root directory; the crawl results are saved as csv files in the data directory.
source activate crawlenv
python initial.py file # initialize in file mode
python crawl_user.py file 1 100 # file mode; 1 and 100 are the start and end Bilibili uids
Enter the data directory to view the crawled data. That's all it takes!
If you want to save to a database or change other settings, see the instructions below.
Instructions for use
1. Pull the project
git clone https://github.com/cgDeepLearn/BilibiliCrawler.git
2. Enter the project home directory and install the virtual environment
If anaconda is installed
conda create -n crawlenv python=3.6
source activate crawlenv # Activate the virtual environment
pip install -r requirements.txt
If virtualenv is used
virtualenv crawlenv
source crawlenv/bin/activate # Activate the virtual environment; on Windows omit source
pip install -r requirements.txt # Install project dependencies
3. Modify the configuration file
Enter the config directory and modify the config.ini configuration file. The default is the postgresql database: if you are using postgresql, just replace the parameters with your own and the remaining database steps below can be skipped. Otherwise, pick the section for the database you have installed locally and fill in your parameters. If you need more automated database configuration, see my DB_ORM project.
[db_mysql]
user = test
password = test
host = localhost
port = 3306
dbname = testdb
[db_postgresql]
user = test
password = test
host = localhost
port = 5432
dbname = testdb
Then modify the function in conf.py that reads the configuration file:
def get_db_args():
"""
Get database configuration information
"""
return dict(CONFIG.items('db_postgresql')) # If you are using mysql, replace the section name with 'db_mysql'
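The CONFIG object used above is not defined in the snippet; presumably conf.py loads config.ini with the standard configparser module. A minimal sketch of what that might look like (the file path is an assumption):

```python
import configparser
import os

# Assumed path to the ini file, relative to the project root
CONFIG_PATH = os.path.join('config', 'config.ini')

CONFIG = configparser.ConfigParser()
CONFIG.read(CONFIG_PATH, encoding='utf-8')  # silently yields no sections if the file is missing

def get_db_args():
    """Get database configuration information as a plain dict."""
    # Switch the section name to 'db_mysql' if you use mysql
    return dict(CONFIG.items('db_postgresql'))
```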
Enter the db directory and modify the database-connection DSN in basic.py:
# connect_str = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(kwargs['user'], kwargs['password'], kwargs['host'], kwargs['port'], kwargs['dbname'])
# If you are using mysql, replace the connect_str above with the one below
connect_str = "mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8".format(kwargs['user'], kwargs['password'], kwargs['host'], kwargs['port'], kwargs['dbname'])
# For sqlite3 and mongo please see my DB_ORM project; support for some other databases will also be added
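As a concrete illustration, with the sample parameters from config.ini above, the two DSN variants are plain formatted strings (no database connection is needed to build them):

```python
# Sample parameters matching the config.ini sections above
pg = dict(user='test', password='test', host='localhost', port='5432', dbname='testdb')
my = dict(user='test', password='test', host='localhost', port='3306', dbname='testdb')

# postgresql via the psycopg2 driver
pg_dsn = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
    pg['user'], pg['password'], pg['host'], pg['port'], pg['dbname'])

# mysql via the pymysql driver
my_dsn = "mysql+pymysql://{}:{}@{}:{}/{}?charset=utf8".format(
    my['user'], my['password'], my['host'], my['port'], my['dbname'])

print(pg_dsn)  # postgresql+psycopg2://test:test@localhost:5432/testdb
print(my_dsn)  # mysql+pymysql://test:test@localhost:3306/testdb?charset=utf8
```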
4. Run the crawler
Activate the virtual environment in the root directory. For the first run, execute:
python initial.py db # db mode; for file mode, change db to file
# file mode saves the crawl results in the data directory
# db mode saves the data in the configured database
# Warning: running db mode again will drop all tables and then recreate them, so be careful when rerunning it after the first time!!!
# If you add a table and do not want to wipe existing data, run python create_all.py instead
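To make the warning concrete: initialization in db mode presumably drops every table and then recreates it, so any existing rows are lost. A tiny sqlite3 sketch of that drop-then-create pattern (the table name and columns here are made up for illustration; the real project defines its own schema):

```python
import sqlite3

def initial(conn):
    """Drop and recreate all tables -- existing data is lost."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS user_info")  # drop ...
    cur.execute("CREATE TABLE user_info (uid INTEGER PRIMARY KEY, name TEXT)")  # ... then create
    conn.commit()

conn = sqlite3.connect(":memory:")
initial(conn)
conn.execute("INSERT INTO user_info VALUES (1, 'demo')")
initial(conn)  # running it again wipes the row just inserted
count = conn.execute("SELECT COUNT(*) FROM user_info").fetchone()[0]
print(count)  # 0 -- the earlier insert is gone
```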
Start crawling, for example:
python crawl_user.py db 1 10000 # crawl_user grabs user data; db means save to the database; 1 10000 are the start and end uids
python crawl_video_ajax.py db 1 100 # crawl_video_ajax grabs video info via the ajax api and saves it to the database
python crawl_user_video.py db 1 10000 # grabs user and video info at the same time
# The examples crawl users in the given uid range and, if a user has submission videos, also grab the info of those videos
# To crawl videos one by one by video id, run python crawl_video_by_aid.py db 1 1000
Crawling rate control
The program already applies several crawl-rate settings, but since cpu and memory differ from machine to machine, the achievable rate differs too, so adjust as appropriate.
If crawling is too fast or too slow, modify the sleepsec parameter in each crawl script; Bilibili limits access frequency per ip, and exceeding it leads to incomplete crawl data.
A speed run parameter (high, low) will be added later, so the rate will no longer need manual configuration.
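The sleepsec throttle amounts to pausing between requests; a minimal sketch of the idea (fetch_one is a stand-in for the real request code, and 0.05 is an arbitrary example value):

```python
import time

SLEEPSEC = 0.05  # pause between requests; raise it if the ip gets rate-limited

def fetch_one(uid):
    """Stand-in for the real HTTP request to the Bilibili api."""
    return {"uid": uid}

def crawl_range(start, end):
    """Crawl uids from start to end inclusive, pausing between requests."""
    results = []
    for uid in range(start, end + 1):
        results.append(fetch_one(uid))
        time.sleep(SLEEPSEC)  # throttle so requests are spaced out
    return results

data = crawl_range(1, 5)
print(len(data))  # 5
```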
Crawl logs are in the logs directory.
user and video are the crawl logs for users and videos respectively.
storage is the database log. If you need to change the log format, modify the logger module.
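The logger module itself is not shown here; a minimal sketch of what a shared logger helper might look like, using the standard logging package (the function name and format string are assumptions):

```python
import logging

def get_logger(name, logfile=None):
    """Build a named logger like the project's user/video/storage loggers."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    fmt = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    # Log to a file if one is given, otherwise to the console
    handler = logging.FileHandler(logfile) if logfile else logging.StreamHandler()
    handler.setFormatter(fmt)
    logger.addHandler(handler)
    return logger

log = get_logger("user")
log.info("crawled uid 42")
```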
Under linux, prefix the command with nohup, for example:
nohup python crawl_user.py db 1 10000
By default the program output is saved to a nohup.out file in the root directory; append > filename to redirect it to a file of your choice.
More
The program's multithreading uses the producer-consumer pattern and prints information about the program's running status while it works.
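The producer-consumer pattern mentioned above can be sketched with the standard queue and threading modules (the consumer body is a placeholder for the real fetch-and-store code):

```python
import queue
import threading

task_q = queue.Queue()
results = []
lock = threading.Lock()

def producer(start, end):
    """Put uids to crawl onto the shared queue."""
    for uid in range(start, end + 1):
        task_q.put(uid)

def consumer():
    """Take uids off the queue and 'crawl' them until the queue runs dry."""
    while True:
        try:
            uid = task_q.get(timeout=0.5)
        except queue.Empty:
            return
        with lock:
            results.append(uid)  # stand-in for fetching and storing data
        task_q.task_done()

producer(1, 20)
workers = [threading.Thread(target=consumer) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(len(results))  # 20
```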
If you want it to run faster, comment out the status printing once the program is set up.
The project is single-machine multithreaded. If you want distributed crawling, see Crawler-Celery.
The above are the details of crawling Bilibili uploaders' main information and submission videos with python. For more on python crawling, please follow the other related articles on this site!