Implementing PV statistics for a Django blog with Celery
- 2020-05-30 20:33:45
A few days ago I added PV statistics to the articles on this site; previously there were only UV statistics. The reason I held off on PV is that it requires a write to the database on every single visit. From the point of view of one visit to the the5fire blog, the server only needs to fetch the article from the database (usually from the cache) and return it to the browser; a write on that hot path felt like pure overhead. The earlier UV version also only wrote once per user every 24 hours.
On the other hand, for a small site like the the5fire blog, a dozen extra database writes per visit would not actually matter. Still, even a small site should be built with the mindset of one day handling hundred-million-level traffic.
If that doesn't resonate, go look at how other people's sites work: the ones with billions of visits, and how they handle user-generated writes such as comments and messages.
The significance of PV
Before the how, the why. Every website tracks metrics like PV and UV, and often time-on-page, per-page conversion rates, and so on. My job at Sohu is, in plain terms, building websites, and the business metrics we focus on are all traffic-related. Having also been a webmaster for many years, I have used some of the indicators in Baidu Analytics to guide adjustments.
This post, though, is only about PV: the PV of a single article.
Abnormal traffic aside, the more people visit an article on the Internet, the more valuable it is; after all, people click on things they find valuable. That count of people is UV (unique visitors). So what is PV (page views)? A well-written article, especially a technical one, may be visited many times by the same person. For example, I like to bookmark good articles and reread them when I have time; each reread (page refresh) counts as one PV. An article that readers come back to repeatedly is worth more, so the PV/UV ratio of an article is itself a measure of its value, especially in the clickbait era. (A small digression: clickbait titles are not an invention of the we-media era; they existed in the blog era too, just less concentrated than now.)
Talking about "value" in the abstract doesn't mean much. As the ancients said, value can be exchanged for a few buckets of rice. (I made that up.)
Take today's news websites and media platforms: PV can practically be equated with ￥. More traffic means more revenue, whether it comes from advertising or from funneling traffic to other channels. Sometimes I wonder: is the goal of all this really to understand users better, to show them what they want to see? Perhaps, but the part that can never be gotten around is the business model in which advertisers and investors pay for users' attention: keep users on the platform longer, make them spend more time. (Purely personal opinion, stated deliberately.)
Another more direct example: we-media is everywhere now, and everyone wants a 100,000+ view post. A good public account can price advertising, advertorials, and other collaborations according to the view counts (or follower counts) of its previous articles. You can see this for yourself on WeChat's advertising platforms.
So, is PV starting to look attractive?
For websites, there are several ways to collect PV and UV statistics that the5fire knows of.
The first two are tightly coupled implementations that require inserting code into the specific pages. The latter two are similar in that both essentially collect the nginx access logs, but at different stages: the third only records a hit once the page has fully loaded, while the fourth counts a request as soon as upstream returns a 200 status code, even if the end user never actually sees the page.
In short, each approach has its own trade-offs, and they can be used to cross-check one another.
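As a rough illustration of the log-based approaches, here is a minimal sketch in the spirit of the fourth method: count one PV per request line whose status code is 200. The log lines, URL paths, and regex are invented for illustration and do not come from the blog's actual setup.

```python
import re
from collections import Counter

# Hypothetical nginx "combined"-style log lines; a real log format may differ.
LOG = '''\
1.2.3.4 - - [30/May/2020:20:33:45 +0800] "GET /post/123/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0"
5.6.7.8 - - [30/May/2020:20:34:01 +0800] "GET /post/123/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0"
1.2.3.4 - - [30/May/2020:20:35:10 +0800] "GET /post/456/ HTTP/1.1" 404 150 "-" "curl/7.58"
'''

LINE_RE = re.compile(r'"(?:GET|POST) (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

def count_pv(log_text):
    """Count one PV per request line whose status is 200."""
    pv = Counter()
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if m and m.group("status") == "200":
            pv[m.group("path")] += 1
    return pv

print(count_pv(LOG))  # Counter({'/post/123/': 2})
```

The 404 request is excluded, which is exactly the trade-off mentioned above: this method counts what upstream served, not what the user actually saw.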
How this blog implements it
As mentioned above, the main motivation was to use the Celery distributed task queue. Using it in Django is a simple matter.
Using Celery in Django requires that the Celery worker can import the various modules of the Django project, so the settings module must be specified first. I am using Django 1.11. Add a celery.py in the same directory as wsgi.py, with the following code:
```python
# coding:utf-8
from __future__ import absolute_import, unicode_literals

import os

from celery import Celery

# I split settings.py into develop.py and product.py
PROFILE = os.environ.get('DJANGO_SELFBLOG_PROFILE', 'develop')
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "django_selfblog.settings.%s" % PROFILE)

app = Celery('selfblog', broker="redis://127.0.0.1:6666/2")
app.config_from_object('django.conf:settings', namespace='CELERY')

# Load task modules from all registered Django app configs.
app.autodiscover_tasks()
```
Note that Redis, rather than the officially recommended RabbitMQ, is used as the broker. The main reason is that the cache already uses Redis, so this avoids introducing yet another system to maintain.
With the startup file defined, the next step is to define the specific tasks. Write them in app/tasks.py:
```python
# coding:utf-8
from __future__ import unicode_literals

from django.db.models import F

from .models import Post
from django_selfblog.celery import app


@app.task
def increase_pv(post_id):
    return Post.objects.filter(id=post_id).update(pv=F('pv') + 1)


@app.task
def increase_uv(post_id):
    return Post.objects.filter(id=post_id).update(uv=F('uv') + 1)
```
Then add the calls at the corresponding place in views.py where the article page is served:

```python
from .tasks import increase_pv, increase_uv

# ... context omitted
increase_pv.delay(self.post.id)
increase_uv.delay(self.post.id)
```
This way, the PV/UV counting logic for each visit is executed in the distributed task queue and does not slow down the request itself.
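To make the decoupling concrete, here is a toy sketch of the same idea using only the standard library: a queue stands in for the broker (Redis in this post) and a thread stands in for the Celery worker, while the "view" merely enqueues. All names here are invented for illustration.

```python
import queue
import threading

# In-memory stand-ins for the DB row and the broker.
post_pv = {"post_1": 0}
broker = queue.Queue()

def worker():
    # Plays the role of the celery worker process: drain tasks, do the write.
    while True:
        post_id = broker.get()
        if post_id is None:  # shutdown sentinel
            break
        post_pv[post_id] += 1

t = threading.Thread(target=worker)
t.start()

# The "view" only enqueues -- the request returns without touching the "DB".
for _ in range(10):
    broker.put("post_1")

broker.put(None)  # ask the worker to stop
t.join()
print(post_pv["post_1"])  # 10
```

The write still happens, but on the worker's schedule, not on the request's critical path, which is exactly what `increase_pv.delay(...)` buys us.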
If you want to check the status of a task, for example:

```python
r = increase_pv.delay(self.post.id)
print(r.ready())
```
to see whether the task has completed, you need to introduce django-celery-results. The steps are:
1. pip install django-celery-results
2. Add django_celery_results to INSTALLED_APPS
3. Configure CELERY_RESULT_BACKEND = 'django-db' (or 'django-cache'). If you configure django-db, meaning results are stored in the database, run python manage.py migrate django_celery_results to create the tables.
Once these configurations are complete, all that remains is deployment. The the5fire blog deploys via fabric: each time the code is updated, fab re_deploy:master pushes it to the server. After adding Celery, the only extra step is a supervisord configuration entry, since the Celery code lives in the same codebase as the blog.
supervisord added configuration:
```ini
[program:celery]
command=celery -A selfblog worker -P gevent --loglevel=INFO --concurrency=5
directory=/home/the5fire/selfblog/
process_name=%(program_name)s_%(process_num)d
umask=022
startsecs=0
stopwaitsecs=0
redirect_stderr=true
stdout_logfile=/tmp/log/celery_%(process_num)02d.log
numprocs=1
numprocs_start=1
environment=DJANGO_SELFBLOG_PROFILE=product
```

The celery worker is then restarted on each redeployment.
In a Django project the biggest performance cost is usually the ORM, and it is easy to get burned if you are not familiar with it.
Take incrementing PV as an example. Each time a user visits an article, the pv field should be incremented by 1. In code, that might look like:
```python
# Never write such stupid code
post = Post.objects.get(pk=post_id)
post.pv = post.pv + 1
post.save()
```
This is the most intuitive approach: fetch the post, add 1, save. Keep in mind that on a normal article view the post usually comes from the cache, precisely so that we don't hit the database on every request. But the real problem with this read-modify-write sequence is the race condition.
For example, suppose 100 people access an article at the same time, and the server handles requests with multiple threads/processes, so several of them may execute post = Post.objects.get(pk=post_id) simultaneously. If the article's pv in the database is 100, each of them reads post.pv as 100; after they all execute post.save(), the stored value is 101. In other words, 100 concurrent visits may increase pv by only 1.
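The lost update can be shown deterministically with plain Python, using a dict as a stand-in for the database row. This is an illustrative sketch of the worst-case interleaving described above, not the blog's actual code:

```python
# A dict stands in for the blog_post row in the database.
db = {"pv": 100}

def read_pv():
    return db["pv"]

def write_pv(value):
    db["pv"] = value

# Worst-case interleaving: both "requests" read before either writes.
a = read_pv()    # request A reads 100
b = read_pv()    # request B reads 100
write_pv(a + 1)  # A writes 101
write_pv(b + 1)  # B also writes 101 -- A's increment is lost

print(db["pv"])  # 101, not 102
```

With real threads the interleaving is nondeterministic, which is exactly why this class of bug is easy to miss in testing.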
There are two ways to solve this problem.
1. Locking. As far as I know, Django does not provide this out of the box, so you would have to implement it yourself, and nobody really does that for a counter. 2. Let MySQL perform the increment atomically, which is what I used above.
For method 2, how do you express it in Django? It translates to this SQL:
```sql
UPDATE `blog_post` SET `pv` = (`blog_post`.`pv` + 1) WHERE `blog_post`.`id` = <post_id>;
```
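The same atomic-increment pattern can be tried out with the standard library's sqlite3 (used here as a stand-in for MySQL; the table name mirrors the SQL above but the rest is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blog_post (id INTEGER PRIMARY KEY, pv INTEGER)")
conn.execute("INSERT INTO blog_post (id, pv) VALUES (1, 100)")

# Each statement increments in place; the database never hands the old
# value back to the application, so there is no read-modify-write window.
for _ in range(5):
    conn.execute("UPDATE blog_post SET pv = pv + 1 WHERE id = 1")

pv = conn.execute("SELECT pv FROM blog_post WHERE id = 1").fetchone()[0]
print(pv)  # 105
```

Because the arithmetic happens inside the database, concurrent increments serialize on the row instead of clobbering each other.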
The Django equivalent, as used in tasks.py above, is Post.objects.filter(id=post_id).update(pv=F('pv') + 1).
For more on F expressions, see the official documentation: https://docs.djangoproject.com/en/1.11/ref/models/expressions/#django.db.models.F