Python tutorial using memcached to reduce the number of database queries

  • 2020-05-05 11:28:49
  • OfStack

I had never figured out how to properly optimize the performance of my web pages. Recently, while comparing the rendering speed of similar pages in Python and PHP, I stumbled on a very simple, almost embarrassingly obvious technique that I had somehow never adopted: do what PHP applications such as the Discuz forum do, and print "this page was generated in X seconds" on every generated page. Then, by refreshing the page while testing, you can see at a glance which operation causes a bottleneck, and work out how to fix it.

That is how I discovered that SimpleCD took about 0.2 seconds to generate its homepage, which was truly unbearable: the Discuz homepage averages 0.02 seconds, and it is undoubtedly more complex than SimpleCD's. That stung, because the gap is not necessarily the fault of the Python language; more likely it is because I had done no optimization at all, while the Discuz code is very well optimized.


In fact, no profiling was needed to know the database was the drag. Generating the SimpleCD homepage requires 42 queries across three sqlite databases, an extremely inefficient design left over for historical reasons. Still, most of those 40-odd queries are very fast; a closer analysis showed that only two were performance hogs, and the rest were not slow.

The first heavy hitter is counting the number of records:
 


SELECT count(*) FROM verycd

This operation is time-consuming every time, because each call locks the database and walks the primary key to count rows: the larger the table, the longer it takes, O(N) where N is the database size. The fix is easy: store the current record count somewhere, and update it only when data is added or deleted, so reading it becomes O(1).
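The O(1) counter idea can be sketched as follows. This is a minimal illustration with a hypothetical in-memory schema (a `counters` table and an `add_record` helper are my inventions; SimpleCD's real tables differ):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE verycd (id INTEGER PRIMARY KEY, title TEXT)')
c.execute('CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER)')
c.execute("INSERT INTO counters VALUES ('verycd_count', 0)")

def add_record(title):
    # Update the stored count in the same transaction as the insert,
    # so reading it later never needs a full count(*) scan.
    c.execute('INSERT INTO verycd (title) VALUES (?)', (title,))
    c.execute("UPDATE counters SET value = value + 1 WHERE name = 'verycd_count'")
    conn.commit()

def record_count():
    # O(1): a single primary-key lookup instead of an O(N) table walk.
    c.execute("SELECT value FROM counters WHERE name = 'verycd_count'")
    return c.fetchone()[0]

for t in ('a', 'b', 'c'):
    add_record(t)
print(record_count())  # 3
```

A delete would decrement the counter the same way; the point is that reads no longer touch the big table at all.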

The second heavy hitter is fetching the 20 most recently updated records:
 


SELECT verycdid,title,brief,updtime FROM verycd
 
  ORDER BY updtime DESC LIMIT 20;

Since updtime is indexed, the real query cost is the index lookup. So why is it slow? Because my data is inserted in publish order; displaying by update time means doing I/O in at least 20 scattered places on disk. The solution is to make it do I/O in one place: unless the database gains new data or changes old data, cache the result of this statement. That alone makes it 20 times faster :)
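The cache-until-write idea can be sketched like this, with a plain dict standing in for memcached and a simplified schema of my own invention:

```python
import sqlite3

cache = {}
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE verycd (verycdid INTEGER PRIMARY KEY, '
            'title TEXT, updtime INTEGER)')
cur.execute('CREATE INDEX idx_updtime ON verycd (updtime)')

def latest20():
    if 'latest20' not in cache:
        # The expensive scattered-I/O query runs only on a cache miss.
        cur.execute('SELECT verycdid, title, updtime FROM verycd '
                    'ORDER BY updtime DESC LIMIT 20')
        cache['latest20'] = cur.fetchall()
    return cache['latest20']

def update_record(verycdid, title, updtime):
    cur.execute('INSERT OR REPLACE INTO verycd VALUES (?, ?, ?)',
                (verycdid, title, updtime))
    conn.commit()
    cache.pop('latest20', None)  # invalidate on any write

update_record(1, 'old', 100)
update_record(2, 'new', 200)
print(latest20()[0][1])  # 'new'
```

Every write invalidates the cached list, so readers never see stale data, and between writes the query does no disk I/O at all.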

Then come 20 small cases: fetching the publisher and the hit count for each entry:
 


SELECT owner FROM lock WHERE id=XXXX;
 
SELECT hits FROM stat WHERE id=XXXX;

Why not use an SQL join and save some round trips? Because these data are stored in different databases for architectural reasons: stat holds the hit counts and takes frequent inserts, so it lives in MySQL, while lock and verycd mostly serve selects, so they live in sqlite. Tables in different databases cannot be joined -.-

Anyway, this is not a problem; as with the previous fix, just cache everything.
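Because the tables live in separate databases, the "join" has to happen in application code: one primary-key lookup per database, combined in Python. A minimal sketch with hypothetical schemas (two in-memory sqlite databases stand in for the real MySQL/sqlite split):

```python
import sqlite3

lock_db = sqlite3.connect(':memory:')
stat_db = sqlite3.connect(':memory:')
lock_db.execute('CREATE TABLE lock (id INTEGER PRIMARY KEY, owner TEXT)')
stat_db.execute('CREATE TABLE stat (id INTEGER PRIMARY KEY, hits INTEGER)')
lock_db.execute("INSERT INTO lock VALUES (42, 'alice')")
stat_db.execute("INSERT INTO stat VALUES (42, 7)")

def owner_and_hits(vid):
    # Two single-row primary-key lookups, one per database,
    # joined in application code rather than in SQL.
    owner = lock_db.execute('SELECT owner FROM lock WHERE id=?',
                            (vid,)).fetchone()[0]
    hits = stat_db.execute('SELECT hits FROM stat WHERE id=?',
                           (vid,)).fetchone()[0]
    return owner, hits

print(owner_and_hits(42))  # ('alice', 7)
```

Both lookups are indexed and fast on their own; the cost is the 20-fold repetition per page, which is exactly what caching removes.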

So across my whole example, optimizing the performance of a web page came down to one thing: caching database queries. I believe the same is true of most web apps :)


Finally, it is memcached's turn. Since we plan to cache anyway, caching to files would still incur disk I/O; better to cache directly in memory, which is far faster than disk. That is exactly what memcached is for.

memcached is a powerful tool because it supports distributed shared-memory caching; large sites use it, and it serves small sites just as well, as long as they can spare the memory. The cache needed for the homepage is estimated at no more than 10K, and with memory as cheap as it is now, who still worries about that?

Configure and run: since this is a single machine with nothing fancy needed, just adjust the memory limit and the port:


vi /etc/memcached.conf
 
/etc/init.d/memcached restart
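On a Debian-style install (an assumption; other distributions lay the config out differently), the lines usually worth touching in memcached.conf, and a quick way to confirm the daemon answers after a restart, look roughly like this:

```shell
# Typical memcached.conf settings to adjust:
#   -m 64            # cache size in MB
#   -p 11211         # listen port
#   -l 127.0.0.1     # bind to localhost only
# After restarting, confirm the daemon responds to the stats command:
echo stats | nc 127.0.0.1 11211 | head -n 3
```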


Using it in a Python web application:


import memcache
 
mc = memcache.Client(['127.0.0.1:11211'], debug=0)

memcache is essentially a map structure, and two functions are used most often:

      • set(key, value, timeout): maps key to value; timeout is how long before the mapping expires
      • get(key): returns the value stored for key
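The semantics of those two calls can be illustrated with a toy in-process stand-in (real memcached runs out of process and is shared across workers; the TinyCache class here is purely illustrative):

```python
import time

class TinyCache:
    """A dict-based sketch of memcached's set/get with expiry."""
    def __init__(self):
        self._data = {}

    def set(self, key, value, timeout):
        # Store the value together with its expiry timestamp.
        self._data[key] = (value, time.time() + timeout)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires = item
        if time.time() >= expires:  # mapping has expired
            del self._data[key]
            return None
        return value

mc = TinyCache()
mc.set('count', 1234, 300)   # keep for 300 seconds
print(mc.get('count'))       # 1234
print(mc.get('missing'))     # None
```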

So for an ordinary SQL query you can do this:


import sqlite3
from hashlib import md5

sql = 'SELECT count(*) FROM verycd'
c = sqlite3.connect('verycd.db').cursor()

# The old way of doing it
c.execute(sql)
count = c.fetchone()[0]

# The way we do it now
key = md5(sql.encode('utf-8')).hexdigest()  # md5 needs bytes; hexdigest gives a string key
count = mc.get(key)
if count is None:  # `if not count` would also mistake a cached 0 for a miss
    c.execute(sql)
    count = c.fetchone()[0]
    mc.set(key, count, 60*5)  # cache for 5 minutes

The md5 turns arbitrary SQL text into a short, evenly distributed key. The rest of the code needs no explanation; it is intuitive.
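The pattern generalizes to any query. Here is a small cache-aside helper keyed on the md5 of the SQL text; the `cached_query` name and the dict standing in for the memcached client are my own, and the schema is invented for the demo:

```python
import sqlite3
from hashlib import md5

cache = {}
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE verycd (id INTEGER PRIMARY KEY)')
conn.executemany('INSERT INTO verycd VALUES (?)', [(i,) for i in range(5)])

def cached_query(sql):
    # Hash the SQL text into a fixed-length, evenly spread key.
    key = md5(sql.encode('utf-8')).hexdigest()
    result = cache.get(key)
    if result is None:                 # miss -> hit the database once
        result = conn.execute(sql).fetchall()
        cache[key] = result            # mc.set(key, result, 300) with real memcached
    return result

print(cached_query('SELECT count(*) FROM verycd')[0][0])  # 5
```

Note the trade-off: results served from the cache can be up to one timeout period stale, which is why writes should either invalidate the key or the timeout should be kept short.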


After optimizing statements 1 and 2, the average homepage generation time dropped to 0.02 seconds, the same order of magnitude as Discuz; after optimizing statement 3 as well, the final page generation time fell to about 0.006 seconds. A few lines of memcached code improved performance by 3300%. Time to sit up straight and enjoy the view :)

