In depth understanding of redis_memcached failure principle of summary

  • 2020-06-15 10:28:26
  • OfStack

Recently, an incomprehensible BUG appeared in the project. By default, users can enjoy the privilege of 1 fixed number of times per day. After using the privilege, they cannot enjoy it. Essentially, the redis cache is out of date, just let it expire at 12 a.m.

But the problem happened, it just didn't fail... The reason is that there is some time difference between the server time and request time because the zero boundary point has not been processed properly. It also takes time for expire of key to process to a fixed quantity of 1. These two main reasons. Add an existential judgment, invalidation and leave him 10 seconds in advance to be ready

So learn the lesson, understand the failure mechanism, but also remind you not to make such a simple mistake.

1. How can I trigger the failure of key?

In addition to calling the PERSIST command, is there any other case where the expiration time of a primary key is revoked? The answer is yes.

1. When you delete a primary key with the DEL command
2. When a primary key with expiration time set is overwritten by an update, the primary key's expiration time is also revoked.
3. The special command is RENAME, when we use RENAME to rename after a primary key, the associated failure before time will automatically transfer to the new primary key, but if a primary key is covered by the RENAME (such as primary key hello may be ordered RENAME world hello covered) and then covered the failure of the primary key time will be cancelled automatically, and the new primary key is to keep the original characteristics of the primary key.

Note that the primary key is overwritten by the update, not the corresponding Value of the primary key. Therefore, SET, MSET, or GETSET may cause the primary key to be overwritten by the update. INCR, DECR, LPUSH, HSET, etc. are all values corresponding to the update of the primary key.

2. How does Redis manage and maintain primary keys

1). Storage structure of Redis


typedef struct redisDb {
  dict *dict;// Store mappings for primary keys and values 
  dict *expires;// Stores a map of the primary key and expiration time 
  dict *blocking_keys;
  dict *ready_keys;
  dict *watched_keys;
  int id;
} redisDb;

Just look at the first two structures,
dict: Used to maintain all ES55en-ES56en pairs contained in 1 Redis database (the structure can be understood as dict[key]:value, the mapping between primary keys and values)

expires: Primary key used to maintain 1 Redis database with expiry time set (structure can be understood as expires[key]:timeout, i.e. primary key and expiry time mapping).

The primary key with the expiration time set and the specific expiration time are all maintained in the dictionary table expires

When we use the EXPIRE, EXPIREAT, PEXPIRE, and PEXPIREAT commands to set the expiration time of one primary key,
Redis first looks in the dictionary table dict to see if the primary key to be set exists, and if so, adds the primary key and expiration time to the dictionary table expires.

2). Negative methods

1. expireIfNeeded function

Trigger: This function is called in any function that accesses data, and Redis is called when implementing GET, MGET, HGET, LRANGE, and all other commands that involve reading data

Significance: Before reading the data, check whether the key fails under 1. If it fails, delete it.

2.propagateExpire function (called in the previous one) main function

Trigger: When the last function was executed, it was inside it

Meaning: Used to broadcast information about a failed primary key before it is officially deleted

Operation:
(1). One is sent to AOF file to record the operation of deleting the invalid primary key in the standard command format of DEL Key;
(2). All Slave sent to the current Redis server will be informed by the same 1 operation to delete the failed primary key in the standard command format of DEL Key to delete their respective failed primary key.

More than we know how Redis to 1 kind of negative way to delete the failure of the primary key, but only in this way is obviously not enough, because if some of the failure of the primary key slowly, etc to visit again, Redis will never know the primary key have failed, also will never remove them, this will undoubtedly lead to the waste of memory space. So here's the method.

3). Active method :(this method makes use of Redis's time events, that is, it interrupts 1 to complete 1 specified operation every 1 period of time)

1. serverCron function:

Trigger: It is created when the Redis server starts

What it does: This callback function not only checks and deletes failed primary keys, but also updates statistics, controls client connection timeouts, triggers BGSAVE and AOF, and so on

2. activeExpireCycle function:

Trigger: When the last function is executed, the number of times it executes per second is specified by the macro definition [REDIS_DEFAULT_HZ]. By default, it executes 10 times per second.

Operation:
a). Traverse the expires dictionary table of each database in Redis server, and try to randomly sample [REDIS_EXPIRELOOKUPS_PER_CRON] (the default value is 10) of the primary keys with the expiration time set.
b). Check that they have failed and delete the failed primary key.
c). If the number of failed primary keys accounts for more than 25% of the sample number, Redis will consider that there are still many failed primary keys in the current database, so it will continue to carry out random sampling and deletion in the next round.
d). Do not stop processing the current database until the previous ratio is less than 25% and move to the next database.

Others: The activeExpireCycle function avoids a primary key deletion that takes up too much of the CPU resource, so instead of trying to process all the databases in Redis once, it will only process REDIS_DBCRON_DBS_PER_CALL(default is 16) and has a processing time limit

3. How does Memcached delete the failed primary key compare with Redis?

First of all, Memcached also takes a negative approach when deleting a failed primary key, that is, Memcached does not monitor primary key failure internally, but checks primary key failure only when it is accessed through Get.

Second, the biggest difference between Memcached and Redis in primary key failure is that Memcached does not actually delete a failed primary key, as Redis does, but simply recyles the space occupied by a failed primary key. So when new data is written to the system, Memcached takes precedence over those Spaces that have failed primary keys. If you run out of space for a failed primary key, Memcached can also use the LRU mechanism to reclaim space that has long been inaccessible,

Therefore, Memcached does not require periodic deletes like in Redis, which is also determined by the memory management mechanism used by Memcached. At the same time, it should be noted that Redis can also be configured with the parameter maxmemory-ES206en to decide whether to adopt the LRU mechanism to reclaim memory space when OOM appears

4. Conclusion:

redis performs 10 expiration checks per second. In each check, 10 key are randomly selected from the expire table of a library to detect whether it is invalid. If it is invalid, it will be deleted. When the failure ratio exceeds 1/4, the random selection 10key will be re-executed this time, which will not be counted as one of the 10 times until all the 10 times of 1 second have been executed.

Q:

So somebody said, 10,000, 10,000! The failed key does enough, and the 10 times in 1 second are not completed, and the next second, how to do?

A:

If redis has detection mechanism, it will not drag CPU to death:

a. Limit the number of databases to be processed at a time,

The limit of the number of times the b. activeExpireCycle function is executed in 1 second,

c. Limit of CPU time allocated to activeExpireCycle function,

d. Continue to remove the primary key failure percentage limit. Redis has significantly reduced the impact of primary key failure on the overall performance of the system.

Therefore, the principle of setting the failure time can also be obtained: avoid the failure of large quantities of key at the same time point as far as possible, which requires processing time.

Main functions of negative failure.propagateExpire:


void propagateExpire(redisDb *db, robj *key) {
  robj *argv[2];
  //shared.del Is in the Redis The server is already initialized when it is started 1 A commonly used Redis Objects, that is, DEL The command 
  argv[0] = shared.del;
  argv[1] = key;
  incrRefCount(argv[0]);
  incrRefCount(argv[1]);
  // check Redis Is the server on AOF If it is turned on, it is recorded as a failed primary key 1 article DEL The log 
  if (server.aof_state != REDIS_AOF_OFF)
    feedAppendOnlyFile(server.delCommand,db->id,argv,2);
  // check Redis Whether the server owns it Slave If yes, to all Slave send DEL Invalid primary key command, here it is 
  // The above expireIfNeeded Find yourself in a function Slave There is no need to actively delete the cause of the failed primary key because of it 
  // Just listen to Master The command that was sent OK the 
  if (listLength(server.slaves))
    replicationFeedSlaves(server.slaves,db->id,argv,2);
  decrRefCount(argv[0]);
  decrRefCount(argv[1]);
}

Active failure main function.activeExpireCycle function:


void activeExpireCycle(void) {
  // Because every time activeExpireCycle The function will not 1 Subsexual check all Redis Database, so it needs to be recorded 
  // At the end of each function call 1 a Redis The number of the database, so the next call activeExpireCycle function 
  // You can also continue processing from this database, and here it is current_db Be declared static The reason while the other 1
  // A variable timelimit_exit For the record 1 Time to call activeExpireCycle Is the execution time of the function reached 
  // I'm running out of time, so I need to declare it static
  static unsigned int current_db = 0;
  static int timelimit_exit = 0;
  unsigned int j, iteration = 0;
  // Each call activeExpireCycle Functionally processed Redis The number of databases is REDIS_DBCRON_DBS_PER_CALL
  unsigned int dbs_per_call = REDIS_DBCRON_DBS_PER_CALL;
  long long start = ustime(), timelimit;
  // If the current Redis The number of databases in the server is less than REDIS_DBCRON_DBS_PER_CALL , then all the databases are processed, 
  // If the 1 Time to call activeExpireCycle The execution time of the function has reached the time limit, indicating that there are more failed primary keys 
  // All databases are selected for processing 
  if (dbs_per_call > server.dbnum || timelimit_exit)
    dbs_per_call = server.dbnum;
  // perform activeExpireCycle The longest time of the function (in microseconds), where REDIS_EXPIRELOOKUPS_TIME_PERC
  // It's a unit of time that can be allocated to activeExpireCycle Functionally executed CPU Time ratio, the default value is 25 . server.hz
  // Is the 1 Seconds. activeExpireCycle The number of calls, so the more straightforward way of writing this formula is to say 
  (1000000 * (REDIS_EXPIRELOOKUPS_TIME_PERC / 100)) / server.hz
  timelimit = 1000000*REDIS_EXPIRELOOKUPS_TIME_PERC/server.hz/100;
  timelimit_exit = 0;
  if (timelimit <= 0) timelimit = 1;
  // Iterate through each Redis Invalid data in the database 
  for (j = 0; j < dbs_per_call; j++) {
    int expired;
    redisDb *db = server.db+(current_db % server.dbnum);
    // This will be done immediately current_db add 1 , which ensures that even if you can't delete all the current ones within the time limit this time 
    // Invalid primary key in database, down 1 Time to call activeExpireCycle1 The sample will be from the bottom 1 I'm going to start processing, 
    // This ensures that every database has a chance to be processed 
    current_db++;
    // Starts processing failed primary keys in the current database 
    do {
      unsigned long num, slots;
      long long now;
      // if expires The dictionary table size is 0 , indicating that there is no primary key to set the expiration time in the database 
      //1 The database 
      if ((num = dictSize(db->expires)) == 0) break;
      slots = dictSlots(db->expires);
      now = mstime();
      // if expires The dictionary table is not empty, but its fill rate is insufficient 1% , then the cost of randomly selecting the primary key for checking 
      // It's going to be very high, so I'll just check it right here 1 The database 
      if (num && slots > DICT_HT_INITIAL_SIZE &&
        (num*100/slots < 1)) break;
      expired = 0;
      // if expires In the dictionary table entry If the number is not enough to reach the number of samples, all are selected key As a sample 
      if (num > REDIS_EXPIRELOOKUPS_PER_CRON)
        num = REDIS_EXPIRELOOKUPS_PER_CRON;
      while (num--) {
        dictEntry *de;
        long long t;
        // Random access 1 Three primary keys with expiration time set to check if they have expired 
        if ((de = dictGetRandomKey(db->expires)) == NULL) break;
        t = dictGetSignedIntegerVal(de);
        if (now > t) {
          // Delete the primary key if it is found to be invalid 
          sds key = dictGetKey(de);
          robj *keyobj = createStringObject(key,sdslen(key));
          // Also broadcast the invalidation of the primary key before deletion 
          propagateExpire(db,keyobj);
          dbDelete(db,keyobj);
          decrRefCount(keyobj);
          expired++;
          server.stat_expiredkeys++;
        }
      }
      // In every 1 After the subsample is deleted, the pair iteration add 1 , each 16 After the second sampling is deleted, check whether the execution time is this time 
      // The time limit has been reached. If the time limit has been reached, record that the execution has reached the time limit and exits 
      iteration++;
      if ((iteration & 0xf) == 0 &&
        (ustime()-start) > timelimit)
      {
        timelimit_exit = 1;
        return;
      }
    // If the number of failed primary keys as a percentage of the sample is greater than 25% , then continue the sampling deletion process 
    } while (expired > REDIS_EXPIRELOOKUPS_PER_CRON/4);
  }
}

Related articles: