How to implement autocomplete with Redis

  • 2020-06-23 02:14:31
  • OfStack

I forget which version of Redis introduced it, but redis-cli can give hints based on the partial command prefix you have typed, i.e. automatic completion. Next, let's implement this cool feature on top of Redis itself.

About the sorted set

Suppose we have three names: mara, marabel, and marcela. When we type mar, we want all three names back, sorted lexicographically. Before implementing this requirement, let's take a brief look at the sorted set.

As everyone knows, a sorted set orders its members by score:


127.0.0.1:6380> zadd test 85 sida
127.0.0.1:6380> zadd test 80 xiaolang
127.0.0.1:6380> zadd test 60 afei
127.0.0.1:6380> zadd test 90 chenssy
127.0.0.1:6380> zadd test 98 yunaiv
127.0.0.1:6380> zrange test 0 -1
1) "afei"
2) "xiaolang"
3) "sida"
4) "chenssy"
5) "yunaiv"

But what if all the scores are the same? Then the sorted set falls back to lexicographic (dictionary) order:


127.0.0.1:6380> zadd exam 0 sida
127.0.0.1:6380> zadd exam 0 xiaolang
127.0.0.1:6380> zadd exam 0 chenssy
127.0.0.1:6380> zadd exam 0 yunaiv
127.0.0.1:6380> zadd exam 0 afei
127.0.0.1:6380> zrange exam 0 -1
1) "afei"
2) "chenssy"
3) "sida"
4) "xiaolang"
5) "yunaiv"

This is a very important property of the sorted set and a key building block for our autocomplete requirement. But it is not enough on its own; we still have some way to go. Fortunately, the sorted set has another handy command, ZRANK, which tells you the position of a member within the sorted set:


127.0.0.1:6380> zrank exam sida
(integer) 2
127.0.0.1:6380> zrank exam yunaiv
(integer) 4
127.0.0.1:6380> zrange exam 2 4
1) "sida"
2) "xiaolang"
3) "yunaiv"

At this point we are very close to a first version of autocomplete: given any member, we can find its position in the lexicographically ordered set and then fetch the N members that follow it.

Simple implementation

To achieve full autocomplete, we have to pay a price: space.

For every member we want to add to the sorted set, such as afei, we add not only the full word (afei) but also all of its prefixes (a, af, afe, afei). To tell whether an entry is a real word or merely a prefix of one (for example, bar is both a full word and a prefix of a word such as bark), we use a small trick: append a special character, such as "$", to every real word. So adding afei actually inserts these entries:


a, af, afe, afei, afei$
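As a concrete illustration, here is a minimal Python sketch of the insertion step, assuming the redis-py client and a Redis instance on port 6380; the key name autoc and the helper add_word are illustrative names of my own, not part of any standard API:


import redis

r = redis.Redis(host="localhost", port=6380)

def add_word(key, word):
    # Add every prefix of the word with score 0 ...
    for i in range(1, len(word) + 1):
        r.zadd(key, {word[:i]: 0})
    # ... and the full word terminated with "$" to mark a real entry.
    r.zadd(key, {word + "$": 0})

add_word("autoc", "afei")   # inserts a, af, afe, afei, afei$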

Now suppose we add three words as a test: foo, bar, and foobar. The sorted set then contains the following entries:


127.0.0.1:6380> zrange autoc 0 -1
 1) "b"
 2) "ba"
 3) "bar"
 4) "bar$"
 5) "f"
 6) "fo"
 7) "foo"
 8) "foo$"
 9) "foob"
10) "fooba"
11) "foobar"
12) "foobar$"

We are now much closer to the final goal. Suppose the user enters "fo". To produce completions, we run the commands below. Look carefully at the results: foo$ and foobar$ are the entries we want; we only have to strip the special suffix $ (entries not ending in $ are ignored):


127.0.0.1:6380> zrank autoc fo
(integer) 5
127.0.0.1:6380> zrange autoc 5 -1
1) "fo"
2) "foo"
3) "foo$"
4) "foob"
5) "fooba"
6) "foobar"
7) "foobar$"

Testing with more words

The file http://antirez.com/misc/female-names.txt provides nearly 5,000 female names. Add every name and all of its prefixes to the sorted set as described above. Then, assuming the user has typed member, the following two commands fetch the 50 entries starting at its lexicographic position; we keep only the ones ending in $:


127.0.0.1:6380> zrank autoc member
(integer) 8
127.0.0.1:6380> zrange autoc 8 57
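Putting the two commands together, here is a rough Python sketch of the lookup, again assuming redis-py; complete() is a hypothetical helper of my own, and the 50-entry window mirrors the commands above:


import redis

r = redis.Redis(host="localhost", port=6380)

def complete(key, prefix, count=50):
    # Find the rank of the typed prefix, then walk forward in windows of 50 entries,
    # keeping only entries that end with "$" and still share the typed prefix.
    results = []
    start = r.zrank(key, prefix)
    if start is None:                 # the prefix was never indexed
        return results
    while len(results) < count:
        entries = r.zrange(key, start, start + 49)
        if not entries:
            break
        start += len(entries)
        for entry in entries:
            entry = entry.decode()
            if not entry.startswith(prefix):   # past the prefix range, stop
                return results[:count]
            if entry.endswith("$"):
                results.append(entry[:-1])     # strip the "$" marker
    return results[:count]

print(complete("autoc", "mar"))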

Time & space complexity

According to the official documentation, the time complexity of ZRANK is O(log(N)) and that of ZRANGE is O(log(N)+M). So even with a fairly large data set, this scheme remains feasible.

[

ZRANK key member
Initial version: 2.0.0
Time complexity: O(log(N))

ZRANGE key start stop [WITHSCORES]
Initial version: 1.2.0
Time complexity: O(log(N)+M)

]

So how much memory is needed?

In the worst case, a word of length M requires adding M+1 entries to the sorted set (the M prefixes plus the $-terminated word). So for N words, roughly N*(Ma+1) entries are needed in total, where Ma is the average length of the N words.

Fortunately, reality is much better than that. For the nearly 5,000 female names, for example, only about 15,000 entries need to be added to the sorted set, thanks to the large number of shared prefixes. Take three names as an example: marci, marcia, marcile. In the worst case, 3*(6+1)=21 entries would be needed; in reality, only 11 have to be added:


m, ma, mar, marc, marci, marci$, marcia, marcia$, marcil, marcile, marcile$

Moreover, as the sample grows larger, the proportion of shared prefixes increases.
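If you want to double-check the count above, a few lines of plain Python (no Redis needed) reproduce the 11 entries; the variable names are arbitrary:


names = ["marci", "marcia", "marcile"]
entries = set()
for name in names:
    entries.update(name[:i] for i in range(1, len(name) + 1))  # all prefixes
    entries.add(name + "$")                                    # the $-terminated word
print(len(entries))        # 11
print(sorted(entries))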

So the memory requirement is acceptable as well. Not bad, right?

Predictive queries

Our autocomplete returns results in lexicographic order, and much of the time that is exactly what we need. But sometimes we want to rank by relevance instead, the way Google orders suggestions by frequency and popularity. For example, when a user types ne, they probably expect to see popular keywords such as netflix, news, and new york times.

That is not as easy. To rank by frequency, we need a dedicated sorted set per prefix, whose member frequencies are updated in real time, in a non-blocking way:

[

When the user submits a query such as "foo", its prefixes are "f", "fo", and "foo". For each prefix we execute: ZINCRBY <prefix> 1 foo. If the user submits "foobar", we execute ZINCRBY <prefix> 1 foobar for each of its prefixes (a code sketch follows this quote).

]
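A rough sketch of that update step, assuming redis-py; the key naming scheme "prefix:" + p is my own choice here, not something prescribed by the article:


import redis

r = redis.Redis(host="localhost", port=6380)

def record_query(query):
    # Bump the query's score by 1 in the sorted set of every prefix of the query.
    for i in range(1, len(query) + 1):
        r.zincrby("prefix:" + query[:i], 1, query)

record_query("foobar")   # ZINCRBY prefix:f 1 foobar, ZINCRBY prefix:fo 1 foobar, ...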

One problem with this approach is that most autocomplete systems only need to show the top N keywords, say N = 10. But to compute an exact top 10 this way we would need far more than 10 keywords; in theory we would have to keep every member ever seen for a prefix in its sorted set.

Fortunately, statistically speaking, capping each prefix's sorted set at around 300 members is enough to get a reliable top-5 list: the more frequent a query, the higher its score and the more likely it is to be in the set. So all we have to do, for each prefix of the submitted query, is this:

If the sorted set keyed by that prefix has fewer than 300 members, simply ZINCRBY. If it has already reached 300 members, first delete the member with the lowest score, then add the new one, so that the keyword frequencies keep rotating over time (a worked example and a code sketch follow). [

This may not click immediately, so let's walk through an example. Suppose the user enters next, and the sorted set keyed by the prefix n already contains netflix(100), news(120), new york(80), near(23), nequ(1). Since this set has not yet reached 300 members, we simply run: zincrby n 1 next, where n is the prefix and next is the keyword the user entered. Now suppose instead that the set for prefix n contains netflix(100), news(120), new york(80), near(23), nequ(1), plus another 295 keywords with scores greater than 1 (omitted here). Since the set has reached 300 members, we first delete the lowest-scoring member, nequ, and then run: zincrby n 1 next.

]
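Here is one possible sketch of the capped update, again with redis-py; MAX_MEMBERS and the key scheme are illustrative, and the check-then-update below is not atomic (a Lua script or MULTI/EXEC would be needed for that):


import redis

r = redis.Redis(host="localhost", port=6380)
MAX_MEMBERS = 300

def record_query_capped(query):
    for i in range(1, len(query) + 1):
        key = "prefix:" + query[:i]
        # If the set is full and this query is not already a member,
        # evict the lowest-scoring member before adding the new one.
        if r.zcard(key) >= MAX_MEMBERS and r.zscore(key, query) is None:
            r.zremrangebyrank(key, 0, 0)
        r.zincrby(key, 1, query)

record_query_capped("next")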

How well this works depends on the distribution of the keywords users type. If every keyword were entered with roughly the same frequency, the numbers produced by this method would not be very reliable. In practice that is not a problem, because most searches concentrate on a small set of popular keywords.

Cleanup phase

Thanks to the long tail of search queries mentioned above, we can periodically remove members with a score of 1, since users care very little about those keywords. The cleanup also reduces memory usage, letting Redis hold more of the data that actually matters.
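A possible cleanup job, sketched with redis-py; how and when to run it (a cron job, a background thread) is up to you, and the key name is again illustrative:


import redis

r = redis.Redis(host="localhost", port=6380)

def cleanup(prefix_key):
    # Remove every member whose score is still exactly 1: the long tail
    # of keywords nobody searched for a second time.
    return r.zremrangebyscore(prefix_key, 1, 1)

removed = cleanup("prefix:ne")
print(removed, "rarely used keywords removed")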

Summary

In the sorted set data structure, members with the same score are ordered lexicographically. The ZRANK command returns a member's position in the set, from which we can fetch the N members that follow it. ZINCRBY increments the score of a member in the specified sorted set.

Reference: http://oldblog.antirez.com/post/autocomplete-with-redis.html
