Using Redis sorted sets to implement IP geolocation lookup
- 2020-06-19 12:00:49
- OfStack
A requirement I often run into at work is looking up the location information for an IP address based on its address segment. If this query is handled by a relational database, it causes heavy IO and is simply not fast enough, which is clearly inappropriate.
So what are the better options? I made a few attempts, detailed below.
Building an index file
I came across the ip2region project on GitHub, where the author achieves fast queries by generating a file that contains a two-level index; it is fast enough, at millisecond level. But if you want to update an address segment or the location information, it is inconvenient to have to regenerate the whole file every time.
Still, I recommend taking a look at this project: its approach to building the index is worth learning from. The open-source project only contains the query code, not the code that generates the index file, so I wrote the generation code myself based on the schematic diagram, as follows:
# -*- coding:utf-8 -*-
import time
import socket
import struct

IP_REGION_FILE = './data/ip_to_region.db'

SUPER_BLOCK_LENGTH = 8
INDEX_BLOCK_LENGTH = 12
HEADER_INDEX_LENGTH = 8192


def generate_db_file():
    pointer = SUPER_BLOCK_LENGTH + HEADER_INDEX_LENGTH

    region, index = '', ''

    # File format:
    # 1.0.0.0|1.0.0.255|Australia|0|0|0|0
    # 1.0.1.0|1.0.3.255|China|0|Fujian|Fuzhou|Telecom
    with open('./ip.merge.txt', 'r') as f:
        for line in f.readlines():
            item = line.strip().split('|')
            print item[0], item[1], item[2], item[3], item[4], item[5], item[6]
            start_ip = struct.pack('I', struct.unpack('!L', socket.inet_aton(item[0]))[0])
            end_ip = struct.pack('I', struct.unpack('!L', socket.inet_aton(item[1]))[0])
            region_item = '|'.join([item[2], item[3], item[4], item[5], item[6]])
            region += region_item
            # High byte of ptr: length of the region record; low 3 bytes: its file offset
            ptr = struct.pack('I', int(bin(len(region_item))[2:].zfill(8) + bin(pointer)[2:].zfill(24), 2))
            index += start_ip + end_ip + ptr
            pointer += len(region_item)

    # The super block stores the offsets of the first and last index blocks
    index_start_ptr = pointer
    index_end_ptr = pointer + len(index) - 12
    super_block = struct.pack('I', index_start_ptr) + struct.pack('I', index_end_ptr)

    # The header index: one (start IP, offset) entry per 8184-byte slice of the index area
    n = 0
    header_index = ''
    for index_block in range(pointer, index_end_ptr, 8184):
        header_index_block_ip = index[n * 8184:n * 8184 + 4]
        header_index_block_ptr = index_block
        header_index += header_index_block_ip + struct.pack('I', header_index_block_ptr)
        n += 1
    header_index += index[len(index) - 12: len(index) - 8] + struct.pack('I', index_end_ptr)

    with open(IP_REGION_FILE, 'wb') as f:
        f.write(super_block)
        f.write(header_index)
        f.seek(SUPER_BLOCK_LENGTH + HEADER_INDEX_LENGTH, 0)
        f.write(region)
        f.write(index)


if __name__ == '__main__':
    start_time = time.time()
    generate_db_file()
    print 'cost time: ', time.time() - start_time
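For completeness, here is a rough sketch (in Python 3) of how the generated file could be searched: a plain binary search over the 12-byte index blocks, decoding the length/offset packed into `ptr`. The `search_ip` helper is my own invention for illustration, not part of ip2region's published API, and it skips the header index for simplicity.

```python
import socket
import struct


def search_ip(db_path, ip_str):
    """Binary-search the 12-byte index blocks written by generate_db_file."""
    ip = struct.unpack('!L', socket.inet_aton(ip_str))[0]
    with open(db_path, 'rb') as f:
        data = f.read()
    # Super block: offsets of the first and last index blocks
    index_start, index_end = struct.unpack('<II', data[:8])
    lo, hi = 0, (index_end - index_start) // 12
    while lo <= hi:
        mid = (lo + hi) // 2
        off = index_start + mid * 12
        start_ip, end_ip, ptr = struct.unpack('<III', data[off:off + 12])
        if ip < start_ip:
            hi = mid - 1
        elif ip > end_ip:
            lo = mid + 1
        else:
            # High byte of ptr: record length; low 3 bytes: record offset
            length, region_off = ptr >> 24, ptr & 0xFFFFFF
            return data[region_off:region_off + length].decode('utf-8')
    return None
```

This is the part the header index accelerates in the real project: it narrows the binary-search window before the loop above runs.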
Using Redis as the cache
There are two ways to cache the IP segments and their location information:
The first is to convert the starting IP, the ending IP, and every IP in between to integers, use each converted IP as a string key, and store the location information as the value in Redis.
The second is to combine a sorted set with a hash. First, the starting and ending IPs are added to the sorted set ip2cityid, with the city ID as the member and the converted IP as the score. Then the city ID and the location information are added to the hash cityid2city, with the city ID as the key and the location information as the value.
The first way needs little introduction: it is simple and crude, and not recommended. Queries are certainly fast, at millisecond level, but the drawbacks are just as obvious: in my test with 1,000 records, caching took about 20 minutes and used nearly 1 GB of memory.
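To see why the first scheme balloons, here is a small sketch (in Python 3) of what it does: every address segment is expanded into one key per individual IP. A plain dict stands in for Redis string keys here, and `cache_range` and `ip_to_num` are made-up helper names for illustration.

```python
import socket
import struct


def ip_to_num(ip_str):
    # Convert a dotted-quad IP to its integer value
    return struct.unpack('!L', socket.inet_aton(ip_str))[0]


def cache_range(store, start_ip, end_ip, info):
    """First scheme: one string key per individual IP in the segment.

    A dict stands in for Redis; with SET this would create
    end - start + 1 keys for a single segment.
    """
    for ip in range(ip_to_num(start_ip), ip_to_num(end_ip) + 1):
        store['ip:%d' % ip] = info
```

Even one /16 segment already produces 65,536 keys this way, which is why a modest data set ends up costing so much time and memory.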
For the second way, let's look directly at the code:
# generate_to_redis.py
# -*- coding:utf-8 -*-
import time
import json
from redis import Redis


def ip_to_num(x):
    return sum([256 ** j * int(i) for j, i in enumerate(x.split('.')[::-1])])


# Connect to Redis
conn = Redis(host='127.0.0.1', port=6379, db=10)

start_time = time.time()

# File format:
# 1.0.0.0|1.0.0.255|Australia|0|0|0|0
# 1.0.1.0|1.0.3.255|China|0|Fujian|Fuzhou|Telecom
with open('./ip.merge.txt', 'r') as f:
    i = 1
    for line in f.readlines():
        item = line.strip().split('|')
        # Add the start IP and end IP to the sorted set ip2cityid;
        # the members are the city ID and ID + '#', the scores are the
        # integer values of the IPs (member/score order follows the
        # redis-py 2.x zadd signature)
        conn.zadd('ip2cityid', str(i), ip_to_num(item[0]), str(i) + '#', ip_to_num(item[1]) + 1)
        # Add the city information to the hash cityid2city: the key is the
        # city ID, the value is the JSON-serialized city information
        conn.hset('cityid2city', str(i), json.dumps([item[2], item[3], item[4], item[5]]))
        i += 1

end_time = time.time()

print 'start_time: ' + str(start_time) + ', end_time: ' + str(end_time) + ', cost time: ' + str(end_time - start_time)
# test.py
# -*- coding:utf-8 -*-
import sys
import time
import json
import socket
import struct
from redis import Redis

# Connect to Redis
conn = Redis(host='127.0.0.1', port=6379, db=10)

# Convert the IP to an integer
ip = struct.unpack("!L", socket.inet_aton(sys.argv[1]))[0]

start_time = time.time()

# Scan the sorted set from high score to low and take the first member
# whose score is less than or equal to the input IP
cityid = conn.zrevrangebyscore('ip2cityid', ip, 0, start=0, num=1)

# If cityid is empty, or the matched member ends with '#',
# no corresponding address segment was found
if not cityid or cityid[0].endswith('#'):
    print 'no city info...'
else:
    # Fetch the city information from the hash by city ID
    ret = json.loads(conn.hget('cityid2city', cityid[0]))
    print ret[0], ret[1], ret[2]

end_time = time.time()

print 'start_time: ' + str(start_time) + ', end_time: ' + str(end_time) + ', cost time: ' + str(end_time - start_time)
# python generate_to_redis.py
start_time: 1554300310.31, end_time: 1554300425.65, cost time: 115.333260059
# python test.py 1.0.16.0
Japan 0 0
start_time: 1555081532.44, end_time: 1555081532.45, cost time: 0.000912189483643
The test data is about 500,000 records; caching took less than 2 minutes, used 182 MB of memory, and queries are at millisecond level. Clearly, this approach is worth trying.
The time complexity of zrevrangebyscore is O(log(N) + M), where N is the cardinality of the sorted set and M is the number of results returned. The larger N is, the slower the query becomes; exactly how large the data set can grow while queries stay efficient would need to be verified, but at this scale I don't think it is a problem.
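The interval trick itself can be modelled without a Redis server. Below is a hedged in-memory stand-in (Python 3, using only the stdlib `bisect` module) for ZADD plus the ZREVRANGEBYSCORE ... LIMIT 0 1 query, including the `#` sentinel that marks the gap after each segment; `zadd` and `lookup` are illustrative names, not redis-py calls.

```python
import bisect

# Parallel lists modelling the ip2cityid sorted set: scores kept ascending,
# members kept in insertion order for equal scores (as in the load loop)
scores, members = [], []


def zadd(member, score):
    # Insert after any equal scores so a segment start at score s
    # shadows the previous segment's end marker at the same score
    pos = bisect.bisect_right(scores, score)
    scores.insert(pos, score)
    members.insert(pos, member)


def lookup(ip_num):
    """Greatest score <= ip_num, i.e. ZREVRANGEBYSCORE key ip 0 LIMIT 0 1."""
    pos = bisect.bisect_right(scores, ip_num)
    if pos == 0:
        return None                      # below the first segment
    member = members[pos - 1]
    # An end marker means ip_num fell into a gap between segments
    return None if member.endswith('#') else member
```

Loading mirrors the Redis version: for segment i, add member `str(i)` at the start IP and member `str(i) + '#'` at end IP + 1; `bisect` then gives the same O(log N) lookup the sorted set provides.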