Using a Redis sorted set to implement IP location lookup

  • 2020-06-19 12:00:49
  • OfStack

A requirement I often run into at work: given an IP address, find the location information for the address range it falls into. Doing this lookup in a relational database causes heavy IO and is simply too slow, which is clearly inappropriate.

So what better options are there? I made a few attempts, detailed below.

Building an index file

I came across the ip2region project on GitHub. The author achieves fast lookups, at millisecond level, by generating a file that contains a two-level index. The drawback is that whenever an address range or its location information changes, the whole file has to be regenerated, which is not very convenient.
Still, I recommend taking a look at the project, because the way it builds the index is worth learning from. The author's open-source repository only contains the query code, not the code that generates the index file, so I wrote a generator based on the format diagram, as follows:


# -*- coding:utf-8 -*-


import time
import socket
import struct

IP_REGION_FILE = './data/ip_to_region.db'

SUPER_BLOCK_LENGTH = 8
INDEX_BLOCK_LENGTH = 12
HEADER_INDEX_LENGTH = 8192


def generate_db_file():
  # region data is written right after the 8-byte super block and the 8 KB header index
  pointer = SUPER_BLOCK_LENGTH + HEADER_INDEX_LENGTH

  region, index = '', ''

  # Input file format (one range per line):
  # 1.0.0.0|1.0.0.255|Australia|0|0|0|0
  # 1.0.1.0|1.0.3.255|China|0|Fujian province|fuzhou|telecom
  with open('./ip.merge.txt', 'r') as f:
    for line in f.readlines():
      item = line.strip().split('|')
      print item[0], item[1], item[2], item[3], item[4], item[5], item[6]
      start_ip = struct.pack('I', struct.unpack('!L', socket.inet_aton(item[0]))[0])
      end_ip = struct.pack('I', struct.unpack('!L', socket.inet_aton(item[1]))[0])
      region_item = '|'.join([item[2], item[3], item[4], item[5], item[6]])
      region += region_item

      # index block: 4-byte start IP + 4-byte end IP + 4-byte pointer whose high
      # 8 bits hold the record length and low 24 bits the offset of the region data
      ptr = struct.pack('I', int(bin(len(region_item))[2:].zfill(8) + bin(pointer)[2:].zfill(24), 2))
      index += start_ip + end_ip + ptr
      pointer += len(region_item)

  # the super block stores the absolute offsets of the first and last index blocks
  index_start_ptr = pointer
  index_end_ptr = pointer + len(index) - 12
  super_block = struct.pack('I', index_start_ptr) + struct.pack('I', index_end_ptr)

  # build the header index: one entry per 8184 bytes of index data (682 blocks of 12 bytes),
  # each entry holding the first start IP of that chunk and the chunk's file offset
  n = 0
  header_index = ''
  for index_block in range(pointer, index_end_ptr, 8184):
    header_index_block_ip = index[n * 8184:n * 8184 + 4]
    header_index_block_ptr = index_block
    header_index += header_index_block_ip + struct.pack('I', header_index_block_ptr)

    n += 1

  # append a final header entry pointing at the last index block
  header_index += index[len(index) - 12: len(index) - 8] + struct.pack('I', index_end_ptr)

  # file layout: super block (8 bytes) + header index (8192 bytes) + region data + index blocks
  with open(IP_REGION_FILE, 'wb') as f:
    f.write(super_block)
    f.write(header_index)
    f.seek(SUPER_BLOCK_LENGTH + HEADER_INDEX_LENGTH, 0)
    f.write(region)
    f.write(index)


if __name__ == '__main__':
  start_time = time.time()
  generate_db_file()

  print 'cost time: ', time.time() - start_time
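
For completeness, here is a minimal sketch of how the generated file could be queried. This is not the query code from the ip2region project: it ignores the header index and simply binary-searches the flat index blocks (assuming ip.merge.txt is sorted by IP, which the sample lines suggest), and the function name search_ip is my own placeholder.


# -*- coding:utf-8 -*-

import socket
import struct

IP_REGION_FILE = './data/ip_to_region.db'


def search_ip(ip_str):
  # convert the dotted IP into an integer, the same way the generator does
  ip = struct.unpack('!L', socket.inet_aton(ip_str))[0]

  with open(IP_REGION_FILE, 'rb') as f:
    data = f.read()

  # super block: absolute offsets of the first and last 12-byte index blocks
  index_start_ptr, index_end_ptr = struct.unpack('II', data[:8])

  low, high = 0, (index_end_ptr - index_start_ptr) // 12
  while low <= high:
    mid = (low + high) // 2
    offset = index_start_ptr + mid * 12
    start_ip, end_ip, ptr = struct.unpack('III', data[offset:offset + 12])
    if ip < start_ip:
      high = mid - 1
    elif ip > end_ip:
      low = mid + 1
    else:
      # high 8 bits: record length, low 24 bits: absolute offset of the region data
      length, data_ptr = ptr >> 24, ptr & 0x00FFFFFF
      return data[data_ptr:data_ptr + length]

  return None


if __name__ == '__main__':
  print(search_ip('1.0.1.0'))

The header index only serves to narrow the binary search down to one 8184-byte chunk of index blocks; skipping it here costs a few extra comparisons but keeps the sketch short.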

Using Redis as a cache

There are two ways to cache the IP ranges and their location information:

The first is to expand each range into every IP it contains (the start IP, the end IP, and everything in between), convert each one to an integer, use the integer as a string key, and store the location information as the value in Redis.

The second uses a sorted set together with a hash. First, the start and end IPs are added to the sorted set ip2cityid, with the city ID as the member and the converted IP as the score. Then the city ID and the location information are added to the hash cityid2city, with the city ID as the field and the location information as the value.
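
To make the layout concrete, here is what the two sample lines from the file format shown earlier would produce, written as the equivalent redis-py calls (a sketch assuming redis-py >= 3.0, where zadd takes a {member: score} mapping):


# -*- coding:utf-8 -*-

from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, db=10)

# 1.0.0.0|1.0.0.255|Australia|0|0|0|0  ->  city ID 1
# member "1" scores the start of the range, member "1#" scores the end + 1
conn.zadd('ip2cityid', {'1': 16777216, '1#': 16777472})
conn.hset('cityid2city', '1', '["Australia", "0", "0", "0"]')

# 1.0.1.0|1.0.3.255|China|0|Fujian province|fuzhou|telecom  ->  city ID 2
conn.zadd('ip2cityid', {'2': 16777472, '2#': 16778240})
conn.hset('cityid2city', '2', '["China", "0", "Fujian province", "fuzhou"]')

The '#' suffix marks the end of a range, so a lookup that lands on such a member can tell that the queried IP fell into a gap between ranges.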

There is not much to say about the first way: it is simple and crude, and I do not recommend it. Queries are certainly fast, at millisecond level, but the drawbacks are just as obvious: in a test with only 1,000 address ranges, caching took about 20 minutes and used nearly 1 GB of memory.
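
To see why, here is a minimal sketch of what the first approach could look like; the key prefix ip2city is a hypothetical name of my own. The point is that every single address in every range becomes its own key.


# -*- coding:utf-8 -*-

import json
from redis import Redis

conn = Redis(host='127.0.0.1', port=6379, db=10)


def ip_to_num(x):
  return sum([256 ** j * int(i) for j, i in enumerate(x.split('.')[::-1])])


with open('./ip.merge.txt', 'r') as f:
  for line in f:
    item = line.strip().split('|')
    info = json.dumps([item[2], item[3], item[4], item[5]])
    # one string key per individual IP in the range -- this is why the load
    # time and the memory usage blow up (xrange, since ranges can be large)
    for ip in xrange(ip_to_num(item[0]), ip_to_num(item[1]) + 1):
      conn.set('ip2city:%d' % ip, info)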

For the second approach, let's look at the code directly:


# generate_to_redis.py
# -*- coding:utf-8 -*-

import time
import json
from redis import Redis


def ip_to_num(x):
  return sum([256 ** j * int(i) for j, i in enumerate(x.split('.')[::-1])])


# Connect to Redis
conn = Redis(host='127.0.0.1', port=6379, db=10)

start_time = time.time()

# Input file format (one range per line):
# 1.0.0.0|1.0.0.255|Australia|0|0|0|0
# 1.0.1.0|1.0.3.255|China|0|Fujian province|fuzhou|telecom
with open('./ip.merge.txt', 'r') as f:
  i = 1
  for line in f.readlines():
    item = line.strip().split('|')
    # Add the start IP and the end IP to the sorted set ip2cityid.
    # The members are the city ID and the city ID + '#'; the scores are the
    # integer values computed from the IPs.
    # Note: redis-py >= 3.0 expects a {member: score} mapping; older versions
    # took positional member/score arguments instead.
    conn.zadd('ip2cityid', {str(i): ip_to_num(item[0]), str(i) + '#': ip_to_num(item[1]) + 1})
    # Add the city information to the hash cityid2city: the field is the city ID,
    # the value is the city information serialized as JSON.
    conn.hset('cityid2city', str(i), json.dumps([item[2], item[3], item[4], item[5]]))

    i += 1

end_time = time.time()

print 'start_time: ' + str(start_time) + ', end_time: ' + str(end_time) + ', cost time: ' + str(end_time - start_time)


# test_2.py
# -*- coding:utf-8 -*-

import sys
import time
import json
import socket
import struct
from redis import Redis

# Connect to Redis
conn = Redis(host='127.0.0.1', port=6379, db=10)

# Convert the IP to an integer
ip = struct.unpack("!L", socket.inet_aton(sys.argv[1]))[0]

start_time = time.time()
# Scan the sorted set from high score to low and take the first member whose score is <= the queried IP
cityid = conn.zrevrangebyscore('ip2cityid', ip, 0, start=0, num=1)
# If cityid is empty, or the matched member ends with '#', no matching address range was found
if not cityid or cityid[0].endswith('#'):
  print 'no city info...'
else:
  # Look up the city information in the hash by city ID
  ret = json.loads(conn.hget('cityid2city', cityid[0]))
  print ret[0], ret[1], ret[2]

end_time = time.time()
print 'start_time: ' + str(start_time) + ', end_time: ' + str(end_time) + ', cost time: ' + str(end_time - start_time)


# python generate_to_redis.py 
start_time: 1554300310.31, end_time: 1554300425.65, cost time: 115.333260059

# python test_2.py 1.0.16.0
 Japan  0 0
start_time: 1555081532.44, end_time: 1555081532.45, cost time: 0.000912189483643

The test data is about 500,000 records; caching took under 2 minutes, memory usage was about 182 MB, and queries are at millisecond level. Clearly, this approach is worth trying.

The time complexity of zrevrangebyscore is O(log(N) + M), where N is the cardinality of the sorted set and M is the number of elements returned. Lookups therefore get slower as N grows; exactly how much data can still be queried efficiently would need to be verified, but I don't think this is something to worry about here.
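
As a rough sanity check (my own back-of-the-envelope figures, assuming the roughly 500,000 ranges above, each contributing two members to the sorted set):


# -*- coding:utf-8 -*-

import math

ranges = 500000
members = ranges * 2           # each range adds "i" and "i#" to ip2cityid
print(math.log(members, 2))    # ~19.9, so a lookup does on the order of 20 score comparisons, with M = 1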

