Python USES sparse matrices to save memory instances

  • 2020-04-02 13:45:32
  • OfStack

Recommendation systems often need to deal with data such as user_id, item_id and rating, which is actually a sparse matrix in mathematics. Sparse module is provided in scipy to solve this problem, but scipy.sparse has many problems that are not suitable for use:

1. Data [I,...] cannot be well supported at the same time. , data[..., j], data[I, j] quick section;
2. Because the data is stored in memory, it cannot support the mass data processing well.

To support data[I,...] , data[..., j], which requires the data set storage of I or j; At the same time, in order to save a large amount of data, also need to put a part of the data on the hard disk, with memory buffer. The solution here is relatively simple, with a dict-like thing to store data, for a certain I (such as 9527), its data is stored in Dict ['i9527'], the same, for a certain j (such as 3306), all its data is stored in Dict ['j3306'], need to take out data[9527,...] Dict ['i9527'] was originally a dict object and stored the value corresponding to a j. In order to save memory space, we stored the dict in the form of binary string and directly added the code:


'''
Sparse Matrix
'''
import struct
import numpy as np
import bsddb
from cStringIO import StringIO
 
class DictMatrix():
    def __init__(self, container = {}, dft = 0.0):
        self._data  = container
        self._dft   = dft
        self._nums  = 0
 
    def __setitem__(self, index, value):
        try:
            i, j = index
        except:
            raise IndexError('invalid index')
 
        ik = ('i%d' % i)
        # To save memory, let's put j, value Package as a word binary string
        ib = struct.pack('if', j, value)
        jk = ('j%d' % j)
        jb = struct.pack('if', i, value)
 
        try:
            self._data[ik] += ib
        except:
            self._data[ik] = ib
        try:
            self._data[jk] += jb
        except:
            self._data[jk] = jb
        self._nums += 1
 
    def __getitem__(self, index):
        try:
            i, j = index
        except:
            raise IndexError('invalid index')
 
        if (isinstance(i, int)):
            ik = ('i%d' % i)
            if not self._data.has_key(ik): return self._dft
            ret = dict(np.fromstring(self._data[ik], dtype = 'i4,f4'))
            if (isinstance(j, int)): return ret.get(j, self._dft)
 
        if (isinstance(j, int)):
            jk = ('j%d' % j)
            if not self._data.has_key(jk): return self._dft
            ret = dict(np.fromstring(self._data[jk], dtype = 'i4,f4'))
 
        return ret
 
    def __len__(self):
        return self._nums
 
    def __iter__(

Test code:


import timeit
timeit.Timer('foo = __main__.data[9527, ...]', 'import __main__').timeit(number = 1000)

It takes 1.4788 seconds to read a piece of data, which is about 1.5ms.
Another advantage of using dict-like to store data is that you can use Dict or any other form of DBM, even the legendary Tokyo Cabinet... .

All right, that's it.


Related articles: