Python USES sparse matrices to save memory instances
- 2020-04-02 13:45:32
- OfStack
Recommendation systems often need to deal with data such as user_id, item_id and rating, which is actually a sparse matrix in mathematics. Sparse module is provided in scipy to solve this problem, but scipy.sparse has many problems that are not suitable for use:
1. Data [I,...] cannot be well supported at the same time. , data[..., j], data[I, j] quick section;
2. Because the data is stored in memory, it cannot support the mass data processing well.
To support data[I,...] , data[..., j], which requires the data set storage of I or j; At the same time, in order to save a large amount of data, also need to put a part of the data on the hard disk, with memory buffer. The solution here is relatively simple, with a dict-like thing to store data, for a certain I (such as 9527), its data is stored in Dict ['i9527'], the same, for a certain j (such as 3306), all its data is stored in Dict ['j3306'], need to take out data[9527,...] Dict ['i9527'] was originally a dict object and stored the value corresponding to a j. In order to save memory space, we stored the dict in the form of binary string and directly added the code:
'''
Sparse Matrix
'''
import struct
import numpy as np
import bsddb
from cStringIO import StringIO
class DictMatrix():
def __init__(self, container = {}, dft = 0.0):
self._data = container
self._dft = dft
self._nums = 0
def __setitem__(self, index, value):
try:
i, j = index
except:
raise IndexError('invalid index')
ik = ('i%d' % i)
# To save memory, let's put j, value Package as a word binary string
ib = struct.pack('if', j, value)
jk = ('j%d' % j)
jb = struct.pack('if', i, value)
try:
self._data[ik] += ib
except:
self._data[ik] = ib
try:
self._data[jk] += jb
except:
self._data[jk] = jb
self._nums += 1
def __getitem__(self, index):
try:
i, j = index
except:
raise IndexError('invalid index')
if (isinstance(i, int)):
ik = ('i%d' % i)
if not self._data.has_key(ik): return self._dft
ret = dict(np.fromstring(self._data[ik], dtype = 'i4,f4'))
if (isinstance(j, int)): return ret.get(j, self._dft)
if (isinstance(j, int)):
jk = ('j%d' % j)
if not self._data.has_key(jk): return self._dft
ret = dict(np.fromstring(self._data[jk], dtype = 'i4,f4'))
return ret
def __len__(self):
return self._nums
def __iter__(
Test code:
import timeit
timeit.Timer('foo = __main__.data[9527, ...]', 'import __main__').timeit(number = 1000)
It takes 1.4788 seconds to read a piece of data, which is about 1.5ms.
Another advantage of using dict-like to store data is that you can use Dict or any other form of DBM, even the legendary Tokyo Cabinet... .
All right, that's it.