Python Realization of a Weighted Random Selecting Function without Reversal

  • 2021-07-26 08:08:25
  • OfStack

Demand

There is a lottery application, which extracts K winning users (K= number of prizes) from all participating users, and takes the number of lottery codes owned by each user as the weight.

Suppose there are three users and their weights are: A (1), B (1), C (2). The probability of drawing A is 25%, the probability of drawing B is 25%, and the probability of drawing C is 50%.

Analysis

It is intuitive to put two C into the list and select them, such as [A, B, C, C], and use the built-in functions random. choice [A, B, C, C] of Python, so that the probability of C is 50%.

The problem with this method is that when the weight is relatively large, it wastes memory space.

A more general method is to add the weight sum by 4, and then randomly select one value from the [0, 4] interval, and occupy A, B and C in different sizes. [0, 1) is A, [1, 2) is B, and [2, 4) is C.

random. ranint (0, 3) or int (random. random () * 4) can be used to produce a random integer R of 0-3. It is determined which user is selected when the R is in which interval.

The next step is to find out which interval the random number is in.

One method is to traverse the list in sequence and save the traversed element weight synthesis S. If S is greater than R, the current element will be returned.


from operator import itemgetter

users = [('A', 1), ('B', 1), ('C', 2)]

total = sum(map(itemgetter(1), users))

rnd = int(random.random()*total) # 0~3

s = 0
for u, w in users:
  s += w
  if s > rnd:
   return u

However, the complexity of this method is O (N), because all users are traversed.

Think of another method. First, list the accumulated weights in order, and then search them by 2-point method. The complexity of 2-point method is reduced to O (logN) (except other treatments)


users = [('A', 1), ('B', 1), ('C', 2)]

cum_weights = list(itertools.accumulate(map(itemgetter(1), users))) # [1, 2, 4]

total = cum_weights[-1]

rnd = int(random.random()*total) # 0~3

hi = len(cum_weights) - 1
index = bisect.bisect(cum_weights, rnd, 0, hi)

return users(index)[0]

This is how the choices function (available after version 3.6) of the Python built-in library random is implemented. The random. choices function signature is random. choices (population, weights=None, *, cum_weights=None, k=1) population is the list to be selected, weights is the respective weight, cum_weights is the optional calculated cumulative weight (choose one of the two), and k is the number of selections (with back selections). The source code is as follows:


def choices(self, population, weights=None, *, cum_weights=None, k=1):
  """Return a k sized list of population elements chosen with replacement.
  If the relative weights or cumulative weights are not specified,
  the selections are made with equal probability.
  """
  random = self.random
  if cum_weights is None:
    if weights is None:
      _int = int
      total = len(population)
      return [population[_int(random() * total)] for i in range(k)]
    cum_weights = list(_itertools.accumulate(weights))
  elif weights is not None:
    raise TypeError('Cannot specify both weights and cumulative weights')
  if len(cum_weights) != len(population):
    raise ValueError('The number of weights does not match the population')
  bisect = _bisect.bisect
  total = cum_weights[-1]
  hi = len(cum_weights) - 1
  return [population[bisect(cum_weights, random() * total, 0, hi)]
      for i in range(k)]

One step further

Because the built-in random. choices of Python is selected with reversal, the selection function without reversal is random. sample, but this function cannot be selected according to weight (random. sample (population, k).

The native random. sample can select more than one element without affecting the original list. It uses two algorithms to realize it, which ensures good performance in all cases. (Source address: random. sample)

The first is partial shuffle, which is returned when K elements are obtained. The time complexity is O (N), but the original sequence needs to be copied to increase memory usage.


result = [None] * k
n = len(population)
pool = list(population) #  Without changing the original sequence 
for i in range(k):
  j = int(random.random()*(n-i))
  result[k] = pool[j]
  pool[j] = pool[n-i-1] #  The selected elements are removed, and the unselected elements are filled in 
return result

The second is to set a selected set and randomly select it for many times. If the selected elements are in set, they will be selected again without copying a new sequence. When k is smaller than n, random. sample uses this algorithm, and the probability of repeated selection of elements is smaller.


selected = set()
selected_add = selected.add #  Accelerate method access 
for i in range(k):
  j = int(random.random()*n)
  while j in selected:
    j = int(random.random()*n)
  selected_add(j)
  result[j] = population[j]
return result

The lottery application needs a weighted and non-reversal algorithm, and a function weighted_sample is written in combination with the implementation of random.choices and random.sample.

The number of people in the general lottery is much larger than the number of prizes, so the second method of random. sample can be selected as the non-retrofitting lottery, which can certainly be optimized continuously.

The code is as follows:


def weighted_sample(population, weights, k=1):
  """Like random.sample, but add weights.
  """
  n = len(population)
  if n == 0:
    return []
  if not 0 <= k <= n:
    raise ValueError("Sample larger than population or is negative")
  if len(weights) != n:
    raise ValueError('The number of weights does not match the population')

  cum_weights = list(itertools.accumulate(weights))
  total = cum_weights[-1]
  if total <= 0: #  Prevention 1 Some wrong weights 
    return random.sample(population, k=k)
  hi = len(cum_weights) - 1

  selected = set()
  _bisect = bisect.bisect
  _random = random.random
  selected_add = selected.add
  result = [None] * k
  for i in range(k):
    j = _bisect(cum_weights, _random()*total, 0, hi)
    while j in selected:
      j = _bisect(cum_weights, _random()*total, 0, hi)
    selected_add(j)
    result[i] = population[j]
  return result


Related articles: