Splitting a Data Set Randomly by Proportion in Python

  • 2021-07-13 05:55:00
  • OfStack

In machine learning and deep learning we often need to split a data set. In a competition, for example, the organizer provides only a labeled training set and an unlabeled test set: the training set is used for training, while the test set is run through the trained model to produce a result, which is then submitted and scored by the organizer. During training, however, we may run into problems such as over-fitting, and we face choices between algorithms and models. A validation set is very important here, so when the amount of data is sufficient, we usually carve off a certain proportion of the training set to serve as the validation set.

Writing this split script by hand every time is repetitive, so I am posting the simple script on my blog. The code is as follows:


import random

def split(full_list, shuffle=False, ratio=0.2):
  """Split full_list into two sublists; the first holds roughly `ratio` of the items."""
  n_total = len(full_list)
  offset = int(n_total * ratio)
  if n_total == 0 or offset < 1:
    # Too few items to split off anything: return an empty first part
    return [], full_list
  if shuffle:
    random.shuffle(full_list)  # shuffles in place
  sublist_1 = full_list[:offset]
  sublist_2 = full_list[offset:]
  return sublist_1, sublist_2


if __name__ == "__main__":
  li = list(range(5))  # range() must be converted to a list before shuffling
  sublist_1, sublist_2 = split(li, shuffle=True, ratio=0.2)

  print(sublist_1, len(sublist_1))
  print(sublist_2, len(sublist_2))

The main block is just test code. If the training set is given as a file, first read the file into a list, then call split.
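For instance, assuming the training set is a plain text file with one labeled sample per line (the file name train.txt and the tab-separated sample format below are made up for illustration), reading and splitting could look like this:

```python
import random

def split(full_list, shuffle=False, ratio=0.2):
  n_total = len(full_list)
  offset = int(n_total * ratio)
  if n_total == 0 or offset < 1:
    return [], full_list
  if shuffle:
    random.shuffle(full_list)
  return full_list[:offset], full_list[offset:]

# Write a tiny stand-in training file (file name and format are hypothetical)
with open("train.txt", "w", encoding="utf-8") as f:
  for i in range(10):
    f.write("sample_%d\tlabel_%d\n" % (i, i % 2))

# Read the file into a list of lines, then call split
with open("train.txt", encoding="utf-8") as f:
  lines = [line.rstrip("\n") for line in f]

val_set, train_set = split(lines, shuffle=True, ratio=0.2)
print(len(val_set), len(train_set))  # prints: 2 8
```

With ratio=0.2 and 10 lines, the first sublist (used here as the validation set) gets 2 samples and the second keeps the remaining 8.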
