Implementing Proportional Random Dataset Splitting in Python
- 2021-07-13 05:55:00
- OfStack
In machine learning and deep learning we often need to split a dataset. In a competition, for example, the organizer provides only a labeled training set and an unlabeled test set: we train on the training set, run the trained model on the test set, and submit the predictions, which the organizer then verifies and scores. During training, however, we face problems such as overfitting and must choose between algorithms and models, and this is where a validation set becomes essential. If there is enough data, we usually set aside a fixed proportion of the training set to serve as the validation set.
Writing this small script by hand every time you need to split a dataset is repetitive, so I am putting it on the blog for reuse. The code is as follows:
import random

def split(full_list, shuffle=False, ratio=0.2):
    """Split full_list into two sublists; the first holds `ratio` of the items."""
    n_total = len(full_list)
    offset = int(n_total * ratio)
    if n_total == 0 or offset < 1:
        return [], full_list
    if shuffle:
        random.shuffle(full_list)  # note: shuffles the caller's list in place
    sublist_1 = full_list[:offset]
    sublist_2 = full_list[offset:]
    return sublist_1, sublist_2

if __name__ == "__main__":
    li = list(range(5))  # range() must be converted to a list before shuffling
    sublist_1, sublist_2 = split(li, shuffle=True, ratio=0.2)
    print(sublist_1, len(sublist_1))
    print(sublist_2, len(sublist_2))
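One practical note: because the shuffle is random, each run produces a different split. If you want the same validation/training split on every run (for example, to compare models fairly), seed the `random` module first. A minimal sketch of the same 20% cut as `split` above, made reproducible with a fixed seed (the seed value 42 is arbitrary):

```python
import random

random.seed(42)  # fix the shuffle order so the split is identical on every run
data = list(range(10))
random.shuffle(data)

offset = int(len(data) * 0.2)  # same 20% cut that split() performs
val, train = data[:offset], data[offset:]
print(val, train)
```

Re-running this script always yields the same `val` and `train` lists, because `random.seed(42)` resets the generator before each shuffle.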
The main block is test code. If the training set is given as a file, read the file into a list first and then call split.
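As a concrete sketch of that last step, assuming the training set is a plain-text file with one sample per line (the filename train.txt is hypothetical; the sketch writes a tiny demo file first so it runs end to end):

```python
import random
from pathlib import Path

def split(full_list, shuffle=False, ratio=0.2):
    # same splitting logic as above: first `ratio` share vs. the rest
    n_total = len(full_list)
    offset = int(n_total * ratio)
    if n_total == 0 or offset < 1:
        return [], full_list
    if shuffle:
        random.shuffle(full_list)
    return full_list[:offset], full_list[offset:]

# create a small demo training file (hypothetical name) with one sample per line
Path("train.txt").write_text("a\nb\nc\nd\ne\n", encoding="utf-8")

# read the file into a list, stripping the trailing newline from each line
with open("train.txt", encoding="utf-8") as f:
    lines = [line.rstrip("\n") for line in f]

val_lines, train_lines = split(lines, shuffle=True, ratio=0.2)
print(val_lines, train_lines)
```

With five lines and a ratio of 0.2, one line goes to the validation list and four to the training list; which line lands where depends on the shuffle.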