How to Implement the K-Fold Cross-Validation Method in Python

  • 2021-07-13 05:45:59
  • OfStack

A learner's error on the test set is usually called the "generalization error". To estimate the generalization error, the data set must be divided into a training set and a test set. How should this split be made? Two methods are in common use: k-fold cross-validation and the bootstrap method. There is plenty of material describing both; below is a Python implementation of k-fold cross-validation.


## A simple 2-fold cross-validation
from sklearn.model_selection import KFold
import numpy as np
X=np.array([[1,2],[3,4],[1,3],[3,5]])
Y=np.array([1,2,3,4])
KF=KFold(n_splits=2) # create a 2-fold cross-validation splitter; see the KFold docs for its parameters
for train_index,test_index in KF.split(X):
  print("TRAIN:",train_index,"TEST:",test_index)
  X_train,X_test=X[train_index],X[test_index]
  Y_train,Y_test=Y[train_index],Y[test_index]
  print(X_train,X_test)
  print(Y_train,Y_test)
# Summary: when KFold splits the data into k folds, the test-set indices are taken in order; e.g. in a 4-fold split the test folds are [0],[1],[2],[3].
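If the in-order folds noted above are not desirable, KFold can randomize the split. A minimal sketch, using the `shuffle` and `random_state` parameters of `KFold`:

```python
# Pass shuffle=True (with random_state for reproducibility) so the folds
# are no longer taken in index order.
from sklearn.model_selection import KFold
import numpy as np

X = np.array([[1, 2], [3, 4], [1, 3], [3, 5]])
KF = KFold(n_splits=2, shuffle=True, random_state=0)
for train_index, test_index in KF.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
```

Every sample still appears in exactly one test fold; only the assignment of samples to folds is randomized.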

# A step further
import numpy as np
from sklearn.model_selection import KFold
#Sample=np.random.rand(50,15) # would create a 50x15 random array
Sam=np.random.randn(1000) # 1000 random numbers
New_sam=KFold(n_splits=5)
for train_index,test_index in New_sam.split(Sam): # 5-fold cross-validation split of Sam
#for test_index,train_index in New_sam.split(Sam): # note: split() always yields (train, test) in that order, so swapping the names here would mislabel the sets
  #print(train_index,test_index)
  Sam_train,Sam_test=Sam[train_index],Sam[test_index]
  print('Training set size:',Sam_train.shape,'Test set size:',Sam_test.shape) # show the size of each partition
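In practice the per-fold loop is usually delegated to scikit-learn's `cross_val_score`, which fits and scores an estimator on each fold. A minimal sketch (the choice of `LogisticRegression` and the iris data set here is illustrative, not from the original article):

```python
# cross_val_score runs k-fold cross-validation end to end: it fits the
# estimator on each training fold and scores it on the matching test fold.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores)         # one accuracy score per fold
print(scores.mean())  # averaged estimate of the generalization error
```

The mean of the per-fold scores is the usual cross-validated estimate of generalization performance.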


#Stratified k-fold: split the data while preserving the class proportions
from sklearn.model_selection import StratifiedKFold
import numpy as np
m=np.array([[1,2],[3,5],[2,4],[5,7],[3,4],[2,7]])
n=np.array([0,0,0,1,1,1])
skf=StratifiedKFold(n_splits=3)
for train_index,test_index in skf.split(m,n):
  print("train",train_index,"test",test_index)
  x_train,x_test=m[train_index],m[test_index]
#Stratified k-fold: a second example with three classes
from sklearn.model_selection import StratifiedKFold
import numpy as np
y1=np.array(range(10))
y2=np.array(range(20,30))
y3=np.array(np.random.randn(10))
m=np.append(y1,y2) # combine y1 and y2 into one array of 20 values
m1=np.append(m,y3) # append y3 for 30 values in total
n=[i//10 for i in range(30)] # 30 class labels: three classes of 10 samples each

skf=StratifiedKFold(n_splits=5)
for train_index,test_index in skf.split(m1,n):
  print("train",train_index,"test",test_index)
  x_train,x_test=m1[train_index],m1[test_index]
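The point of stratification can be checked directly: each test fold should contain the classes in the same proportion as the full data set. A minimal sketch with 4 samples of class 0 and 2 of class 1, so every test fold of a 2-fold split holds them in a 2:1 ratio:

```python
# Verify that StratifiedKFold preserves the class ratio in each test fold.
from sklearn.model_selection import StratifiedKFold
import numpy as np

X = np.arange(12).reshape(6, 2)
y = np.array([0, 0, 0, 0, 1, 1])  # 4 samples of class 0, 2 of class 1
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(X, y):
    print(np.bincount(y[test_index]))  # class counts in this test fold: [2 1]
```

A plain KFold on the same data could put both class-1 samples into a single fold, which is exactly what stratification prevents.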

There seems to be no ready-made package for the bootstrap method in Python, perhaps because its principle is simple enough that it is easy to implement yourself.
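A minimal sketch of such a self-written bootstrap split, assuming the usual out-of-bag definition: draw n samples with replacement for the training set, and use the samples that were never drawn (about 36.8% of the data on average) as the test set. The helper name `bootstrap_split` is my own, not a library function:

```python
import numpy as np

def bootstrap_split(n, rng=None):
    """Return (train_index, test_index) for one bootstrap sample of n items."""
    rng = np.random.default_rng(rng)
    train_index = rng.integers(0, n, size=n)              # draw n with replacement
    test_index = np.setdiff1d(np.arange(n), train_index)  # out-of-bag samples
    return train_index, test_index

train_index, test_index = bootstrap_split(10, rng=0)
print("TRAIN:", train_index, "TEST:", test_index)
```

Unlike k-fold cross-validation, the training set here can contain repeated samples, and the test set is whatever was left out of the draw.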
