A little script for Bootstrap sampling in Pandas

Bootstrap sampling draws data points at random from a sample, with replacement (so duplicates are allowed), to create a slightly different data set of the same size. Repeating this 1000 times or so gives a distribution from which statistics can be computed. I worked out how to do it with pandas, so here is a note.

Try it with the iris data set

Import pandas and the iris sample data, plus the random module used for the random sampling.

import pandas as pd
import random
from sklearn.datasets import load_iris

Then load the data and put it into a pandas data frame.

iris_dataset = load_iris()
df = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)

Take a look at the data with df.describe().

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000

Then define a function that randomly samples the data. For each of the a_data_frame.shape[0] draws, pick a random row position with selected_num = random.choice(range(a_data_frame.shape[0])) and take that row as a one-row data frame, a_data_frame[selected_num : selected_num + 1]. Collect the rows in a list and combine them with pd.concat at the end (DataFrame.append, which older pandas versions offered for this, was removed in pandas 2.0). Note that to select a single row by position with [] in pandas you need a slice such as [0:1]; a plain [0], which would work on a NumPy array, is treated as a column label.

def btstrap(a_data_frame):
    n_rows = a_data_frame.shape[0]
    sampled_rows = []
    for _ in range(n_rows):
        # Pick one row position at random; the same row can be picked again.
        selected_num = random.choice(range(n_rows))
        # A slice keeps the row as a one-row DataFrame.
        sampled_rows.append(a_data_frame[selected_num : selected_num + 1])
    # Combine the one-row frames into a single DataFrame.
    return pd.concat(sampled_rows)
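The note about selecting a single row can be checked on a small frame. This is a minimal sketch; the toy data frame and the variable names are made up for illustration and are not part of the script above.

```python
import pandas as pd

# Hypothetical toy frame, only for illustrating row selection.
df = pd.DataFrame({"a": [10, 20, 30], "b": [1.0, 2.0, 3.0]})

# A slice with [] selects rows by position and returns a DataFrame.
row_slice = df[0:1]

# iloc with a list also returns a one-row DataFrame.
row_iloc_frame = df.iloc[[0]]

# iloc with a bare integer returns the row as a Series instead.
row_iloc_series = df.iloc[0]

# A bare integer inside [] is looked up as a column label,
# so it fails here because there is no column named 0.
try:
    df[0]
    plain_index_works = True
except KeyError:
    plain_index_works = False
```

Either the slice or df.iloc[[i]] keeps the row as a data frame, which is what pd.concat needs in order to stack the rows back together.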

Run btstr_data = btstrap(df), then check the resampled data with btstr_data.describe().

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.750000          3.040667           3.660667          1.176000
std             0.728034          0.410287           1.716634          0.766644
min             4.300000          2.000000           1.100000          0.100000
25%             5.100000          2.800000           1.500000          0.200000
50%             5.700000          3.000000           4.250000          1.300000
75%             6.300000          3.300000           5.000000          1.800000
max             7.700000          4.400000           6.700000          2.500000

Run this in a loop 1000 times or so and collect the result of each fit or variable selection.
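As a rough sketch of such a loop, the example below resamples a frame 1000 times and collects the mean of one column, then reads off a percentile interval. The synthetic data, the column name value, and the choice of the mean as the statistic are all assumptions for illustration; with the iris data you would use df and whatever fit or statistic you care about. Rows are drawn with NumPy integer positions and iloc here rather than with the btstrap function, which does the same job faster.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data; replace with the real data frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.normal(loc=5.0, scale=1.0, size=150)})

n_iterations = 1000
n_rows = len(df)
boot_means = []

for _ in range(n_iterations):
    # Draw n_rows row positions with replacement.
    positions = rng.integers(0, n_rows, size=n_rows)
    resampled = df.iloc[positions]
    # Compute the statistic of interest on the resampled data.
    boot_means.append(resampled["value"].mean())

boot_means = pd.Series(boot_means)

# 95% percentile interval of the bootstrap means.
lower = boot_means.quantile(0.025)
upper = boot_means.quantile(0.975)
```

The spread of boot_means gives an idea of how much the statistic would vary from sample to sample.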

[See below] There was an easier way

If you set replace=True, you can do the same thing with .sample, a built-in pandas method. Thanks to @nkay for pointing this out.

df.sample(n=df.shape[0], replace=True)
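A few details of .sample are worth trying on a small frame. This is only a sketch: the toy data and the seed value are made up, while random_state, frac, and reset_index are existing pandas options for making the draw reproducible, sizing it as a fraction, and renumbering the rows.

```python
import pandas as pd

# Hypothetical toy frame, only for illustration.
df = pd.DataFrame({"a": range(10), "b": list("abcdefghij")})

# Draw as many rows as the frame has, with replacement.
# random_state fixes the seed so the draw can be repeated.
boot = df.sample(n=len(df), replace=True, random_state=42)

# frac=1 asks for the same number of rows as n=len(df).
boot_frac = df.sample(frac=1, replace=True, random_state=42)

# The original index labels are kept, so a row drawn twice
# appears with a repeated label. Renumber if that gets in the way.
boot_renumbered = boot.reset_index(drop=True)
```

In a bootstrap loop, leave random_state unset or change it on every iteration; reusing one fixed seed would return the same resample each time.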
