Bootstrap sampling draws data at random from a sample, allowing duplicates (sampling with replacement), to create a slightly different dataset each time. Repeating this 1000 times or so lets you compute statistics across the resamples. I worked out how to do it with pandas, so here is a note.
Import pandas and the iris sample dataset, along with the random module used for random sampling.
import pandas as pd
import random
from sklearn.datasets import load_iris
Then load the data and put it into a pandas dataframe.
iris_dataset = load_iris()
df = pd.DataFrame(data=iris_dataset.data, columns=iris_dataset.feature_names)
Take a look at the data with df.describe().
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.843333          3.057333           3.758000          1.199333
std             0.828066          0.435866           1.765298          0.762238
min             4.300000          2.000000           1.000000          0.100000
25%             5.100000          2.800000           1.600000          0.300000
50%             5.800000          3.000000           4.350000          1.300000
75%             6.400000          3.300000           5.100000          1.800000
max             7.900000          4.400000           6.900000          2.500000
Then define a function that randomly samples the data.
First, prepare an empty list for the sampled rows. Then, once for each row of the original dataframe, pick a random row number with selected_num = random.choice(range(a_data_frame.shape[0])) and add the selected row, a_data_frame[selected_num : selected_num + 1], to the list. Finally, join the collected rows into one dataframe with pd.concat (DataFrame.append, which could grow a dataframe row by row, was removed in pandas 2.0).
Note that to select a single row of a pandas dataframe by position with [] you need a range such as [0:1]; a bare [0], as you would write for numpy, is looked up as a column label instead.
def btstrap(a_data_frame):
    # Draw one row at random (with replacement) for every row in the input
    n_rows = a_data_frame.shape[0]
    picked_rows = []
    for a_data in range(n_rows):
        selected_num = random.choice(range(n_rows))
        # [i : i + 1] keeps the row as a one-row dataframe
        picked_rows.append(a_data_frame[selected_num : selected_num + 1])
    # Join the one-row pieces; original index labels are kept, so they can repeat
    return pd.concat(picked_rows)
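The single-row selection used above can be checked on a tiny frame. This is only a minimal sketch: the frame toy and its columns are made up for illustration and are not part of the iris data. It compares the range form with iloc, which is a more explicit way to pick rows by position.

```python
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]})

row_by_slice = toy[1:2]      # range with []: a one-row dataframe
row_by_iloc = toy.iloc[[1]]  # list of positions with iloc: also a one-row dataframe
row_as_series = toy.iloc[1]  # single position with iloc: a Series, not a dataframe

print(type(row_by_slice).__name__, row_by_slice.shape)
print(row_by_slice.equals(row_by_iloc))
print(type(row_as_series).__name__)
```

Either one-row form can be passed to pd.concat; the Series form would need converting back to a row first.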
Run btstr_data = btstrap(df) and check the resampled data with btstr_data.describe().
       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
count         150.000000        150.000000         150.000000        150.000000
mean            5.750000          3.040667           3.660667          1.176000
std             0.728034          0.410287           1.716634          0.766644
min             4.300000          2.000000           1.100000          0.100000
25%             5.100000          2.800000           1.500000          0.200000
50%             5.700000          3.000000           4.250000          1.300000
75%             6.300000          3.300000           5.000000          1.800000
max             7.700000          4.400000           6.700000          2.500000
Run this in a loop 1000 times or so, and collect the result of a fit or variable selection from each resample.
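The repetition might look like the following. This is only a sketch under assumptions: the frame toy and its column x are made up for illustration, the helper resample_rows is a hypothetical stand-in for the sampling function so that the block runs on its own, and the statistic is simply the column mean rather than a fit or variable selection.

```python
import random

import pandas as pd

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical data, only for illustration
toy = pd.DataFrame({"x": [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]})


def resample_rows(a_data_frame):
    # Draw as many row positions as there are rows, with replacement
    n_rows = a_data_frame.shape[0]
    positions = random.choices(range(n_rows), k=n_rows)
    return a_data_frame.iloc[positions]


n_repeats = 1000
# One statistic (here the mean of x) per bootstrap resample
boot_means = pd.Series(
    [resample_rows(toy)["x"].mean() for _ in range(n_repeats)]
)

# Spread of the statistic across resamples, and a simple percentile interval
boot_se = boot_means.std()
ci_low, ci_high = boot_means.quantile([0.025, 0.975])
print(f"bootstrap SE of the mean: {boot_se:.3f}")
print(f"95% percentile interval: {ci_low:.3f} to {ci_high:.3f}")
```

Replacing the mean with a model fit would follow the same pattern: compute the quantity once per resample, then summarize the collected values.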
If you set replace=True, it seems you can do the same with .sample, which is built into pandas.
@nkay Thank you for pointing this out.
df.sample(n=df.shape[0], replace=True)
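A small check of this call, as a sketch only: the frame toy is made up for illustration, and random_state is passed just to make the draw repeatable. It shows that the resample has the same number of rows as the input and that, as with the slicing approach, the original index labels come along.

```python
import pandas as pd

# Hypothetical toy frame, only for illustration
toy = pd.DataFrame({"x": range(10)})

# n equal to the row count; frac=1 would be an equivalent spelling
resampled = toy.sample(n=toy.shape[0], replace=True, random_state=0)

print(resampled.shape)            # same number of rows as toy
print(resampled.index.is_unique)  # usually False, because rows can be drawn twice

# Optionally renumber the rows if the repeated labels get in the way
renumbered = resampled.reset_index(drop=True)
print(renumbered.index.is_unique)
```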