Entry summary

Stratified sampling is a technique for maintaining a good population distribution and sampling. In python, it is implemented in scikit-learn's StratifiedShuffleSplit and train_test_split. I'm often taken care of when cross-validating machine learning models. Instead of just using it as a tool, I implemented a sample code for learning in python and compared it with random sampling to deepen my understanding. As a result, it was confirmed that the smaller the number of samples from the extraction source, the better the accuracy of extraction compared to random sampling. (Although it is a natural result, I deepened my understanding by reproducing it with my own hands)

Stratified sampling

Simply put, it's a useful technique for sampling from a biased sample composition population. Divide the population into small groups called "layers". At that time, the dispersion for each layer is as small as possible, and the dispersion between layers is as large as possible. In other words, samples with the same attributes are grouped together.

For example, suppose you have 100 cards with a number from 0 to 9. Shuffle this.

0	1	9	・・・	5	3	7	1

Group this by the same number. It's a number, so you can sort it.

0	0	0	・・・	5	5	5	5	・・・	9	9	9	9

After grouping by the same number like this, sample from each group at the same ratio. For example, if the sampling rate is 10%, random sampling will be performed from each group with numbers 0 to 9 with a 10% probability. If there are 10 numbers from 0 to 9, random sampling is performed as follows.

0	0	0	・・・	0	→ Randomly extract one sheet
1	1	1	・・・	1	→ Randomly extract one sheet
2	2	2	・・・	2	→ Randomly extract one sheet
			・・・
9	9	9	・・・	9	→ Randomly extract one sheet

This stratified sampling is effective for biased sample configurations.

Extraction from biased sample composition

As an easy-to-understand example, let's say that out of 20 cards, 2 are "0" and the remaining 18 are "1".

0	0	1	1	・・・	1	1	1	1	1

(1) When randomly sampling from the whole

Let's say you randomly sample the whole thing and choose 10 of them. The extraction rate is 50%. Since it is random, it may happen that 10 out of 18 "1" s are selected and "0" is not extracted. In that case, the sample after extraction does not contain any "0", so the distribution cannot be maintained.

0	0	1	1	・・・	1	1	1	1	1

↓ ↓ ↓ ↓ Extract 10 out of 20

1	1	1	1	1	1	1	1	1	1

→ ** Since there is no "0", it is clearly different from the distribution of the population! ** **

Therefore, stratified sampling is performed.

(2) In the case of stratified sampling

For stratified sampling, extract as follows.

Extract one out of two "0" (= extraction rate 50%)
Extract 9 out of 18 "1" (= extraction rate 50%)

0	0	1	1	・・・	1	1	1	1	1

↓ ↓ ↓ ↓ Extract 10 out of 20

0	1	1	1	1	1	1	1	1	1

→ ** The composition ratio of "0" and "1" is the same as the population = the distribution is the same! ** **

The samples thus extracted maintain a better population distribution than if they were randomly selected from the whole.

Implement stratified sampling

Now that we have an overview, we will implement stratified sampling. The processing logic is as follows.

--Suppose a given sample consists of K layers. --Let the extraction ratio be r. (0 <r <1) --Extract at least one from each layer. --After extracting one from each layer, random sampling is performed for each layer. --Let the layer number of each layer be i. (i = 1,2,3, .., K) --The number of samples in each layer is N (i). --Let's assume that the sampling rate for random sampling in each layer is X (i). (0 <X (i) <1; i = 1,2,3, .., K) --Since one is always extracted in each layer, random sampling is performed for the remaining N (i) --1 at the extraction rate X (i). ――The relationship between the number of extracts is as follows.

1 + X(i) x (N(i) - 1) = N(i) x r --Therefore, the random sampling rate X (i) for each layer is as follows.
X(i) =( N(i) x r - 1) / ( N(i) - 1)

The code implemented as a sample based on the above logic is as follows.

import numpy as np
import random

def extract_stratified_sampling_result(ratio, base_samples):
    u"""
Stratified sampling is performed from a finite population by specifying the extraction ratio.
    :param ratio:Extraction ratio 0 to 1.0
    :param base_sample:Extraction source group
    :return:
    """
    #First, take out one from each group of numbers.
    #Then, it is randomly selected from each number group to bring the composition ratio closer to the population.
    #X extraction rate after extracting one from each number group(i)And. i is the group number.
    #N for the number of each digit group(i)And.
    #The number to be extracted is ratio x N(i)Is.
    #I have already taken out one, so it remains( N(i) - 1 )Extraction rate X from the pieces(i)Take out at random.
    #Therefore, when combined with the one already taken out, ratio x N(i)Become an individual.
    # X(i) x (N(i) - 1) + 1 = ratio x N(i)
    # X(i) = (ratio x N(i) - 1 )/(N(i) - 1)Is.

    block_count = np.bincount(base_samples)
    x = (ratio * block_count - 1) / (block_count - 1)

    #Calculate the random number threshold for sampling.
    #Threshold= 1.0 -Extraction rate of each group
    #Extract when the random number exceeds the threshold.
    #Extraction rate of a group is 0.If 3, then 1.0 - 0.3 = 0.7, random number is 0.If it is 7 or more, it will be extracted.
    #An array x containing the extraction rates for each number group,
    #Arrange as many numbers as there are numbers.
    threshold = np.repeat(1.0 - x, block_count)

    #Index list of each element when the original set is sorted
    #Extracting from samples in this order will result in sorting.
    sorted_pos = np.argsort(base_samples)

    #Start position of each number group
    block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

    #When the generated random number exceeds the threshold threshold, it is extracted.
    threshold[block_start] = 0  #The first element of each number group is always extracted
    extracted = []
    for i in range(len(base_samples)):
        each_rand = random.random()
        if each_rand > threshold[i]:
            pos = sorted_pos[i]
            extracted.append(base_samples[pos])
    extracted = np.array(extracted, dtype=np.int64)
    return extracted

It is assumed that the argument base_samples contains a list of integers. Let's take a look at each code. First, let's understand the structure of the group of the extraction source base_samples. That's where np.bincount () comes in. It aggregates the number of each number that makes up the list.

    
    block_count = np.bincount(base_samples)

For example, if base_samples contains 9 0s and 91 1s, block_count will return the following result:

[ 9 91]

In other words block_coount [0] = number of numbers 0, block_count [1] = number of numbers 1, That's why. This block_count corresponds to the number N (i) of each layer in the previous logic. Next, find the extraction rate X (i) for each layer. The extraction rate r corresponds to the argument ratio, so the code looks like this:

    x = (ratio * block_count - 1) / (block_count - 1)

Since block_count is a numpy array, the calculation result will also be a numpy array. That is, the contents of x look like this:

x [0] = Extraction rate of layer with number 0 x [1] = Extraction rate of the layer with the number 1

Now that x has been calculated, the next step is to find the random number threshold. If the random number is greater than or equal to this value, the sample is taken out. Therefore,

Random number threshold for each layer= 1.0 - X(i)

It will be. For example, if the extraction rate for the number 0 is 0.10, the threshold is

1.0 - 0.10 = 0.90

It will be.

In other words, the sample is taken out only when the generated random number (0 to 1.0) is 0.90 or more.

The random number thresholds for each layer obtained in this way are repeatedly arranged for the number of samples in each layer.

    #An array x containing the extraction rates for each number group,
    #Arrange as many numbers as there are numbers.
    threshold = np.repeat(1.0 - x, block_count)

x[0] = 0.20 , block_count[0] = 2 Assuming x [1] = 0.10, block_count [1] = 18, the list threshold of random number thresholds is:

threshold = [ 0.80, 0.80, 0.90, 0.90, 0.90, ... 0.90]

Next, get the index list after sorting base_samples.

    #Index list of each element when the original set is sorted
    #Extracting from samples in this order will result in sorting.
    sorted_pos = np.argsort(base_samples)

You can use a combination of threshold, sorted_pos, and base_samples for stratified sampling. For example

--threshold [0]: Random number threshold of the first number of layer 0 --sort_pos [0]: Position (= index) where the first number of layer 0 is stored on base_samples

Therefore, if the first random number generated is threshold [0] or higher, the first element of layer 0 is extracted. By moving the scanning position of threshold and sorted_pos, stratified sampling can be performed.

After that, we will process for the logic that one is always taken out from each layer.

    #Start position of each number group
    block_start = np.concatenate(([0], np.cumsum(block_count)[:-1]))

    #When the generated random number exceeds the threshold threshold, it is extracted.
    threshold[block_start] = 0  #The first element of each number group is always extracted

block_start contains the position of the first element of each layer. For example, if there are 10 0s and 90 1s, then:

block_start = [0,10]

Layers with the number 0 start at the block_start [0] th The layer with the number 1 means that it starts at the block_start [1] th.

Use this block_start to set the random number threshold of the first element of each layer to 0. A random number threshold of 0 means that it will always be extracted.

And since the random number threshold value after the beginning of each layer is (1-X (i)), it will be randomly extracted according to the extraction rate calculated.

The entry is likely to be long, so I'll conclude here.

The following entry presents sample code that compares the performance of simple random sampling with stratified sampling.