I started thinking about random number generation, got worried enough that I couldn't sleep, and so I wrote up what I found.
In machine learning code, reproducibility is often secured by calling a function like this first.
seal_seed.py
import random

import numpy as np
import tensorflow as tf
import torch


def fix_seed(seed):
    # Python's built-in random module
    random.seed(seed)
    # NumPy
    np.random.seed(seed)
    # PyTorch
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    # TensorFlow
    tf.random.set_seed(seed)


SEED = 42
fix_seed(SEED)
Is this really okay? I was worried, but for fixing the seed this is fine. However, there are a few points to note about the difference between the global seed and RandomState, and about GPUs, so I will explain briefly.
random --- Generate Pseudo-Random Numbers — Python 3.8.3 Documentation
random.seed(seed)
If no seed is given, the current system time is used by default; if the OS provides its own randomness source, that is used instead.
A pseudo-random number generator called [Mersenne Twister](https://ja.wikipedia.org/wiki/%E3%83%A1%E3%83%AB%E3%82%BB%E3%83%B3%E3%83%8C%E3%83%BB%E3%83%84%E3%82%A4%E3%82%B9%E3%82%BF) is used.
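As a minimal sketch of what seeding does for the built-in module, the snippet below reseeds the module-level generator and compares the sequences. The seed value 7 and the variable names are only illustrative. It also shows that a separate random.Random instance keeps its own state.

```python
import random

# Seeding the module-level generator makes the sequence repeatable.
random.seed(42)
first = [random.randint(0, 999) for _ in range(5)]

random.seed(42)
second = [random.randint(0, 999) for _ in range(5)]

# A separate random.Random instance has its own state, so drawing from it
# does not advance the module-level generator.
random.seed(42)
local_rng = random.Random(7)  # 7 is an arbitrary illustrative seed
_ = [local_rng.random() for _ in range(100)]
third = [random.randint(0, 999) for _ in range(5)]

print(first == second == third)  # prints True
```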
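Note that NumPy does not return to the seeded state between calls: the generator's internal state advances with every call, so the same call produces different numbers each time it is executed.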
import numpy as np
np.random.seed(42)
#First time
print(np.random.randint(0, 1000, 10))
# -> [102 435 860 270 106 71 700 20 614 121]
#Second time
print(np.random.randint(0, 1000, 10))
# -> [466 214 330 458 87 372 99 871 663 130]
If you want the same output every time, reset the seed before each call.
import numpy as np
np.random.seed(42)
#First time
print(np.random.randint(0, 1000, 10))
# -> [102 435 860 270 106 71 700 20 614 121]
#Second time
np.random.seed(42)
print(np.random.randint(0, 1000, 10))
# -> [102 435 860 270 106 71 700 20 614 121]
Even if the environment or OS changes, the output appears to be the same as long as the seed fixed at the start is the same.
If you only want to keep an experiment reproducible, fixing the seed once at the beginning, as above, seems to be enough.
Calling np.random.seed(42) is basically fine, but be careful when an external module also fixes the seed. If the external module overwrites it with something like np.random.seed(43), the caller's seed is overwritten as well, because both share the same global generator.
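The overwrite problem can be sketched as follows. The two helper functions are hypothetical stand-ins for an external module, not code from any real library: one reseeds the global generator, the other keeps its own numpy.random.RandomState.

```python
import numpy as np


def helper_reseeding_global(seed):
    """Hypothetical external function that reseeds the global generator."""
    np.random.seed(seed)
    return np.random.rand()


def helper_with_own_state(seed):
    """Hypothetical external function that keeps its own RandomState."""
    rng = np.random.RandomState(seed)
    return rng.rand()


def caller(helper):
    """Seed once, draw, call the helper, then draw again."""
    np.random.seed(42)
    first = np.random.randint(0, 1000, 5)
    helper(43)
    second = np.random.randint(0, 1000, 5)
    return first, second


# Reference run: the helper does nothing at all.
_, expected_second = caller(lambda seed: None)
_, second_after_global = caller(helper_reseeding_global)
_, second_after_local = caller(helper_with_own_state)

local_is_safe = np.array_equal(expected_second, second_after_local)
global_is_safe = np.array_equal(expected_second, second_after_global)
print(local_is_safe, global_is_safe)
```

Only the helper with its own RandomState leaves the caller's second draw unchanged.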
Libraries such as Optuna and Pandas take this into account and create their own random number generator with numpy.random.RandomState instead of touching the global one.
np.random.seed(42)
'''
Some processing
'''
df.sample(frac=0.5, replace=True, random_state=43)
Passing random_state=43 as an argument fixes the seed for that pandas call only.
This way, the NumPy seed fixed at the beginning is not overwritten by 43.
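import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))
np.random.seed(42)
# First call: uses the global state seeded with 42
print(s.sample(n=3)) # -> (83, 53, 70)
# Second call: the global state has advanced, so the result differs
print(s.sample(n=3)) # -> (79, 37, 65)
# Passing random_state gives the same result on every call
print(s.sample(n=3, random_state=42)) # -> (83, 53, 70)
print(s.sample(n=3, random_state=42)) # -> (83, 53, 70)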
Also note that, as with NumPy, the results are no longer fixed from the second call onward when you rely only on the global seed. Either save the result in a variable or pass random_state on every call.
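A small check of both claims, written as a sketch that assumes pandas and NumPy are installed: passing the same random_state returns the same rows, and such a call does not disturb the global NumPy stream. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100))

# Same random_state -> same rows, no matter how often it is called.
a = s.sample(n=3, random_state=42)
b = s.sample(n=3, random_state=42)
same_rows = a.equals(b)

# Reference: the next global draw right after seeding with 42.
np.random.seed(42)
expected_next = np.random.rand()

# Same seed, but with a pandas call that passes its own random_state first.
np.random.seed(42)
s.sample(n=3, random_state=43)
actual_next = np.random.rand()
global_untouched = expected_next == actual_next

print(same_rows, global_untouched)
```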
If you run a Jupyter notebook from top to bottom and the total number of random calls stays the same, calling np.random.seed(42) once at the beginning is enough to keep the results reproducible.
However, as described later, reproducibility may still be slightly off when a GPU is used.
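The condition about the number of calls can be illustrated with a sketch. The function run_notebook is a hypothetical stand-in for executing a notebook from the top, and the extra draw imitates a cell that was run one time too many.

```python
import numpy as np


def run_notebook(extra_call):
    """Hypothetical notebook run: seed once, then draw some numbers."""
    np.random.seed(42)
    if extra_call:
        np.random.rand()  # one extra draw, e.g. a cell executed twice
    return np.random.randint(0, 1000, 5)


clean_a = run_notebook(extra_call=False)
clean_b = run_notebook(extra_call=False)
shifted = run_notebook(extra_call=True)

same_when_call_count_matches = np.array_equal(clean_a, clean_b)
same_with_extra_call = np.array_equal(clean_a, shifted)
print(same_when_call_count_matches, same_with_extra_call)
```

A single extra draw is enough to shift every later result, which is why the call count has to match.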
Scikit-learn functions such as train_test_split accept random_state, but there is no way to fix the seed for Scikit-learn as a whole.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=SEED)
How to set the global random_state in Scikit Learn | Bartosz Mikulski
According to the link above, fixing NumPy's random seed is enough, but be careful: if you rely on that alone, the result changes each time you run the split from the second call onward.
How can I obtain reproducible optimization results?
sampler = TPESampler(seed=SEED) # Make the sampler behave in a deterministic way.
study = optuna.create_study(sampler=sampler)
study.optimize(objective)
Optuna prepares its own RandomState instance internally, so the seed can be specified on the sampler without touching NumPy's global seed.
When using cross-validation in LightGBM
lgb.cv(
    lgbm_params,
    lgb_train,
    early_stopping_rounds=10,
    nfold=5,
    shuffle=True,
    seed=42,
    callbacks=callbacks,
)
The seed can be set this way. The manual describes the parameter as:
Seed used to generate the folds (passed to numpy.random.seed)
Reading that, I thought, "Wait, does this overwrite the global seed?" But the source code does this:
randidx = np.random.RandomState(seed).permutation(num_data)
so it appears to be fine.
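To see why that line is safe, here is a sketch built around the same RandomState(seed).permutation(...) call. The function make_fold_indices is a hypothetical helper written for this illustration and is not LightGBM's actual fold logic. The global state is compared before and after with np.random.get_state().

```python
import numpy as np


def make_fold_indices(num_data, nfold, seed):
    """Hypothetical helper: shuffle row indices with a private RandomState."""
    randidx = np.random.RandomState(seed).permutation(num_data)
    return np.array_split(randidx, nfold)


def same_state(a, b):
    """Compare two tuples returned by np.random.get_state()."""
    return a[0] == b[0] and np.array_equal(a[1], b[1]) and a[2:] == b[2:]


np.random.seed(42)
state_before = np.random.get_state()

folds_a = make_fold_indices(num_data=20, nfold=5, seed=42)
folds_b = make_fold_indices(num_data=20, nfold=5, seed=42)

state_after = np.random.get_state()

folds_match = all(np.array_equal(x, y) for x, y in zip(folds_a, folds_b))
global_untouched = same_state(state_before, state_after)
print(folds_match, global_untouched)
```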
With the Scikit-learn API, the seed can be set like this:
clf = lgb.LGBMClassifier(random_state=42)
The manual states that the default seeds in the C++ code are used if it is not set:
If None, default seeds in C++ code are used.
Once you start wondering what the C++ default seeds are, there is no end to it, so I will stop here.
Reproducibility — PyTorch 1.5.0 documentation
torch.manual_seed(seed)
#For cuDNN
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
There is also torch.cuda.manual_seed_all(seed), but with recent versions of PyTorch, torch.manual_seed(seed) alone is enough.
Also, the manual says:
Deterministic operation may have a negative single-run performance impact, depending on the composition of your model. Due to different underlying operations, which may be slower, the processing speed (e.g. the number of batches trained per second) may be lower than when the model functions nondeterministically. However, even though single-run speed may be slower, depending on your application determinism may save time by facilitating experimentation, debugging, and regression testing.
In other words, making GPU processing deterministic may slow it down.
When reproducibility does not matter and the network structure (computation graph) does not change, setting
torch.backends.cudnn.benchmark = True
can speed things up.
In TensorFlow, the seed is basically fixed as shown below.
tf.random.set_seed(seed)
You can also specify a seed at the operation level, as shown below.
tf.random.uniform([1], seed=1)
To be honest, I did not find much information about TensorFlow on the GPU. GPUs and random number generation seem to have some deep-rooted problems, and the behavior is likely to differ completely depending on the software and hardware.
NVIDIA/tensorflow-determinism: Tracking, debugging, and patching non-determinism in TensorFlow
As with PyTorch, there is a risk of slowdown, so you should assume a trade-off between reproducibility and GPU performance.
Because values may be converted to data types such as FP16 or INT8 inside the GPU for speed, rounding errors may not be negligible. There are likely many things to consider in order to maintain reproducibility.
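The rounding issue can be reproduced on the CPU as a rough sketch. It uses NumPy's float16 as a stand-in for FP16 on a GPU, which is an assumption made for illustration only. The second half shows that floating-point addition depends on the order of operations. Treat that as one plausible contributor to run-to-run differences in parallel reductions, not as a full explanation.

```python
import numpy as np

# Converting a value to FP16 loses precision.
x64 = 0.1
x16 = np.float16(x64)
conversion_error = abs(float(x16) - x64)
print(conversion_error)

# Floating-point addition is not associative, so summing the same values
# in a different order can give a slightly different result.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
order_matters = left_to_right != right_to_left
print(order_matters)  # prints True
```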
In the novel The Hitchhiker's Guide to the Galaxy, the supercomputer Deep Thought gives 42 as the answer to the ultimate question of life, the universe, and everything.
What is it about the random seed "4242"? | Kaggle
On Kaggle, code is ~~copied~~ reused frequently, so the seed = 42 that someone once used as a joke became popular.
Nowadays, people sometimes ensemble the predictions of models trained with different seed values.
- Be careful: the generator state advances every time you run NumPy random number generation, so the output changes from call to call.
- Reproducibility cannot be maintained unless you explicitly set the seed, especially when random numbers are generated on every method call or when you do not know how many times it will be called.
- When using an external library, set random_state each time you call it.
- When writing your own module, create a separate RandomState so that you do not overwrite NumPy's global seed.
- Random number generation around GPUs is quite complicated. There is a trade-off between processing speed and reproducibility (or rather accuracy?).
A simple experiment notebook is available here: machine_leraning_experiments/random_seed_experiment.ipynb at master · si1242/machine_leraning_experiments