PyTorch's DataLoader has a mechanism for multi-process data loading. When I tried to use it on Windows, it did not work and raised the error described below. I investigated a few things and solved it, so this is a note on the method.
Quoted from Official Docs
A DataLoader uses single-process data loading by default. Within a Python process, the Global Interpreter Lock (GIL) prevents true fully parallelizing Python code across threads. To avoid blocking computation code with data loading, PyTorch provides an easy switch to perform multi-process data loading by simply setting the argument num_workers to a positive integer.
Roughly speaking, if the num_workers argument of DataLoader is set to 1 or more, data loading is parallelized across worker processes.
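For reference, here is a minimal, self-contained sketch (not from the original article) of passing num_workers to a DataLoader. A toy TensorDataset stands in for a real dataset, and the `if __name__ == "__main__":` guard is included because of the Windows behavior discussed below.

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Toy dataset: 100 samples with 3 features each, plus random binary labels
    dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
    # num_workers=2 fetches batches in two background worker processes
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for inputs, labels in loader:
        pass  # batches arrive already assembled by the workers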
BrokenPipeError
So I set num_workers to a value of 1 or more and ran the code, but it failed with the following error:
BrokenPipeError: [Errno 32] Broken pipe
Even after defining the Dataset in a separate file, as suggested in "Error when you want to load Pytorch Dataset in parallel with DataLoader (Windows)", a similar error occurred.
According to that article, if you want to use multiprocessing on Windows, the function that spawns the processes has to be executed inside an `if __name__ == "__main__":` block.
Before correction
train.py
from torch.utils.data import DataLoader
from dataloader import MyDataset  # Dataset defined in a separate file

def train():
    dataset = MyDataset()
    train_loader = DataLoader(dataset, num_workers=2, shuffle=True,
                              batch_size=4,
                              pin_memory=True,
                              drop_last=True)
    for batch in train_loader:
        pass  # do some processing...

if __name__ == "__main__":
    train()
Revised
train.py
from torch.utils.data import DataLoader
from dataloader import MyDataset  # Dataset defined in a separate file

def train(train_loader):
    for batch in train_loader:
        pass  # do some processing...

if __name__ == "__main__":
    # Moved the dataset and DataLoader creation here
    dataset = MyDataset()
    train_loader = DataLoader(dataset, num_workers=2, shuffle=True,
                              batch_size=4,
                              pin_memory=True,
                              drop_last=True)
    train(train_loader)
In the case of DataLoader, as long as the instance is created inside `if __name__ == "__main__":`, multi-process loading worked even though the actual iteration over the data happens in another function.
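As a variant (my own sketch, not taken from the article, reusing the hypothetical MyDataset from above), wrapping the whole setup in a main() function that is only called from the guarded block also works, because the real requirement is that nothing which ends up starting worker processes runs at plain module import time:

from torch.utils.data import DataLoader
from dataloader import MyDataset  # same hypothetical dataset module as above

def main():
    # When Windows spawns a worker, it re-imports this module; since main()
    # is only called from the guarded block below, this setup is not re-run.
    dataset = MyDataset()
    train_loader = DataLoader(dataset, num_workers=2, shuffle=True,
                              batch_size=4, pin_memory=True, drop_last=True)
    for batch in train_loader:
        pass  # do some processing...

if __name__ == "__main__":
    main()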
This was a note on parallelizing the DataLoader in a Windows environment. Around deep learning there are many things that either do not work on Windows or require some extra effort, so I would like to keep writing articles about errors that come up on Windows.