PyTorch's DataLoader has a mechanism for multi-process data loading. When I tried to use it on Windows, it did not work and raised the error described below. I investigated a few things and solved it, so this is a note on the method.
Quoted from Official Docs
A DataLoader uses single-process data loading by default. Within a Python process, the Global Interpreter Lock (GIL) prevents true fully parallelizing Python code across threads. To avoid blocking computation code with data loading, PyTorch provides an easy switch to perform multi-process data loading by simply setting the argument num_workers to a positive integer.
Roughly speaking, if the num_workers argument of DataLoader is set to 1 or more, data loading is parallelized across worker processes.
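For reference, here is a minimal, self-contained sketch (not from the original article) of passing num_workers to a DataLoader. A toy TensorDataset stands in for a real dataset, and the `if __name__ == "__main__":` guard is included because of the Windows behavior discussed below.

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Toy dataset: 100 samples with 3 features each, plus random binary labels
    dataset = TensorDataset(torch.randn(100, 3), torch.randint(0, 2, (100,)))
    # num_workers=2 fetches batches in two background worker processes
    loader = DataLoader(dataset, batch_size=4, num_workers=2)
    for inputs, labels in loader:
        pass  # batches arrive already assembled by the workers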
BrokenPipeError
So I set num_workers to a value of 1 or more and ran the code, but it failed with the following error:
BrokenPipeError: [Errno 32] Broken pipe
Even after defining the Dataset in a separate file, as suggested in "Error when you want to load Pytorch Dataset in parallel with DataLoader (Windows)", a similar error occurred.
According to that article, if you want to use multiprocessing on Windows, the function that spawns the processes has to be executed inside an `if __name__ == "__main__":` block.
Before correction
train.py
from torch.utils.data import DataLoader
from dataloader import MyDataset  # Dataset defined in a separate file

def train():
    dataset = MyDataset()
    train_loader = DataLoader(dataset, num_workers=2, shuffle=True,
                              batch_size=4,
                              pin_memory=True,
                              drop_last=True)
    for batch in train_loader:
        pass  # do some processing...

if __name__ == "__main__":
    train()
Revised
train.py
from torch.utils.data import DataLoader
from dataloader import MyDataset  # Dataset defined in a separate file

def train(train_loader):
    for batch in train_loader:
        pass  # do some processing...

if __name__ == "__main__":
    # Moved the dataset and DataLoader creation here
    dataset = MyDataset()
    train_loader = DataLoader(dataset, num_workers=2, shuffle=True,
                              batch_size=4,
                              pin_memory=True,
                              drop_last=True)
    train(train_loader)
In the case of DataLoader, as long as the instance is created inside `if __name__ == "__main__":`, multi-process loading worked even though the actual iteration over the data happens in another function.
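As a variant (my own sketch, not taken from the article, reusing the hypothetical MyDataset from above), wrapping the whole setup in a main() function that is only called from the guarded block also works, because the real requirement is that nothing which ends up starting worker processes runs at plain module import time:

from torch.utils.data import DataLoader
from dataloader import MyDataset  # same hypothetical dataset module as above

def main():
    # When Windows spawns a worker, it re-imports this module; since main()
    # is only called from the guarded block below, this setup is not re-run.
    dataset = MyDataset()
    train_loader = DataLoader(dataset, num_workers=2, shuffle=True,
                              batch_size=4, pin_memory=True, drop_last=True)
    for batch in train_loader:
        pass  # do some processing...

if __name__ == "__main__":
    main()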
This was a note on parallelizing the DataLoader in a Windows environment. Around deep learning there are many things that either do not work on Windows or require some extra effort, so I would like to keep writing articles about errors that come up on Windows.