Isono ~! I have a csv file that contains millions of records, so let's split it up and process it in parallel!
――I hadn't seen the splitting and the parallel processing done together anywhere, so I wrote this up as a memo. Please tell me if there's a good article on it.
- Python 3.7
- pandas: not used
- multiprocessing.Pool: used
- concurrent.futures: not used
File to read
sample.csv
1,Ah
2,I
3,U
4,e
5,O
I borrowed gen_chunks() from someone who had already written one: https://stackoverflow.com/a/4957046
Most of the work was already done for me. Thank you internet, thank you ancestors.
Here is the completed code.
pool.py
import csv
import time
from multiprocessing import Pool


def read():
    f = open("sample.csv", "r")
    reader = csv.reader(f)
    pool = Pool()
    results = []
    for data_list in gen_chunks(reader):
        results.append(pool.apply_async(do_something, [data_list]))
    pool.close()
    pool.join()
    _ = [r.get() for r in results]
    f.close()


def do_something(data_list):
    print(f"start {data_list}")
    time.sleep(len(data_list))
    # actual processing would go here
    print(f"finish {data_list}")


def gen_chunks(reader, chunksize=2):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []
        chunk.append(line)
    yield chunk


if __name__ == "__main__":
    # guard so worker processes don't re-run this module when it is imported
    read()
result
start [['1', 'Ah'], ['2', 'I']]
start [['3', 'U'], ['4', 'e']]
start [['5', 'O']]
finish [['5', 'O']]
finish [['1', 'Ah'], ['2', 'I']]
finish [['3', 'U'], ['4', 'e']]
Since chunksize=2, two records at a time are passed to do_something(). For a big CSV, tune this value to whatever feels right, as sketched below.
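For example, a variant of read() where the chunk size is exposed as a parameter (just a sketch reusing gen_chunks() and do_something() from above; the name read_tuned and the 10000 default are my own placeholders, not a recommendation):

def read_tuned(chunksize=10000):
    # Same flow as read() above, but the chunk size can be tuned
    # to match the size of the CSV and the cost of do_something().
    with open("sample.csv", "r") as f:
        reader = csv.reader(f)
        pool = Pool()
        results = []
        for data_list in gen_chunks(reader, chunksize=chunksize):
            results.append(pool.apply_async(do_something, [data_list]))
        pool.close()
        pool.join()
        _ = [r.get() for r in results]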
The _ = [r.get() for r in results] line is there so that errors raised inside the workers actually surface. Properly I should handle those errors, but I skipped it because it's a hassle. There's probably a nicer way to write this, so please tell me if you know one.
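One possible shape (just a sketch of a fragment that would slot into read(); catching Exception and printing is my assumption, not something from this post):

# apply_async re-raises a worker's exception when get() is called,
# so this loop is where per-chunk failures can be caught and reported.
for r in results:
    try:
        r.get()
    except Exception as e:
        print(f"a chunk failed: {e}")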
Also, as pointed out in the stackoverflow comments, if the part of gen_chunks() that resets the list is changed to del chunk[:], the output becomes the following.
result
start [['5', 'O']]
start [['5', 'O']]
start [['5', 'O']]
finish [['5', 'O']]
finish [['5', 'O']]
finish [['5', 'O']]
I hadn't read the comments properly, so I got to witness this sad result: with del chunk[:] the same list object is reused and mutated in place, so by the time the pool serializes the arguments for the workers, every chunk holds only the final contents.
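If you really did want to reuse one list with del chunk[:], yielding a copy would avoid the problem. A sketch (not what the code above does; the code above simply rebinds chunk = []):

def gen_chunks_copy(reader, chunksize=2):
    """Variant that reuses one list but yields copies, so the in-place
    del chunk[:] can't clobber chunks already handed to the pool."""
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield list(chunk)  # hand out a copy
            del chunk[:]       # only our private list is cleared
        chunk.append(line)
    yield list(chunk)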