Splitting a CSV file, reading it, and processing it in parallel

Isono~! I have a CSV file with millions of records, so let's split it up and process it in parallel!

Remarks

- I couldn't find an article that does both splitting and parallel processing together, so I wrote this as a memo. If there is a good article on it, please let me know.
- Python 3.7
- pandas: not used
- Pool: used
- concurrent.futures: not used

I'll do it right away

File to read

sample.csv


1,Ah
2,I
3,U
4,e
5,O

I borrowed gen_chunks() from an existing Stack Overflow answer: https://stackoverflow.com/a/4957046. Someone had already done most of the work. Thank you internet, thank you ancestors. The completed code is below.

pool.py


import csv
import time
from multiprocessing import Pool

def read():
    with open("sample.csv", "r") as f:
        reader = csv.reader(f)

        pool = Pool()
        results = []
        for data_list in gen_chunks(reader):
            # dispatch each chunk to a worker process
            results.append(pool.apply_async(do_something, [data_list]))
        pool.close()
        pool.join()
        # get() re-raises any exception that occurred in a worker
        _ = [r.get() for r in results]

def do_something(data_list):
    print(f"start {data_list}")
    time.sleep(len(data_list))
    # actual per-chunk processing would go here
    print(f"finish {data_list}")

def gen_chunks(reader, chunksize=2):
    """
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices.
    """
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []
        chunk.append(line)
    yield chunk

# the guard is required on platforms that spawn (Windows, macOS);
# without it, each worker would re-run read() on import
if __name__ == "__main__":
    read()
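Before looking at the output, gen_chunks() can be sanity-checked on its own. It works with any iterable of rows, not just a csv.reader (the sample rows below are plain ASCII stand-ins for sample.csv, used here only for illustration):

```python
def gen_chunks(reader, chunksize=2):
    # same chunker as in pool.py: yields `chunksize`-row slices
    chunk = []
    for i, line in enumerate(reader):
        if i % chunksize == 0 and i > 0:
            yield chunk
            chunk = []
        chunk.append(line)
    yield chunk

rows = [["1", "a"], ["2", "b"], ["3", "c"], ["4", "d"], ["5", "e"]]
print(list(gen_chunks(rows)))
# → [[['1', 'a'], ['2', 'b']], [['3', 'c'], ['4', 'd']], [['5', 'e']]]
```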

result


start [['1', 'Ah'], ['2', 'I']]
start [['3', 'U'], ['4', 'e']]
start [['5', 'O']]
finish [['5', 'O']]
finish [['1', 'Ah'], ['2', 'I']]
finish [['3', 'U'], ['4', 'e']]

Since chunksize=2, two records at a time are passed to do_something(). For a big CSV, tune this value to taste. Errors from the workers are picked up by _ = [r.get() for r in results], since get() re-raises in the parent any exception raised in a worker. I should handle those errors properly, but I omit it here because it is tedious. There seems to be a better way to write this, so please let me know if you know one.
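As a rough sketch of what that error handling could look like (might_fail and the integer "chunks" here are hypothetical stand-ins for do_something() and the CSV chunks, not the article's code): wrapping each get() in try/except lets one failed chunk be logged without discarding the results of the others.

```python
from multiprocessing import Pool

def might_fail(n):
    # hypothetical worker: one "chunk" raises to simulate a failure
    if n == 2:
        raise ValueError(f"bad chunk {n}")
    return n * 10

def run():
    pool = Pool()
    results = [pool.apply_async(might_fail, [n]) for n in range(4)]
    pool.close()
    pool.join()
    outcomes = []
    for r in results:
        try:
            # get() re-raises the worker's exception in the parent
            outcomes.append(("ok", r.get()))
        except ValueError as exc:
            outcomes.append(("error", str(exc)))
    return outcomes

if __name__ == "__main__":
    print(run())
```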

Also, as pointed out in the Stack Overflow comments, if the list-reset part of gen_chunks() is changed to del chunk[:], the output becomes the following.

result


start [['5', 'O']]
start [['5', 'O']]
start [['5', 'O']]
finish [['5', 'O']]
finish [['5', 'O']]
finish [['5', 'O']]

I hadn't read the comments properly, so I witnessed this disappointing result. Sad. The cause is aliasing: del chunk[:] empties the same list object that was already yielded and queued, so by the time the data is serialized and sent to the workers, every task sees the final chunk. chunk = [] instead binds a fresh list and leaves the already-yielded one untouched.
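The aliasing itself can be reproduced without multiprocessing at all. This minimal sketch (not the article's code) mimics gen_chunks() by collecting chunks into a list, comparing the two reset styles:

```python
def collect(reset_in_place):
    # builds chunks of 2 from the numbers 1..5, like gen_chunks()
    chunks = []
    chunk = []
    for i, n in enumerate([1, 2, 3, 4, 5]):
        if i % 2 == 0 and i > 0:
            chunks.append(chunk)  # stores a reference, not a copy
            if reset_in_place:
                del chunk[:]      # clears the SAME list just stored
            else:
                chunk = []        # fresh list; the stored one is safe
        chunk.append(n)
    chunks.append(chunk)
    return chunks

print(collect(reset_in_place=False))  # [[1, 2], [3, 4], [5]]
print(collect(reset_in_place=True))   # [[5], [5], [5]]
```

With del chunk[:], every entry of the result is literally the same list object, which is exactly the [['5', 'O']] × 3 output above.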
