When analyzing time series data, you often read the data from csv / tsv files.
With a file of a few tens of MB you may not care much, but with a file of several hundred MB, just reading it takes anywhere from a few seconds to a few tens of seconds, and the wait that occurs every time you run the code becomes stressful.
Here I will introduce a way to dramatically speed up the reading process with a little ingenuity.
Execution times are measured on the MacBook Pro at hand:

MacBook Pro (Retina, 13-inch, Mid 2014)

The measurement code runs on the latest release of the Python 2.7 series.

```
$ python --version
Python 2.7.10
```
The pandas module is used to read the csv file.

```
>>> import pandas
>>> pandas.__version__
0.18.1
```
The csv file to be read holds dollar-yen exchange rates in 1-minute increments from 2005 to 2015 that happened to be at hand: 5 columns (time, open price, high price, low price, close price) x 3,687,221 rows.
The contents look like this:

| time | open | high | low | close |
|---|---|---|---|---|
| 2005-01-03 03:00:00 | 102.360 | 102.390 | 102.360 | 102.390 |
```
$ wc -c < usdjpy-1m.csv
204760049
```

It's a csv file of about 204MB.
The time is measured with the `time` function from the standard `time` module. As preparation I wrote the following helper: the `calctime` function simply executes an arbitrary expression and returns the difference between the start time and the end time.
calctime.py

```python
#!/usr/bin/python
# -*- encoding: utf-8 -*-
from time import time

def calctime(func):
    # Run func and return its result together with the elapsed time
    start = time()
    r = func()
    return {'value': r, 'time': time() - start}
```
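As a quick sanity check (my own addition, not part of the original measurements), calling it on `time.sleep` shows the shape of the dict it returns:

```python
from time import sleep

result = calctime(lambda: sleep(1))
print(result['time'])   # roughly 1.0
print(result['value'])  # None, the return value of sleep()
```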
First, to establish a baseline, let's time the `read_csv` method as-is.
calctime.py

```python
import pandas

csv = calctime(lambda: pandas.read_csv('usdjpy-1m.csv'))
print('Plain read_csv took {} seconds'.format(csv['time']))
```

```
Plain read_csv took 7.56608009338 seconds
```
It took about 7.5 seconds. Since the file handled this time is time series data, let's also parse the column that stores the time into the datetime type.
calctime.py

```python
import pandas

csv = calctime(lambda: pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time'))
print('read_csv with time parsing took {} seconds'.format(csv['time']))
```

```
read_csv with time parsing took 40.0371601582 seconds
```
Now it takes 40 seconds. Waiting 40 seconds on every run is out of the question, so let's speed it up.
Python ships with a standard module called `pickle` that serializes in-memory objects into byte strings.
As the name suggests, it turns a short-lived object in memory into a form that can be written to a file and kept in storage, just as raw vegetables can be pickled and stored for a long time.
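Before applying it to the DataFrame, here is a minimal sketch of the round trip (the dict and its values are made up purely for illustration):

```python
import pickle

# A short-lived in-memory object...
obj = {'pair': 'USDJPY', 'rows': 3687221}

# ...serialized to a byte string, then restored from it
data = pickle.dumps(obj)
restored = pickle.loads(data)
print(restored == obj)  # True
```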
Applied to our csv, the code looks like this.
The object returned by `pandas.read_csv` is saved as usdjpy-1m.csv.pkl using `pickle.dump`, and then the time to read the dumped byte string back with `pickle.load` is measured.
calctime.py

```python
import pandas
import pickle

# Load the csv
csv = pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time')

# Dump the object to a file as bytes
pickle.dump(csv, open('usdjpy-1m.csv.pkl', 'wb'))

# Reload the dumped file
pickled = calctime(lambda: pickle.load(open('usdjpy-1m.csv.pkl', 'rb')))
print('Loading with pickle took {} seconds'.format(pickled['time']))
```

```
Loading with pickle took 5.65008401871 seconds
```
That is a big drop, from 40 seconds to 5.6 seconds.
It seems to work because the datetime parsing no longer has to happen on every run.
Once you dump the object with pickle, from then on you can simply read it back and use it as a pandas object, which is convenient. Great progress.
Still, 5.6 seconds is long enough to be annoying on every run. Let's make this pickle even faster.
As described at http://docs.python.jp/2/library/pickle.html#pickle.dump , pickle lets you specify the protocol to use when dumping.
The protocol comes in versions 0 to 2: 0 is the oldest and slowest, 2 is the newest and fastest. Version 2 requires Python 2.3 or later and is not backward compatible with older interpreters.
The default is version 0, so if you call `pickle.dump` without specifying anything, you trade speed for backward compatibility and end up on the slow protocol.
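Incidentally, instead of hard-coding the version, `pickle.HIGHEST_PROTOCOL` always names the newest protocol the running interpreter supports. A small sketch, assuming the `csv` DataFrame from the previous snippet:

```python
import pickle

print(pickle.HIGHEST_PROTOCOL)  # 2 on Python 2.7

# Ask for the newest protocol rather than hard-coding 2
with open('usdjpy-1m.csv.pkl', 'wb') as f:
    pickle.dump(csv, f, protocol=pickle.HIGHEST_PROTOCOL)
```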
Let's specify the protocol in the code above and re-execute.
calctime.py

```python
import pandas
import pickle

# Load the csv
csv = pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time')

# Dump bytes to a file using protocol version 2 (a binary protocol, so open with 'wb')
pickle.dump(csv, open('usdjpy-1m.csv.pkl', 'wb'), protocol=2)

# Reload the dumped file
pickled = calctime(lambda: pickle.load(open('usdjpy-1m.csv.pkl', 'rb')))
print('With protocol version 2 it took {} seconds'.format(pickled['time']))
```

```
With protocol version 2 it took 1.07643604279 seconds
```
From 5.6 seconds down to about 1 second. Now we are reading at a decent speed.
But I'm still not satisfied. Waiting 1 second on every run is stressful.
Finally, let's add one more trick to make the code truly fast.
As long as the byte string dumped by pickle is saved as a file on storage, reading it still takes some time.
Conversely, if access time to storage is the bottleneck, then it should be eliminated.
So let's change the save destination to redis, which keeps data in memory instead of on storage.
redis is a key-value store database that operates on-memory, and because it keeps its data in memory it is extremely fast.
Detailed installation instructions are omitted here; please look them up if you are interested.
Once the installation succeeds, you can check that redis is running properly with:

```
$ redis-cli ping
PONG
```
To access redis from Python, install the `redis` module with pip.

```
$ pip install redis
```
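A quick way to confirm that the module can reach the server from Python is `ping()`, which returns True on success:

```
>>> import redis
>>> redis.Redis(host='localhost', port=6379, db=0).ping()
True
```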
Let's modify the earlier code a little: the loaded csv is dumped to a byte string and saved to redis, and the time to get it back from redis is measured.
This time neither the dump nor the load goes through a file; since I want the bytes in a variable, the methods used are `pickle.dumps` and `pickle.loads`.
calctime.py

```python
import pandas
import pickle
import redis

# Load the csv
csv = pandas.read_csv('usdjpy-1m.csv', parse_dates=['time'], index_col='time')

# Dump to a byte string using protocol version 2
dumped = pickle.dumps(csv, protocol=2)

# Save the dumped bytes in redis
r = redis.Redis(host='localhost', port=6379, db=0)
r.set('usdjpy-1m.csv', dumped)

# Read the bytes back from redis and restore with pickle
csv = calctime(lambda: pickle.loads(r.get('usdjpy-1m.csv')))
print('With redis it took {} seconds'.format(csv['time']))
```

```
With redis it took 0.332698106766 seconds
```
That brings us from 1 second down to 0.33 seconds.
Now the code can be run without any stress. Compared to the initial 40-second load, that is about 120 times faster.
Here is the complete set of code used in the experiment.
calctime.py

```python
#!/usr/bin/python
# -*- encoding: utf-8 -*-
import pandas
import pickle
import redis
from time import time

# Time measurement helper
def calctime(func):
    start = time()
    r = func()
    return {'value': r, 'time': time() - start}

# csv file path
csv_filepath = 'usdjpy-1m.csv'

# Plain read
read_csv = calctime(lambda: pandas.read_csv(csv_filepath, parse_dates=['time'], index_col='time'))
print('Plain read_csv took {} seconds'.format(read_csv['time']))

# Read using pickle protocol version 0
pickle.dump(read_csv['value'], open('csv0.pkl', 'wb'))
pickle_load_0 = calctime(lambda: pickle.load(open('csv0.pkl', 'rb')))
print('pickle with protocol version 0 took {} seconds'.format(pickle_load_0['time']))

# Read using pickle protocol version 2
pickle.dump(read_csv['value'], open('csv2.pkl', 'wb'), protocol=2)
pickle_load_2 = calctime(lambda: pickle.load(open('csv2.pkl', 'rb')))
print('pickle with protocol version 2 took {} seconds'.format(pickle_load_2['time']))

# Read using redis
r = redis.Redis(host='localhost', port=6379, db=0)
r.set(csv_filepath, pickle.dumps(read_csv['value'], protocol=2))
redis_get = calctime(lambda: pickle.loads(r.get(csv_filepath)))
print('With redis it took {} seconds'.format(redis_get['time']))
```
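As a closing touch, here is a small convenience wrapper in the same spirit (the helper name `load_cached_csv` is my own, not part of the measurements above): it serves the DataFrame from the redis cache when a dump already exists, and otherwise falls back to `read_csv` and caches the result for next time.

```python
import pandas
import pickle
import redis

def load_cached_csv(filepath):
    """Return the csv as a DataFrame, paying the parsing cost only once."""
    r = redis.Redis(host='localhost', port=6379, db=0)
    dumped = r.get(filepath)
    if dumped is not None:
        # Cache hit: restore the pickled DataFrame straight from memory
        return pickle.loads(dumped)
    # Cache miss: parse the csv, then cache the pickled bytes for next time
    csv = pandas.read_csv(filepath, parse_dates=['time'], index_col='time')
    r.set(filepath, pickle.dumps(csv, protocol=2))
    return csv

csv = load_cached_csv('usdjpy-1m.csv')
```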
Data analysis often means running the same code many times through trial and error; I hope this helps you turn those iterations around as efficiently as possible.
The author posts technical information on Twitter every day. I would be grateful if you could follow me.
https://twitter.com/YuhsakInoue