Things to do

Pickle a dataset loaded from a CSV file
Read the pickle file from another notebook

What is pickle?

A standard-library module that saves Python objects as binary data. https://docs.python.org/ja/3/library/pickle.html

What are the benefits?

Loading is fast: because the data is binary, no text parsing is needed. Trained models can also be pickled and reused.
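As a minimal sketch of the "pickle it and reuse it" idea: the dict below is only a hypothetical stand-in for a trained model (its keys and values are made up for illustration), but any picklable Python object goes through the same two calls.

```python
import pickle

# Hypothetical "trained model": learned parameters kept in a plain dict (assumption)
model = {
    'weights': [0.5, -1.2, 3.0],
    'bias': 0.1,
    'feature_names': ['Pclass', 'Age', 'Fare'],
}

# Serialize to bytes in memory; pickle.dump(model, f) writes the same bytes to a file
payload = pickle.dumps(model)

# Restore the object later without rebuilding it
restored = pickle.loads(payload)

print(type(payload).__name__)  # bytes
print(restored == model)       # True
print(restored is model)       # False: the restored object is an independent copy
```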

This benchmark article is excellent: "Python: I investigated the persistence format of pandas"

Try with Titanic data

For now, let's pickle train.csv. This is all the code you need:

# pickle is part of the standard library, so no install is required
import pickle

import pandas as pd


train = pd.read_csv('../input/titanic/train.csv')

# Open the file in 'wb' (write binary) mode
with open('train.pickle', 'wb') as f:
    pickle.dump(train, f)

Save as Dataset

First, commit the notebook.

When the green Complete appears in the upper left, click Open Version.

Scroll down to the Output section.

If you can see train.pickle, click New Dataset.

Enter any Dataset title you like and create it.

The Dataset is complete.

Bring it into another notebook

Create a new notebook and click + Add Data.

Filter by Your Datasets.

Add the Dataset you just created.

You are done if it is displayed here.

Let's read it

This is all the code you need:

import pickle

# Open the file in 'rb' (read binary) mode
with open('../input/titanicdatasetpickles/train.pickle', 'rb') as f:
    train = pickle.load(f)

It is properly loaded as a DataFrame.

train.shape

# (891, 12)
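To double-check that a DataFrame really survives the round trip, here is a small sketch on a made-up frame (not the Titanic data; the column names are assumptions). It also compares against a plain CSV round trip, where datetime columns come back as text unless you ask `read_csv` to parse them.

```python
import os
import pickle
import tempfile

import pandas as pd

# Small made-up frame with a datetime column and a categorical column
df = pd.DataFrame({
    'id': [1, 2, 3],
    'when': pd.to_datetime(['2019-12-01', '2019-12-08', '2019-12-09']),
    'grade': pd.Categorical(['a', 'b', 'a']),
})

with tempfile.TemporaryDirectory() as tmp:
    pickle_path = os.path.join(tmp, 'df.pickle')
    csv_path = os.path.join(tmp, 'df.csv')

    # Write the same frame in both formats
    with open(pickle_path, 'wb') as f:
        pickle.dump(df, f)
    df.to_csv(csv_path, index=False)

    # Read both back
    with open(pickle_path, 'rb') as f:
        from_pickle = pickle.load(f)
    from_csv = pd.read_csv(csv_path)

# Values and dtypes are identical after the pickle round trip
print(df.equals(from_pickle))  # True
# The datetime column is still a datetime column after pickle...
print(pd.api.types.is_datetime64_any_dtype(from_pickle['when']))  # True
# ...but not after a default read_csv
print(pd.api.types.is_datetime64_any_dtype(from_csv['when']))  # False
```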

Please note that the directory name may differ from the name displayed on the right side of the screen. The last part of the Dataset URL, as in `https://www.kaggle.com/anata-no-namae/data-set-no-namae`, is what becomes the directory name. If you check with the `ls` command, you can see that the directory name here has no `-`.

!ls ../input

# titanicdatasetpickles
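If you would rather not guess the directory name, you can also search for the file from Python. This is a rough sketch using only the standard library; `find_files` is a hypothetical helper name, and the demo builds a throwaway directory that merely imitates the `../input` layout.

```python
import tempfile
from pathlib import Path


def find_files(root, filename):
    """Return every path under `root` whose file name is exactly `filename`, sorted."""
    return sorted(Path(root).rglob(filename))


# Demo on a temporary directory that mimics ../input (layout is an assumption)
with tempfile.TemporaryDirectory() as tmp:
    dataset_dir = Path(tmp) / 'input' / 'titanicdatasetpickles'
    dataset_dir.mkdir(parents=True)
    (dataset_dir / 'train.pickle').write_bytes(b'')

    matches = find_files(Path(tmp) / 'input', 'train.pickle')
    relative = [p.relative_to(tmp).as_posix() for p in matches]

print(relative)  # ['input/titanicdatasetpickles/train.pickle']
```

In a Kaggle notebook the call would look like `find_files('../input', 'train.pickle')`.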

Consider putting it in a file

Let's make the dump process reusable

`dump_pickles.py`



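import os
import pickle

import pandas as pd


# Switch the input path between Kaggle and other environments.
# `_dh` (IPython's directory history) only exists inside a notebook,
# so a .py file checks the current working directory instead.
if os.getcwd().startswith('/kaggle/working'):
    input_path = '../input'
else:
    input_path = './input'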

# Edit only this dict for each competition
data_sets = {
    'train': f'{input_path}/titanic/train.csv',
    'test': f'{input_path}/titanic/test.csv',
    'gender_submission': f'{input_path}/titanic/gender_submission.csv'
}

for name, path in data_sets.items():
    df = pd.read_csv(path)
    with open(f'{name}.pickle', 'wb') as f:
        pickle.dump(df, f)
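The reading side can be wrapped up the same way. The following is only a sketch of a possible counterpart: `load_pickles` is a hypothetical helper name, and the demo uses plain lists as stand-ins for DataFrames so that it runs anywhere.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickles(directory):
    """Load every *.pickle file in `directory` into a dict keyed by file stem."""
    loaded = {}
    for path in sorted(Path(directory).glob('*.pickle')):
        with open(path, 'rb') as f:
            loaded[path.stem] = pickle.load(f)
    return loaded


# Demo with made-up data: lists stand in for the DataFrames
with tempfile.TemporaryDirectory() as tmp:
    for name, obj in {'train': [1, 2, 3], 'test': [4, 5]}.items():
        with open(Path(tmp) / f'{name}.pickle', 'wb') as f:
            pickle.dump(obj, f)

    data = load_pickles(tmp)

print(data)  # {'test': [4, 5], 'train': [1, 2, 3]}
```

On Kaggle this would be called as `load_pickles('../input/titanicdatasetpickles')`, after which `data['train']` is the DataFrame.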

You can do the same with pandas alone

# This
with open('./train.pickle', 'wb') as f:
    pickle.dump(train, f)

# can be written like this
train.to_pickle('./train.pickle')

# And this
with open('../input/titanicdatasetpickles/train.pickle', 'rb') as f:
    train = pickle.load(f)

# can be written like this
train = pd.read_pickle('../input/titanicdatasetpickles/train.pickle')

Sometimes I get this error

ModuleNotFoundError: No module named 'pandas.core.internals.managers'; 'pandas.core.internals' is not a package

It seems to be a problem with the pandas version.

pip install -U pandas

Upgrading pandas as above solved it.

Note that in Kaggle's official Docker image (kaggle/python), this error occurs with pandas==0.23.4 (as of December 9, 2019).
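One way to notice this kind of mismatch early is to compare the pandas version that wrote the pickle with the one that reads it. This is a rough sketch that assumes you note down the writer's version string yourself (for example by printing `pd.__version__` in the notebook that dumps). `version_tuple` and `reader_is_older` are hypothetical helper names, the version numbers are example values, and the parsing only handles plain `X.Y.Z` strings.

```python
def version_tuple(version):
    """Turn a plain 'X.Y.Z' string into a tuple of ints, e.g. '0.23.4' -> (0, 23, 4)."""
    return tuple(int(part) for part in version.split('.')[:3])


def reader_is_older(writer_version, reader_version):
    """True when the reading environment has an older version than the writing one."""
    return version_tuple(reader_version) < version_tuple(writer_version)


# Pickle written with 0.25.3 (example value), read where 0.23.4 is installed
print(reader_is_older('0.25.3', '0.23.4'))  # True
# Same version on both sides
print(reader_is_older('0.25.3', '0.25.3'))  # False
```

Comparing integer tuples rather than raw strings matters here: as strings, '0.9.0' would sort after '0.23.4'.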

This article saved me: "Inconsistency between pickle and pandas"

The end

Thank you for reading to the end

[DOCKER] [Explanation with image] Use pickle with Kaggle's NoteBook