[Explanation with images] Use pickle with Kaggle Notebooks

Things to do

  1. Save a dataset's CSV file as a pickle file
  2. Read the pickle file from another notebook

What is pickle?

A standard library module that saves Python objects as binary data: https://docs.python.org/ja/3/library/pickle.html

What's the benefit?

  * Loading is fast: the data is already binary, so no parsing is needed.
  * Trained models can also be pickled and reused.
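As a rough sketch of the speed claim (the data and file names here are made up for illustration, and timings vary by machine):

```python
import pickle
import time

import numpy as np
import pandas as pd

# A synthetic frame standing in for a competition CSV (hypothetical data).
df = pd.DataFrame(np.random.rand(100_000, 10),
                  columns=[f'c{i}' for i in range(10)])
df.to_csv('sample.csv', index=False)
with open('sample.pickle', 'wb') as f:
    pickle.dump(df, f)

# Time a CSV read (needs text parsing) against a pickle read (raw binary).
t0 = time.perf_counter()
pd.read_csv('sample.csv')
csv_time = time.perf_counter() - t0

t0 = time.perf_counter()
with open('sample.pickle', 'rb') as f:
    restored = pickle.load(f)
pickle_time = time.perf_counter() - t0

print(f'CSV: {csv_time:.3f}s, pickle: {pickle_time:.3f}s')
```

On typical hardware the pickle read is noticeably faster, and the restored frame is identical to the original.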

This benchmark article is excellent: Python: I investigated the persistence formats of pandas

Try with Titanic data

For now, let's pickle train.csv. This is all the code you need:

# pickle is in the standard library, so no install is required
import pickle

import pandas as pd


train = pd.read_csv('../input/titanic/train.csv')

# Specify 'wb' (write binary)
with open('train.pickle', 'wb') as f:
    pickle.dump(train, f)

Save as Dataset

First, commit the notebook.

When the green Complete appears in the upper left, click Open Version.

Scroll down to the Output section.

If you can see train.pickle, click New Dataset.

Enter a Dataset title of your choice and click Create.

The Dataset is complete.

Bring it to another notebook

Create a new notebook and click + Add Data.

Filter by Your Datasets.

Add the dataset you just created.

If it shows up here, you've won.

Let's read

This is all the code you need:

import pickle

# Specify 'rb' (read binary)
with open('../input/titanicdatasetpickles/train.pickle', 'rb') as f:
    train = pickle.load(f)

It is properly loaded as a DataFrame.

train.shape

# (891, 12)
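One reason it comes back "properly" as a DataFrame: pickle also preserves dtypes that a CSV round-trip would lose. A small sketch with hypothetical data:

```python
import pickle

import pandas as pd

# Hypothetical frame with a dtype that a CSV round-trip would not preserve.
df = pd.DataFrame({'Pclass': pd.Categorical([1, 2, 3]),
                   'Fare': [7.25, 71.28, 8.05]})

with open('dtypes.pickle', 'wb') as f:
    pickle.dump(df, f)
with open('dtypes.pickle', 'rb') as f:
    restored = pickle.load(f)

# The 'category' dtype survives the pickle round-trip.
print(restored.dtypes)
```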

Note that the directory name may differ from the name displayed on the right side of the screen. The last part of the URL `https://www.kaggle.com/anata-no-namae/data-set-no-namae` becomes the directory name, but with the `-` characters removed, as you can see by checking with the `ls` command.
!ls ../input

# titanicdatasetpickles

Turning it into a script

Let's make the dump process reusable.

dump_pickles.py



import pickle

import pandas as pd


# Switch the input path between Kaggle and a local environment.
# Note: _dh is IPython's directory history, so this check only works in a notebook.
if '/kaggle/working' in _dh:
    input_path = '../input'
else:
    input_path = './input'

# Rewrite only this part for each competition
data_sets = {
    'train': f'{input_path}/titanic/train.csv',
    'test': f'{input_path}/titanic/test.csv',
    'gender_submission': f'{input_path}/titanic/gender_submission.csv'
}

for name, path in data_sets.items():
    df = pd.read_csv(path)
    with open(f'{name}.pickle', 'wb') as f:
        pickle.dump(df, f)
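A matching load loop could read everything back into a dict of DataFrames. This is a sketch; the stand-in frames at the top only replace the real competition CSVs so the example is self-contained:

```python
import pickle
from pathlib import Path

import pandas as pd

# Hypothetical stand-ins for the pickles written by the dump loop above.
for name in ['train', 'test']:
    with open(f'{name}.pickle', 'wb') as f:
        pickle.dump(pd.DataFrame({'id': [1, 2]}), f)

# Read every *.pickle in the working directory back into a dict of DataFrames.
loaded = {}
for path in Path('.').glob('*.pickle'):
    with open(path, 'rb') as f:
        loaded[path.stem] = pickle.load(f)

print(sorted(loaded))
```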

You can do the same with pandas

# This:
with open('./train.pickle', 'wb') as f:
    pickle.dump(train, f)

# is equivalent to this:
train.to_pickle('./train.pickle')

# And this:
with open('../input/titanicdatasetpickles/train.pickle', 'rb') as f:
    train = pickle.load(f)

# is equivalent to this:
train = pd.read_pickle('../input/titanicdatasetpickles/train.pickle')
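You can confirm that both APIs round-trip to the same result. A sketch with a hypothetical two-row frame:

```python
import pickle

import pandas as pd

# Hypothetical frame; both APIs restore an identical DataFrame.
train = pd.DataFrame({'Survived': [0, 1], 'Fare': [7.25, 71.28]})

with open('a.pickle', 'wb') as f:
    pickle.dump(train, f)
train.to_pickle('b.pickle')

with open('a.pickle', 'rb') as f:
    via_pickle = pickle.load(f)
via_pandas = pd.read_pickle('b.pickle')

# Both round-trips give back an equal DataFrame.
assert via_pickle.equals(via_pandas)
```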

You may sometimes get this error:

ModuleNotFoundError: No module named 'pandas.core.internals.managers'; 'pandas.core.internals' is not a package

It seems to be caused by a mismatch in pandas versions. It was solved by:

pip install -U pandas

This article saved me: Inconsistency between pickle and pandas

The end

Thank you for reading to the end
