From pandas 1.0.0, you can specify S3/GCS URLs with read_pickle() and to_pickle()!

pandas 1.0.0 has been released.

https://pandas.pydata.org/pandas-docs/stable/whatsnew/v1.0.0.html

New features include:

- pandas.NA, which handles different kinds of missing values in a unified way, introduced on an experimental basis (a quick sketch follows below)
- an option to speed up rolling.apply() with Numba
- a to_markdown() method that outputs a DataFrame in Markdown notation
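A minimal sketch of the first of these, using the nullable integer dtype (my own toy example, not from the release notes):

import pandas as pd

# With the nullable Int64 dtype, missing values now display as pd.NA
s = pd.Series([1, 2, None], dtype='Int64')
s
# >> 0       1
# >> 1       2
# >> 2    <NA>
# >> dtype: Int64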

Personally, though, what caught my eye was the casually written entry at the very end of the 'Other enhancements' section of the release notes above:

DataFrame.to_pickle() and read_pickle() now accept URL (GH30163)

This means you can save and read pickled data directly on cloud storage!?

So I went ahead and tried it.

Demonstration

First, create a simple DataFrame like the one below.

import pandas as pd

df = pd.DataFrame({'hoge': [1, 2, 3], 'fuga': [4, 5, 6], 'piyo': [7, 8, 9]})

This gives the following DataFrame. [^1]

|    |   hoge |   fuga |   piyo |
|---:|-------:|-------:|-------:|
|  0 |      1 |      4 |      7 |
|  1 |      2 |      5 |      8 |
|  2 |      3 |      6 |      9 |

[^1]: By the way, this table was output using df.to_markdown(), which was also added in pandas 1.0.0. Convenient.
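For reference, the call itself is just the following (as far as I know, to_markdown() relies on the tabulate package being installed):

print(df.to_markdown())
# >> prints the Markdown table shown above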

Save this to the AWS S3 bucket s3://tatamiya-test/ created in advance, then read it back. [^2] [^3]

[^2]: I will omit how to set up IAM and the credential keys.
[^3]: I haven't confirmed it, but GCS should work as well.
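One note on dependencies (my own addition, not from the original release notes): pandas hands s3:// URLs off to the s3fs package (and GCS URLs to gcsfs), so those need to be installed, and credentials are picked up the usual AWS way.

# Assumed setup, not covered in this article:
#   pip install s3fs    # needed for s3:// URLs
#   pip install gcsfs   # needed for GCS URLs
# Credentials are resolved via the standard mechanisms, e.g.
# environment variables, ~/.aws/credentials, or an instance role.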

For pandas 0.25.3

Saving and reading in .csv format was already possible with the previous version.

pd.__version__
# >> '0.25.3'

df.to_csv('s3://tatamiya-test/test.csv', index=None)

pd.read_csv('s3://tatamiya-test/test.csv')
# >> hoge	fuga	piyo
# >> 0	1	4	7
# >> 1	2	5	8
# >> 2	3	6	9

With pickle, however, the same approach failed because the path was not recognized.

pd.__version__
# >> '0.25.3'

df.to_pickle('s3://tatamiya-test/test.pkl')
# >> FileNotFoundError: [Errno 2] No such file or directory: 's3://tatamiya-test/test.pkl'

pd.read_pickle('s3://tatamiya-test/test.pkl')
# >> FileNotFoundError: [Errno 2] No such file or directory: 's3://tatamiya-test/test.pkl'
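For context, before 1.0.0 you needed a workaround along these lines, pickling to bytes yourself and going through a client library such as boto3 (a rough sketch, assuming credentials are already configured):

import pickle
import boto3

s3 = boto3.client('s3')

# Serialize the DataFrame to bytes in memory and upload it ourselves
s3.put_object(Bucket='tatamiya-test', Key='test.pkl', Body=pickle.dumps(df))

# Download and deserialize
obj = s3.get_object(Bucket='tatamiya-test', Key='test.pkl')
df_restored = pickle.loads(obj['Body'].read())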

For pandas 1.0.0

Now, let's try with the latest version 1.0.0.

pd.__version__
# >> '1.0.0'

df.to_pickle('s3://tatamiya-test/test.pkl')

pd.read_pickle('s3://tatamiya-test/test.pkl')
# >> hoge	fuga	piyo
# >> 0	1	4	7
# >> 1	2	5	8
# >> 2	3	6	9

I was able to confirm that it can be saved and read properly!

In closing

When processing data with pandas in the cloud, you could already read and write CSV files on S3 or GCS by passing the URL directly to read_csv() and to_csv().

However, when saving intermediate data after cleaning and processing, it is often better to serialize the DataFrame or Series directly as a Python object in byte form:

- the files are smaller
- reloading is faster
- there is no need to re-cast dtypes after reloading

to_pickle() and read_pickle() were already useful for exactly these reasons.
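To illustrate the last point with a toy local example of my own (the file name is just a placeholder): a datetime column survives a pickle round trip as-is, while it comes back as plain strings from CSV.

import io
import pandas as pd

df_dt = pd.DataFrame({'ts': pd.to_datetime(['2020-01-29']), 'x': [1]})

# CSV round trip: the datetime column comes back as object (plain strings)
buf = io.StringIO()
df_dt.to_csv(buf, index=None)
buf.seek(0)
pd.read_csv(buf).dtypes
# >> ts    object
# >> x      int64

# pickle round trip: dtypes are preserved
df_dt.to_pickle('tmp.pkl')
pd.read_pickle('tmp.pkl').dtypes
# >> ts    datetime64[ns]
# >> x              int64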

However, as shown above, it was previously not possible to specify an S3/GCS URL as the save destination, so you had to go through extra steps such as:

- using a client library
- saving locally and then uploading with a command-line tool
- mounting the target bucket on the VM instance beforehand

So while this is a low-key update, I personally really appreciate it!

(That said, since backward compatibility is not guaranteed, you can't necessarily move to 1.0.0 without changing existing code...)
