If you are running Jupyter Notebook on EC2 on AWS, you can read data directly from S3 by passing the S3 path to pd.read_csv() as shown below.
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "XXX..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "YYY..."

all_df = pd.read_csv("s3n://mybucket/path/to/dir/data.csv")
As shown above, you need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables beforehand.
To be honest, making these settings every time is a hassle. Many people type `import pandas as pd` out of habit, but few can remember keys that are random strings. Moreover, this is private information that you do not want to include when sharing a notebook with others, so deleting it every time is inefficient and risky.
Therefore, let's configure things in advance so that Jupyter Notebook can access files on S3 without you having to think about it. There are several ways to do this, and I will introduce each one.

- Load the settings when jupyter notebook starts
- Set the environment variables directly in the shell
- Write them in the boto profile

You only need to do one of these three; there are no major pros and cons, so pick whichever you like.
It is assumed that jupyter notebook is installed.
Save the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings to ~/.ipython/profile_default/startup/ under the file name 00_startup.ipy. The file name is arbitrary, but if you prefix it with a number, the files are executed in that order.
os.environ["AWS_ACCESS_KEY_ID"] = "XXX..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "YYY..."
The file layout looks like the following.
$ tree .ipython/profile_default/startup
.ipython/profile_default/startup
├── 00_startup.ipy
└── README
By doing this, the code above is executed when jupyter notebook starts, so you no longer have to set `os.environ` from within the notebook.
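For example, once the startup file is in place, a notebook cell can read from S3 directly. A minimal sketch, reusing the same placeholder bucket and path as above:

import pandas as pd

# The keys were already set by 00_startup.ipy, so no os.environ calls are needed here.
all_df = pd.read_csv("s3n://mybucket/path/to/dir/data.csv")
all_df.head()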
This method sets them as shell environment variables. Write them in your shell's configuration file, such as .bash_profile or .zshrc, so they are loaded when the shell starts, or run the commands directly at the prompt.
export AWS_ACCESS_KEY_ID=XXX...
export AWS_SECRET_ACCESS_KEY=YYY...
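If you want to confirm from a notebook that the exported variables are actually visible, a quick check like the following sketch will do:

import os

# Both should print True if the shell that launched jupyter notebook had the keys exported.
print(os.environ.get("AWS_ACCESS_KEY_ID") is not None)
print(os.environ.get("AWS_SECRET_ACCESS_KEY") is not None)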
Behind the scenes, pandas uses boto, a Python AWS SDK, and you can also write the credentials directly in that package's profile. Save the following with the file name ~/.boto.
[Credentials]
aws_access_key_id = XXX...
aws_secret_access_key = YYY...
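To check that boto actually picks up ~/.boto, a minimal boto2 sketch like the following can be used (the bucket name and prefix are placeholders):

import boto

# boto2 reads ~/.boto automatically, so no keys are passed explicitly.
conn = boto.connect_s3()
bucket = conn.get_bucket("mybucket")  # placeholder bucket name
for key in bucket.list(prefix="path/to/dir/"):
    print(key.name)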
== Addendum: 2016/10/03 ==
The method above is for boto2. boto3 also reads ~/.boto, but since boto3 searches for credentials in the order below, it is better to write them in ~/.aws/credentials, AWS's own credentials file; that way the settings are also shared with the awscli command.
The mechanism in which boto3 looks for credentials is to search through a list of possible locations and stop as soon as it finds credentials. The order in which Boto3 searches for credentials is:
- Passing credentials as parameters in the boto.client() method
- Passing credentials as parameters when creating a Session object
- Environment variables
- Shared credential file (~/.aws/credentials)
- AWS config file (~/.aws/config)
- Assume Role provider
- Boto2 config file (/etc/boto.cfg and ~/.boto)
- Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
[default]
aws_access_key_id=foo
aws_secret_access_key=bar
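As a sanity check that boto3 is resolving credentials from the shared file, something like the following sketch can be used (boto3 must be installed separately):

import boto3

# boto3 searches the locations listed above in order;
# with ~/.aws/credentials in place no keys need to be passed explicitly.
session = boto3.Session()
creds = session.get_credentials()
print(creds is not None)  # True if boto3 found credentials somewhere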
== End of postscript ==
Then, when you specify an S3 path in pandas, boto is imported and the profile above is loaded (source). Note that `pip install pandas` alone will not install boto, because it is not included in pandas' dependencies; run `pip install boto` before using this.
As described above, there are several ways to configure S3 access from pandas. All of them are quick to do when setting up an environment, so be sure to configure one of them.
These methods are not limited to Jupyter Notebook on AWS; as long as you have AWS keys you can connect to S3 from any environment, so you can also call pd.read_csv("s3n://...") locally. However, be careful: the data is then transferred from S3 to outside AWS, which incurs data transfer charges, and heavy data is awkward to handle in the first place because of the transfer time. If you want to run source code that assumes S3 in a different environment, check this point. It may be wise not to set the AWS keys as the default locally.