If you are running Jupyter Notebook on EC2 on AWS, you can read data directly from S3 by passing the S3 path to pd.read_csv() as shown below.
import os
import pandas as pd

os.environ["AWS_ACCESS_KEY_ID"] = "XXX..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "YYY..."

all_df = pd.read_csv("s3n://mybucket/path/to/dir/data.csv")
As shown above, you need to set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables beforehand.
To be honest, making these settings every time is a hassle. Many people type `import pandas as pd` out of habit, but few can remember keys that are random strings. Moreover, this is private information that you do not want to include when sharing a notebook with others, so deleting it every time is inefficient and risky.
Therefore, let's configure things in advance so that Jupyter Notebook can access files on S3 without you having to think about it. There are several ways to do this, and I will introduce each one.

- Load the settings when jupyter notebook starts
- Set the environment variables directly in the shell
- Write them in the boto profile

You only need to do one of these three; there are no major pros and cons, so pick whichever you like.
It is assumed that jupyter notebook is installed.
Save the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY settings to ~/.ipython/profile_default/startup/ under the file name 00_startup.ipy. The file name is arbitrary, but if you prefix it with a number, the files are executed in that order.
os.environ["AWS_ACCESS_KEY_ID"] = "XXX..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "YYY..."
The file layout looks like the following.
$ tree .ipython/profile_default/startup
.ipython/profile_default/startup
├── 00_startup.ipy
└── README
By doing this, the code above is executed when jupyter notebook starts, so you no longer have to set `os.environ` from within the notebook.
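For example, once the startup file is in place, a notebook cell can read from S3 directly. A minimal sketch, reusing the same placeholder bucket and path as above:

import pandas as pd

# The keys were already set by 00_startup.ipy, so no os.environ calls are needed here.
all_df = pd.read_csv("s3n://mybucket/path/to/dir/data.csv")
all_df.head()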
This method sets them as shell environment variables. Write them in your shell's configuration file, such as .bash_profile or .zshrc, so they are loaded when the shell starts, or run the commands directly at the prompt.
export AWS_ACCESS_KEY_ID=XXX...
export AWS_SECRET_ACCESS_KEY=YYY...
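If you want to confirm from a notebook that the exported variables are actually visible, a quick check like the following sketch will do:

import os

# Both should print True if the shell that launched jupyter notebook had the keys exported.
print(os.environ.get("AWS_ACCESS_KEY_ID") is not None)
print(os.environ.get("AWS_SECRET_ACCESS_KEY") is not None)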
Behind the scenes, pandas uses boto, a Python AWS SDK, and you can also write the credentials directly in that package's profile. Save the following with the file name ~/.boto.
[Credentials]
aws_access_key_id = XXX...
aws_secret_access_key = YYY...
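To check that boto actually picks up ~/.boto, a minimal boto2 sketch like the following can be used (the bucket name and prefix are placeholders):

import boto

# boto2 reads ~/.boto automatically, so no keys are passed explicitly.
conn = boto.connect_s3()
bucket = conn.get_bucket("mybucket")  # placeholder bucket name
for key in bucket.list(prefix="path/to/dir/"):
    print(key.name)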
== Addendum: 2016/10/03 ==
The method above is for boto2. boto3 also reads ~/.boto, but since boto3 searches for credentials in the order below, it is better to write them in ~/.aws/credentials, AWS's own credentials file; that way the settings are also shared with the awscli command.
The mechanism in which boto3 looks for credentials is to search through a list of possible locations and stop as soon as it finds credentials. The order in which Boto3 searches for credentials is:
- Passing credentials as parameters in the boto.client() method
- Passing credentials as parameters when creating a Session object
- Environment variables
- Shared credential file (~/.aws/credentials)
- AWS config file (~/.aws/config)
- Assume Role provider
- Boto2 config file (/etc/boto.cfg and ~/.boto)
- Instance metadata service on an Amazon EC2 instance that has an IAM role configured.
[default]
aws_access_key_id=foo
aws_secret_access_key=bar
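As a sanity check that boto3 is resolving credentials from the shared file, something like the following sketch can be used (boto3 must be installed separately):

import boto3

# boto3 searches the locations listed above in order;
# with ~/.aws/credentials in place no keys need to be passed explicitly.
session = boto3.Session()
creds = session.get_credentials()
print(creds is not None)  # True if boto3 found credentials somewhere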
== End of postscript ==
Then, when you specify an S3 path in pandas, boto is imported and the profile above is loaded (source). Note that `pip install pandas` alone will not install boto, because it is not included in pandas' dependencies; run `pip install boto` before using this.
As described above, there are several ways to configure S3 access from pandas. All of them are quick to do when setting up an environment, so be sure to configure one of them.
These methods are not limited to Jupyter Notebook on AWS; as long as you have AWS keys you can connect to S3 from any environment, so you can also call pd.read_csv("s3n://...") locally. However, be careful: the data is then transferred from S3 to outside AWS, which incurs data transfer charges, and heavy data is awkward to handle in the first place because of the transfer time. If you want to run source code that assumes S3 in a different environment, check this point. It may be wise not to set the AWS keys as the default locally.