In February 2020, Oracle Cloud Infrastructure Data Science (OCI-Data Science) was released. Last time, I tried to query the data of the file on AWS S3 from OCI-Data Science (Qiita article: Try to access the file data of OCI Object Storage and AWS S3 using Oracle Cloud Infrastructure Data Science), In this article, I would like to carry out the procedure to access and acquire data on AWS Redshift using Python from OCI-Data Science. ..
-Let's use Oracle Cloud Infrastructure Data Science (OCI-Data Science) -OCI-Data Science Official Document -Oracle Accelerated Data Science SDK (ADS) Official Documentation -Qiita article: I tried Redshift for the first time! -What can Amazon Redshift do? Explaining AWS Data Warehouse Services -Read data from Redshift and put it in pandas data frame -Qiita article: Connect to ubuntu16.04@AWS postgresql from remote python3 (psycopg2)
Follow the procedure below.
Build a notebook environment for OCI-Data Science and make initial settings. Do the following with reference to "Let's use Oracle Cloud Infrastructure Data Science (OCI-Data Science)" in the above reference.
-Build a notebook environment after the basic settings of Oracle Cloud. -Use ** getting-started.ipynb ** to perform the initial work of the notebook environment (JupyerLab).
I created a Redshift cluster by referring to the above reference "Qiita article: I tried Redshift for the first time!".
This time, refer to "Trying to use Amazon Redshift" in "What can Amazon Redshift do? Explaining AWS data warehouse services" in the above reference, ** Creating an IAM role **, ** I registered the data by following the steps of Creating a table on Redshift **, ** Registering data from the csv file on S3 to the table on Redshift **.
This time, the following table is created and sample data is registered.
CREATE TABLE PURCHASE_ITEM (
CUST_ID integer,
AGE integer,
MARRIED VARCHAR(4000),
ADDRESS VARCHAR(4000),
CHILD VARCHAR(4000),
OCCUPATION VARCHAR(4000),
LASTCONTACT VARCHAR(4000),
LASTCALL integer,
CONTACT integer,
CONTACT_BEFORE_CAMPAIGN integer,
Purchased VARCHAR(4000)
);
Now, let's actually access the Redshift table from OCI-Data Science.
This time, we will use sqlalchemy-redshift
as described in the above reference" Reading data from Redshift and putting it in the pandas data frame ".
First, install the module ** sqlalchemy-redshift **.
pip install sqlalchemy-redshift
Then install the required libraries.
import redshift_sqlalchemy
from sqlalchemy import create_engine
Connect to Redshift.
engine = create_engine('{dialect}+{driver}://{user}:{pwd}@{url}:{port}/{db}'.format(
dialect = 'redshift',
driver = 'psycopg2',
user='awsuser', #Redshift username
pwd ='XXXXXX', #password
url='redshift-cluster-1.XXX.XXX.redshift.amazonaws.com', #Redshift cluster screen endpoint
port=5439, #Redshift port number
db='dev' #Redshift database name
))
Here, the following ** Connection timed out ** error occurred.
After investigating, it looks the same as the above reference material "Connecting to ubuntu16.04@AWS postgresql from remote python3 (psycopg2)", so set the ** VNC security group inbound rule ** of Redshift as follows. To do.
After setting the security group, I was able to connect normally.
Read the data on Redshift and put it in a data frame.
import pandas as pd
redshift_data = pd.read_sql_query('SELECT * FROM PURCHASE_ITEM limit 100;', engine)
Let's check the result.
redshift_data.head()
Certainly, I was able to get it safely.
This time, I tried to access the data on AWS Redshift from OCI-Data Science using ** sqlalchemy-redshift **.
There may be a better connection method than this, so please give it a try. (And tell me ...)
OCI-Data Science is a target service of Oracle Free Trial that allows you to use free credit for up to $ 300 for up to 30 days. Therefore, we recommend that you try it out.
Let's Enjoy Data Science!