Prepare a high-speed analysis environment by hitting mysql from the data analysis environment

I will write about how to connect to rds from a data analysis tool such as jupyter and get a data frame.

motivation

When you pull data from a database and want to analyze it

  1. Create the result set you want to type SQL
  2. Drop CSV on result set
  3. Put CSV in the analysis environment
  4. Read CSV and create data frame
  5. With the data frame at hand, ready to go

I think that it often follows the process.

However

――If you want to try various data frames --When the desired data frame changes depending on the situation ――If you always want fresh data

Under such circumstances, the previous process will be repeated many times. As a result, workers may repeatedly jump between the database and the analytical environment, or incur some communication costs between the engineer who prepares the data and the data scientist who analyzes it.

Therefore, I thought that the cost could be reduced by connecting the analysis environment to the database as it is.

What happened

I often use Python for data analysis, but this time I tried to connect to the database using Python's MySQL client mysqlclient. (Mysqlclient is the recommended driver for Python's web framework Django)

Without a bastion server

import MySQLdb

connect = MySQLdb.connect(
    host='rds endpoint',
    user='username',
    db='Database name',
    passwd='password'
)

cursor = connect.cursor()
sql = "Write a sql statement such as select as a string"
cursor.execute(sql)
for row in cursor.fetchall():
    print(row)
cursor.close()
connect.close()

The result will be returned as a tuple. Be sure to add limit to the sql specification because the process will not end if the table is large.

If there is a bastion server

If you have a bastion server, use the sshtunnel library to dig an ssh tunnel with mysqlclient.

import MySQLdb
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ('IP address of bastion server',Step server port number),
    ssh_host_key=None,
    ssh_username='Username to access the bastion server',
    ssh_password=None,
    ssh_pkey='ssh Specify the private key path',
    remote_bind_address=('rds endpoint',rds port number),
    local_bind_address=('local host',Arbitrary port number)
) as tunnel:
    connect = MySQLdb.connect(
        host='local host',
        port=Local on_bind_Arbitrary port number determined by address,
        user='username',
        db='Database name',
        passwd='password'
    )
    cursor = connect.cursor()
    sql = "Write a sql statement such as select as a string"
    cursor.execute(sql)
    for row in cursor.fetchall():
        print(row)
    cursor.close()
    connect.close()

The result is also returned as a tuple.

Summary

――I want to try various data frames ――The data frame you want changes depending on the situation ――I always want fresh data

We were able to meet these demands and provide an environment in which analysts can quickly obtain the data they want.

In addition, since variables can be specified on the python side, there was a secondary merit of being able to flexibly deal with it, such as being able to expand expressions into character strings that specify sql.

Recommended Posts

Prepare a high-speed analysis environment by hitting mysql from the data analysis environment
The first time a programming beginner tried simple data analysis by programming
A story about data analysis by machine learning
Predict stock prices by big data analysis from past data
How to plot the distribution of bacterial composition from Qiime2 analysis data in a box plot
Build a data analysis environment with Kedro + MLflow + Github Actions
Create a data frame from the acquired boat race text data
Build a python data analysis environment on Mac (El Capitan)
Build a Python environment and transfer data to the server
Gzip the data by streaming
A simple data analysis of Bitcoin provided by CoinMetrics in Python
Do a search by image from the camera roll using Pythonista3
[Unexpectedly known? ] Introducing a real day in the data analysis department
Perform morphological analysis in the machine learning environment launched by GCE
Get data from MySQL on a VPS with Python 3 and SQLAlchemy
Prepare a distributed load test environment with the Python load test tool Locust
[Mac] Create a Python3 execution environment from the fully initialized state
Create a USB boot Ubuntu with a Python environment for data analysis
First satellite data analysis by Tellus
Prepare the development environment with anyenv
Prepare the environment for Atom for Pythonista
Let's play with the corporate analysis data set "CoARiJ" created by TIS ①
Extract and plot the latest population data from the PDF data provided by the city
[Development environment] How to create a data set close to the production DB
Let's play with the corporate analysis data set "CoARiJ" created by TIS ②
(Preserved version: Updated from time to time) A collection of useful tutorials for data analysis hackathons by Team AI