[AWS] Search and acquire necessary data from S3 files with S3 Select

Thing you want to do

I want to use S3 Select to search S3 files with SQL queries to get the records I need. Qiita logo

A happy event

Points of S3 Select

Try to operate from the console screen of S3

This time, we will place a sample file in the following format on S3 and verify it.

sample.tsv


male	14	student
female	26	employee
male	32	selfemployed
male	45	unemployed
female	11	student
male	24	employee
male	33	selfemployed
male	49	unemployed
female	57	unemployed

Select the target file from the S3 console screen and select the S3 Select tab. 1.png

File format: CSV Delimiter: tab Compression: None "next" 2.png

Enter the SQL you want to execute. I want to get a record whose first column is "female".

SQL


select * from s3object s where s._1 = 'female'

3.png I was able to get it properly. You can download it as a CSV file by clicking "Download".

There are some sample SQLs, so give them a try. 5.png

Run SQL queries in Python from an EC2 instance

I'll also try the pattern of executing SQL queries in Python from an EC2 instance. First, install the required Python.

#Install Python3
$ sudo yum update -y
$ sudo yum install python3 -y

#Enable virtual environment
$ python3 -m venv ~/s3select_example/env
$ source ~/s3select_example/env/bin/activate

#install boto3
$ pip install pip --upgrade
$ pip install boto3

Next, create a Python file.

S3Select.py


import boto3

s3 = boto3.client('s3')

resp = s3.select_object_content(
    Bucket='my-bucket.01',
    Key='sample/sample.tsv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object s where s._1 = 'female'",
    InputSerialization = {'CSV': {"FileHeaderInfo": "NONE", 'FieldDelimiter': '\t'}, 'CompressionType': 'NONE'},
    OutputSerialization = {'CSV': {}},
)

for event in resp['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

Match Bucket and Key to the file to be read by S3. Also, since the target file this time is tab-delimited, "'\ t'" is specified for the Field Delimiter of Input Serialization.

Let's run it.

$ python S3Select.py

female,26,employee
female,11,student
female,57,unemployed

The same record as the result of executing from the console screen was output!

This time, I searched the S3 file directly with SQL query and fetched the target record. I sometimes investigate S3 files, so I would like to use it if I have the opportunity.

Reference: https://aws.amazon.com/jp/blogs/news/querying-data-without-servers-or-databases-using-amazon-s3-select/

Recommended Posts

[AWS] Search and acquire necessary data from S3 files with S3 Select
[AWS] Link Lambda and S3 with boto3
Settings when reading S3 files with pandas from Jupyter Notebook on AWS
Extract data from S3
Aggregate AWS S3 data
Getting Started with Drawing with matplotlib: Creating Diagrams from Data Files
Repeat with While. Scripts to Tweet and search from the terminal
Get data from MySQL on a VPS with Python 3 and SQLAlchemy
Manipulating kintone data with Python & C Data ODBC Driver from AWS Lambda
Upload data to s3 of aws with a command and update it, and delete the used data (on the way)
Let's upload S3 files with CLI
Read and use Python files from Python
Install the data files with setup.py
Save and retrieve files with Pepper
Easy AWS S3 testing with MinIO
Manage your data with AWS RDS
Access WebAPI with Python and register / acquire IoT data (dweet.io, Requests, HTTPie)
Face detection from multiple image files with openCV, cut out and save
From pandas1.0.0, you can specify S3 / GCS URL with read_pickle () and to_pickle ()!
Copy data from Amazon S3 to Google Cloud Storage with Python (boto)
Make a decision tree from 0 with Python and understand it (4. Data structure)