Extracting specific data from JSON data in Cloudian/S3 object storage

Introduction

The AWS SDK for Python (boto3) can access both Cloudian and S3 with almost no changes to your program. I hope this serves as a reference for anyone who wants to use object storage in a hybrid setup (on-premises: Cloudian, AWS: S3).

Overview

This is a Python program that extracts specific data from a JSON data file stored in the bucket "boto3-cloudian" on Cloudian/S3 object storage.

The JSON data file, "test-iot-dummy.json", was generated beforehand and contains 100,000 records. The program extracts the records whose "section" item equals "R", then displays the number of extracted records and the time the extraction took.
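
As a rough illustration of the extraction described above, the sketch below filters a tiny in-memory sample instead of the real 100,000-record file. The top-level "items" list and the "section" key match what the program in this article reads; the other fields ("device_id", "value") and the function name extract_section are made up for the example.

```python
import json

# Tiny stand-in for test-iot-dummy.json. Only "items" and "section" come from
# the article; "device_id" and "value" are hypothetical fields.
sample_text = json.dumps({
    "items": [
        {"device_id": 1, "section": "R", "value": 10},
        {"device_id": 2, "section": "Z", "value": 20},
        {"device_id": 3, "section": "R", "value": 30},
    ]
})


def extract_section(text, section):
    """Return the records whose "section" item equals the given letter."""
    items = json.loads(text)["items"]
    return [item for item in items if item["section"] == section]


matched = extract_section(sample_text, "R")
print(f"Number of data extractions:{len(matched)}")
```

The real program does the same thing after downloading the whole object, so the filtering itself happens on the client side, not in the object storage.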

The "section" value to extract can also be passed as a parameter (one uppercase letter). See the help shown when the program is run with "-h".
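
The program in this article declares the parameter with type=str, so it does not actually check for "one uppercase letter". If that check is wanted, one possible approach is a custom argparse type, sketched below. The function names section_letter and build_parser are hypothetical and not part of the original program.

```python
import argparse
import string


def section_letter(value):
    """argparse type: accept exactly one uppercase ASCII letter."""
    if len(value) == 1 and value in string.ascii_uppercase:
        return value
    raise argparse.ArgumentTypeError(
        f"expected one uppercase letter (A-Z), got {value!r}"
    )


def build_parser():
    parser = argparse.ArgumentParser(
        description='Extraction of dummy data for IoT devices'
    )
    parser.add_argument(
        '--section', type=section_letter, default='R',
        help='Specify the section to extract (one uppercase letter of the alphabet)'
    )
    return parser


# Parse an explicit argument list here so the sketch runs without a command line
args = build_parser().parse_args(['--section', 'Z'])
print(args.section)
```

With this type in place, a value such as "zz" makes argparse print a usage error and exit instead of silently returning zero matches.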

Execution environment

macOS Big Sur 11.1
Python 3.8.3

Definition of credential information

This time, the credentials are defined in .zshenv before the program is run. Define them to match the destination you are connecting to.

# AWS S3
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyy
export AWS_DEFAULT_REGION=ap-northeast-1

# Cloudian
#export AWS_ACCESS_KEY_ID=aaaaaaaaaaaaaaaaaa
#export AWS_SECRET_ACCESS_KEY=bbbbbbbbbbbbbbbbbbbb
#export AWS_DEFAULT_REGION=pic
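
Before running the program, it can help to confirm that the three variables exported above are actually visible to Python. The following is a minimal sketch of such a check; the helper name missing_credentials is hypothetical.

```python
import os

# Environment variables that the .zshenv snippet above exports
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_DEFAULT_REGION")


def missing_credentials(env):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_credentials(os.environ)
if missing:
    print("Not set: " + ", ".join(missing))
else:
    print("All credential variables are set")
```

The check only looks at whether the variables exist; it does not verify that the keys are valid for the destination.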

Execution program

To access Cloudian, specify endpoint_url when creating the client (see the commented-out line in the program).

IoTSample-read.py



import json
import time
import argparse
import boto3

BUCKET_NAME = 'boto3-cloudian'
OBJECT_KEY = 'test-iot-dummy.json'


# Fetch the whole object with the S3 GetObject API, then filter it in Python on the "section" item
def section_main(section):

    # client = boto3.client('s3', endpoint_url='http://s3-pic.networld.local')  # When accessing Cloudian
    client = boto3.client('s3')  # When accessing S3
    response = client.get_object(
        Bucket=BUCKET_NAME,
        Key=OBJECT_KEY
    )

    target = []
    if 'Body' in response:
        # Read the entire object into memory and parse it as JSON
        body = response['Body'].read()
        text = body.decode('utf-8')
        items = json.loads(text)['items']
        # Keep only the records whose "section" matches the requested value
        target = [x for x in items if x['section'] == section]
    # print(target)
    return len(target)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Extraction of dummy data for IoT devices')
    parser.add_argument('--section', type=str, default='R', help='Specify the section to extract (one uppercase letter of the alphabet)')
    args = parser.parse_args()

    start = time.time()
    
    num = section_main(args.section)

    select_time = time.time() - start

    print("")
    print(f"Number of data extractions:{num}")
    print(f"Extraction processing time(Normal_API):{select_time} [sec]")
    print("")
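
The program above switches between Cloudian and S3 by commenting one line in or out. As an alternative, the endpoint could be taken from the environment so the source stays unchanged. The sketch below assumes a variable named S3_ENDPOINT_URL, which is a hypothetical name chosen for this example, and the helper names client_kwargs and make_client are also made up.

```python
import os


def client_kwargs(env):
    """Build keyword arguments for boto3.client('s3', ...).

    S3_ENDPOINT_URL is a hypothetical variable name chosen for this sketch.
    """
    endpoint = env.get("S3_ENDPOINT_URL")
    return {"endpoint_url": endpoint} if endpoint else {}


def make_client():
    # Imported here so the helper above can be tried without boto3 installed
    import boto3
    return boto3.client("s3", **client_kwargs(os.environ))


# Unset -> AWS S3 defaults; set -> Cloudian endpoint
print(client_kwargs({}))
print(client_kwargs({"S3_ENDPOINT_URL": "http://s3-pic.networld.local"}))
```

With this approach, section_main would call make_client() instead of boto3.client('s3'), and the endpoint would be exported alongside the credentials when connecting to Cloudian.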

Program execution

First, display the help.

$ python IoTSample-read.py -h
usage: IoTSample-read.py [-h] [--section SECTION]

Extraction of dummy data for IoT devices

optional arguments:
  -h, --help         show this help message and exit
  --section SECTION  Specify the section to extract (one uppercase letter of the alphabet)

Now run the extraction with the default section ("R").

$ python IoTSample-read.py

Number of data extractions:3929
Extraction processing time(Normal_API):3.6708199977874756 [sec]

Next, run it with the section specified explicitly.

$ python IoTSample-read.py --section Z

Number of data extractions:3792
Extraction processing time(Normal_API):3.5160200595855713 [sec]
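
The time reported above covers both downloading the object and parsing and filtering it. To see which part dominates, the two steps could be timed separately. The sketch below shows the idea with a generated stand-in for the download, so it runs without any object storage; fake_fetch, timed, and count_section are hypothetical names, and the numbers it prints say nothing about real Cloudian or S3 performance.

```python
import json
import time


def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def fake_fetch():
    # Stand-in for response['Body'].read(): 1,000 generated records,
    # alternating between sections "R" and "Z"
    items = [{"section": "R" if i % 2 == 0 else "Z"} for i in range(1000)]
    return json.dumps({"items": items}).encode("utf-8")


def count_section(body, section):
    """Parse the JSON bytes and count the records in the given section."""
    items = json.loads(body.decode("utf-8"))["items"]
    return sum(1 for item in items if item["section"] == section)


body, fetch_time = timed(fake_fetch)
num, filter_time = timed(count_section, body, "R")
print(f"Number of data extractions:{num}")
print(f"Fetch time:{fetch_time} [sec]")
print(f"Parse and filter time:{filter_time} [sec]")
```

In the real program, fake_fetch would be replaced by the get_object call and the read of its Body.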

Summary

This time, I confirmed that data can be extracted from Cloudian/S3 object storage using the AWS SDK for Python (boto3). The target data was extracted from 100,000 records in a few seconds (this depends on the environment, of course).

For more about Cloudian, see here (https://qiita.com/yamahiro/items/7b8a11c773106b641795).
