Extracting specific data from JSON-formatted data in Cloudian/S3 object storage

Introduction

You can use the AWS SDK for Python (boto3) to access both Cloudian and S3 with almost no changes to your program. I hope this serves as a reference for those who want to use object storage in a hybrid setup (on-premises: Cloudian, AWS: S3).

Overview

This is a Python program that extracts specific data from a JSON-formatted data file in the bucket "boto3-cloudian" on Cloudian/S3 object storage.

The JSON data file is "test-iot-dummy.json", generated by this, and it contains 100,000 data items. From that data, the items whose "section" is "R" are extracted, and the number of extracted items and the time taken for extraction are displayed.
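
As a rough illustration of the data shape the program relies on, here is a minimal sketch. Only the top-level "items" list and the "section" key are taken from the program; the "device_id" and "value" fields and the three sample records are invented for this example, so the real file may well look different.

```python
import json

# Hypothetical miniature of test-iot-dummy.json. Only the "items" list and
# the "section" key come from the program; the other fields are invented.
SAMPLE_TEXT = json.dumps({
    "items": [
        {"device_id": 1, "section": "R", "value": 10.5},
        {"device_id": 2, "section": "Z", "value": 3.2},
        {"device_id": 3, "section": "R", "value": 7.8},
    ]
})


def count_section(text, section):
    """Parse the JSON text and count the items whose "section" matches."""
    items = json.loads(text)["items"]
    return sum(1 for item in items if item["section"] == section)


if __name__ == "__main__":
    print(count_section(SAMPLE_TEXT, "R"))  # 2
```

Because the whole document has to be parsed before filtering, the extraction time grows with the size of the file, not with the number of matching items.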

The "section" value to extract can also be given as a parameter (one uppercase letter of the alphabet). See also the help displayed when the program is run with the "-h" parameter.
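
The program in this article accepts any string for the parameter. If you want to enforce the "one uppercase letter" rule, one possible approach is an argparse type function. This is only a sketch; the names section_letter and build_parser are hypothetical helpers and not part of the original program.

```python
import argparse
import string


def section_letter(value):
    """argparse type function: accept exactly one uppercase ASCII letter."""
    if len(value) != 1 or value not in string.ascii_uppercase:
        raise argparse.ArgumentTypeError(
            f"expected one uppercase letter A-Z, got {value!r}"
        )
    return value


def build_parser():
    """Build a parser with the same option as the article's program."""
    parser = argparse.ArgumentParser(
        description="Extraction of dummy data for IoT devices"
    )
    parser.add_argument("--section", type=section_letter, default="R",
                        help="section to extract (one uppercase letter)")
    return parser


if __name__ == "__main__":
    # An invalid value such as "zz" would make argparse exit with an error.
    print(build_parser().parse_args(["--section", "Z"]).section)
```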

Execution environment

macOS Big Sur 11.1
Python 3.8.3

Definition of credential information

This time, the credentials are defined in .zshenv before running the program. Define them to match the connection destination.

# AWS S3
export AWS_ACCESS_KEY_ID=xxxxxxxxxxxxx
export AWS_SECRET_ACCESS_KEY=yyyyyyyyyyyyyyyyy
export AWS_DEFAULT_REGION=ap-northeast-1

# Cloudian
#export AWS_ACCESS_KEY_ID=aaaaaaaaaaaaaaaaaa
#export AWS_SECRET_ACCESS_KEY=bbbbbbbbbbbbbbbbbbbb
#export AWS_DEFAULT_REGION=pic

Execution program

To access Cloudian, specify endpoint_url (see the commented-out line in the program).
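
Instead of commenting lines in and out, the endpoint could also be switched from the environment, in the same way as the credentials. The following is a minimal sketch under that assumption. S3_ENDPOINT_URL is a hypothetical variable name chosen for this example, and s3_client_kwargs and make_client are hypothetical helpers; the only boto3 call used is boto3.client('s3', endpoint_url=...), as in the program.

```python
import os


def s3_client_kwargs(environ=None):
    """Build keyword arguments for boto3.client('s3', ...).

    S3_ENDPOINT_URL is a hypothetical variable name used only in this sketch.
    """
    environ = os.environ if environ is None else environ
    kwargs = {}
    endpoint = environ.get("S3_ENDPOINT_URL")
    if endpoint:
        # Cloudian (or another S3-compatible store): use an explicit endpoint
        kwargs["endpoint_url"] = endpoint
    # An empty dict means boto3 uses the default AWS S3 endpoint
    return kwargs


def make_client(environ=None):
    """Create the S3 client; boto3 is imported lazily so the helper above
    can be tried without boto3 installed."""
    import boto3
    return boto3.client("s3", **s3_client_kwargs(environ))
```

With this, adding a line such as export S3_ENDPOINT_URL=http://s3-pic.networld.local to the Cloudian block of .zshenv would be enough to switch the connection destination.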

`IoTSample-read.py`



import json
import time
import argparse
import boto3

BUCKET_NAME = 'boto3-cloudian'
OBJECT_KEY = 'test-iot-dummy.json'


# Get the whole object with the S3 API, then filter it in Python (item: "section")
def section_main(section):

    # client = boto3.client('s3', endpoint_url='http://s3-pic.networld.local')  # When accessing Cloudian
    client = boto3.client('s3')  # When accessing S3
    response = client.get_object(
        Bucket=BUCKET_NAME,
        Key=OBJECT_KEY
    )

    target = []
    if 'Body' in response:
        # Read the whole object into memory and parse it as JSON
        body = response['Body'].read()
        text = body.decode('utf-8')
        items = json.loads(text)['items']
        # Keep only the items whose "section" matches
        target = [x for x in items if x['section'] == section]
    # print(target)
    return len(target)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Extraction of dummy data for IoT devices')
    parser.add_argument('--section', type=str, default='R', help='Specify the section to extract (one uppercase letter of the alphabet)')
    args = parser.parse_args()

    start = time.time()
    
    num = section_main(args.section)

    select_time = time.time() - start

    print("")
    print(f"Number of data extractions:{num}")
    print(f"Extraction processing time(Normal_API):{select_time} [sec]")
    print("")
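
The time measured by the program covers the download, the JSON parsing and the filtering together. If you want to see where the time goes, the phases could be timed separately, for example as in the sketch below. The helper timed and the stand-in fake_download are hypothetical; in the real program the download step would be client.get_object(...)['Body'].read().

```python
import json
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def fake_download():
    """Stand-in for reading the object body from S3; returns bytes."""
    data = {"items": [{"section": "R"}, {"section": "Z"}]}
    return json.dumps(data).encode("utf-8")


def count_matches(body, section):
    """Parse the downloaded bytes and count the matching items."""
    items = json.loads(body.decode("utf-8"))["items"]
    return sum(1 for item in items if item["section"] == section)


if __name__ == "__main__":
    body, download_time = timed(fake_download)
    num, filter_time = timed(count_matches, body, "R")
    print(f"download: {download_time:.6f} [sec]")
    print(f"parse and filter: {filter_time:.6f} [sec] ({num} items)")
```

time.perf_counter() is used here because it is intended for measuring short intervals; time.time() as in the program also works for timings on the order of seconds.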

Program execution

Let's get help first.

$ python IoTSample-read.py -h             
usage: IoTSample-read.py [-h] [--section SECTION]

Extraction of dummy data for IoT devices

optional arguments:
  -h, --help         show this help message and exit
  --section SECTION Specify the section to extract (one uppercase letter of the alphabet)

Now, let's run the extraction with the default section ("R").

$ python IoTSample-read.py

Number of data extractions:3929
Extraction processing time(Normal_API):3.6708199977874756 [sec]

Next, run it with the section specified explicitly.

$ python IoTSample-read.py --section Z

Number of data extractions:3792
Extraction processing time(Normal_API):3.5160200595855713 [sec]

Summary

This time, I confirmed that data can be extracted from Cloudian/S3 object storage using the AWS SDK for Python (boto3). The target data was extracted from 100,000 data items in a few seconds (of course, this depends on the environment).

For more about Cloudian, see here (https://qiita.com/yamahiro/items/7b8a11c773106b641795).