Automatically updating AWS DynamoDB from CSV files

Introduction

I often output the data I collect by scraping as a CSV file. However, to use that data effectively, it has to be stored in a database or similar. Even if you save labor on collecting the data, it is painful to spend time operating the database, isn't it? Let's solve that problem with a cloud database service!

In this post, I will introduce how to automatically reflect the data in a CSV file in DynamoDB on AWS. Say goodbye to troublesome database update work!

Table of contents

- AWS CLI settings
- DynamoDB table update code (Python)
- Method 1: Periodically run with cron from a virtual machine (VM)
- Method 2: GitHub Actions
- Comparison of the two methods

This time, let's assume that you already have a DynamoDB table on AWS that you want to update automatically, and that you want to update it with a CSV file obtained by scraping.

AWS CLI settings

To connect to AWS, install the AWS CLI and perform the initial setup. Follow the prompts and register each setting. Prepare an access key and secret access key in advance by executing "Create access key" on the AWS security credentials page. Reference: Initial setting memo from AWS CLI installation

$ sudo pip install awscli
$ aws configure

AWS Access Key ID [None]: {access key}
AWS Secret Access Key [None]: {Secret access key}
Default region name [None]: {AWS Region}
Default output format [None]: {output format: json, yaml, text, or table}

Check the settings with aws configure list.
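If you also want to confirm from Python that the credentials actually work, here is a minimal sketch using STS (it simply asks AWS which account and identity the configured credentials belong to):

import boto3

# get_caller_identity() returns the account ID and ARN
# associated with the currently configured credentials
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print(identity["Account"], identity["Arn"])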

DynamoDB table update code (Python)

For example, suppose you have the following CSV file obtained by scraping.

Fried.csv


Ground Meat Cutlet,Bread crumbs,Minced meat,mother
Fried Shrimp,Bread crumbs,shrimp,mother
Shrimp heaven,Tempura powder,shrimp,mother
Imoten,Tempura powder,sweet potato,mother
Daifuku,Mochi,Anko,mother

Use this to update the following DynamoDB table.

| cuisine (primary key) | Clothing | contents | Cook |
|---|---|---|---|
| Ground Meat Cutlet | Bread crumbs | Minced meat | mother |
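This post assumes the table already exists, but for reference, here is a minimal sketch of creating it with boto3 (the table name "Fried" and the on-demand billing mode are my assumptions; only the primary key needs to be declared, because DynamoDB is schemaless for non-key attributes):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="Database region")

# Only the key attribute is declared up front; "Clothing", "contents",
# and "Cook" can simply be written later because DynamoDB is schemaless
table = dynamodb.create_table(
    TableName="Fried",
    KeySchema=[{"AttributeName": "cuisine", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "cuisine", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()  # block until the table is ACTIVE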

To **add** the CSV contents to DynamoDB, execute the update code below.

addDynamoDB.py


import csv

import boto3

def addDynamoDB():
    dynamodb = boto3.resource("dynamodb", region_name="Database region")
    table = dynamodb.Table("DynamoDB table name")  # Specify the table name and store the table in a variable

    filepath = "Fried.csv"
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        # batch_writer() writes all the CSV items, 25 at a time
        with table.batch_writer() as batch:
            for row in reader:
                item = {
                    "cuisine": row[0],
                    "Clothing": row[1],
                    "contents": row[2],
                    "Cook": row[3],
                }
                batch.put_item(Item=item)

if __name__ == "__main__":
    addDynamoDB()

**Database after executing addDynamoDB.py**

| cuisine (primary key) | Clothing | contents | Cook |
|---|---|---|---|
| Ground Meat Cutlet | Bread crumbs | Minced meat | mother |
| Fried Shrimp | Bread crumbs | shrimp | mother |
| Shrimp heaven | Tempura powder | shrimp | mother |
| Imoten | Tempura powder | sweet potato | mother |
| Daifuku | Mochi | Anko | mother |

You can now update DynamoDB.
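To check the result from Python, you can scan the table (a quick sketch; scan() reads the whole table, which is fine for a tiny example like this but slow and costly on large tables):

import boto3

dynamodb = boto3.resource("dynamodb", region_name="Database region")
table = dynamodb.Table("DynamoDB table name")

# A single scan() returns up to 1 MB of items, which is plenty here
for item in table.scan()["Items"]:
    print(item)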

But there is a problem here. Daifuku is not a fried food, is it? table.batch_writer() can add items to DynamoDB, but it cannot delete them. For example, even if you modify Fried.csv as follows, addDynamoDB.py leaves the Daifuku item in DynamoDB.

Fried.csv


Ground Meat Cutlet,Bread crumbs,Minced meat,mother
Fried Shrimp,Bread crumbs,shrimp,mother
Shrimp heaven,Tempura powder,shrimp,mother
Imoten,Tempura powder,sweet potato,mother

To **synchronize** the contents of DynamoDB with the CSV file, execute the update code below.

updateDynamoDB.py


import csv

import boto3
from boto3.dynamodb.conditions import Key

def updateDynamoDB():
    dynamodb = boto3.resource("dynamodb", region_name="Database region")
    table = dynamodb.Table("DynamoDB table name")  # Specify the table name and store the table in a variable

    # Query DynamoDB to get the current items
    response = table.query(
        IndexName="DynamoDB index name",  # Specify the index name
        # Add any other options below
        # Example: to get only the items whose "Cook" is "mother":
        KeyConditionExpression=Key("Cook").eq("mother")  # Example
    )
    dbItems = response["Items"]
    # A single response is capped at 1 MB, so keep querying
    # until LastEvaluatedKey is no longer included
    while "LastEvaluatedKey" in response:
        response = table.query(
            IndexName="DynamoDB index name",  # Specify the index name
            ExclusiveStartKey=response["LastEvaluatedKey"],
            # Add any other options below
            # Example: to get only the items whose "Cook" is "mother":
            KeyConditionExpression=Key("Cook").eq("mother")  # Example
        )
        dbItems.extend(response["Items"])

    csvItems = []
    filepath = "Fried.csv"
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        # batch_writer() writes all the CSV items, 25 at a time
        with table.batch_writer() as batch:
            for row in reader:
                item = {
                    "cuisine": row[0],
                    "Clothing": row[1],
                    "contents": row[2],
                    "Cook": row[3],
                }
                batch.put_item(Item=item)

                csvItems.append(row[0])

            # Items that no longer exist in the CSV file are
            # removed from the table with batch.delete_item()
            for dbItem in dbItems:
                if dbItem["cuisine"] not in csvItems:
                    batch.delete_item(
                        Key={
                            "cuisine": dbItem["cuisine"],
                        }
                    )

if __name__ == "__main__":
    updateDynamoDB()

**Database after executing updateDynamoDB.py**

| cuisine (primary key) | Clothing | contents | Cook |
|---|---|---|---|
| Ground Meat Cutlet | Bread crumbs | Minced meat | mother |
| Fried Shrimp | Bread crumbs | shrimp | mother |
| Shrimp heaven | Tempura powder | shrimp | mother |
| Imoten | Tempura powder | sweet potato | mother |

You can now update DynamoDB by running the code manually. So how do you automate this? There are several methods; this time I will introduce two of them.

Method 1: Periodically run with cron from a virtual machine (VM)

One method is to launch a VM instance in the cloud with the AWS CLI and other tools installed. Here I use a Google Compute Engine (GCE) free-tier instance, which I covered in a previous article. Past article: Crawling Easy to Start in the Cloud

Put the above code (see "DynamoDB table update code (Python)") on the VM instance and run it periodically with cron. Set up cron as follows with the crontab -u <username> -e command.

SHELL=/bin/bash
CRON_TZ="Japan"
00 00 01 * * python3 /<Any absolute path>/updateDynamoDB.py

Change the 00 00 01 * * part to the time and frequency at which you want DynamoDB updated.

If you have a crawler like the one in the last article, it is convenient to have the results reflected in DynamoDB as soon as the crawler finishes running, as shown in the sketch below.
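A minimal wrapper along those lines (a sketch; run_crawler() is a hypothetical stand-in for your own crawler):

from updateDynamoDB import updateDynamoDB

def run_crawler():
    # Hypothetical: replace with the crawler that writes Fried.csv
    ...

if __name__ == "__main__":
    run_crawler()      # write/refresh Fried.csv
    updateDynamoDB()   # reflect the fresh CSV in DynamoDB right away

Register this wrapper in cron instead of updateDynamoDB.py, and the database stays in sync with every crawl.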

Method 2: GitHub Actions

By managing the CSV file on GitHub, you can have DynamoDB updated automatically on every git push. Specifically, this is done with a GitHub Actions workflow.

Create a .github/workflows/ directory in your Git repository and define the workflow in YAML format. For example, the YAML file looks like this.

dynamoDB.yml


name: <Any Action name>

on:
  push:
    branches:
      # Run only on changes on the master branch
      - 'master'
    paths:
      # Run only on changes to these files
      - '**/Fried.csv'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
      with:
        fetch-depth: 25
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install Python libraries
      run: |
        python -m pip install --upgrade pip
        pip install boto3
    - name: Configure AWS credentials
      uses: aws-actions/configure-aws-credentials@v1
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: <Any region>
    - name: Update DynamoDB
      run: |
        python <path in the repository>/updateDynamoDB.py

Now a push of Fried.csv to the master branch triggers updateDynamoDB.py.
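Note that updateDynamoDB.py needs no credential handling of its own: the configure-aws-credentials step exports AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION as environment variables, and boto3 picks them up automatically. If you want the same script to run both locally and in the workflow, one option is to read the region from the environment (a sketch; the "us-east-1" fallback is just an illustrative assumption):

import os
import boto3

# boto3 uses AWS_DEFAULT_REGION when region_name is omitted;
# the fallback here only matters when the variable is not set
region = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")
dynamodb = boto3.resource("dynamodb", region_name=region)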

Regarding YAML notation, the following article was especially helpful alongside the official GitHub documentation. Thank you very much. Reference: Notes on how to use Github Actions. You can find out more about the "Configure AWS Credentials" GitHub Action on its official GitHub page (https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions).

Comparison of the two methods

|  | Run from a VM (GCE) | GitHub Actions |
|---|---|---|
| File timestamps | Yes | No |
| Environment | Requires a GCE VM instance | Requires a GitHub repository and git settings |
| Highly flexible operation[^1] | Relatively easy (cloud services are also available) | Difficult |
| Trigger | Basically limited to what cron can schedule | Pushes to the remote and pull request creation can be used as triggers |
| Fee | Free for a single f1-micro instance in a US region; for other regions or instance types, fees apply depending on the plan (e2-medium: $0.01005/hour) | Free for public repositories; for private repositories, fees apply once Actions run time exceeds 2,000 minutes/month ($0.008/minute on Linux) |
| VM specs | Selectable[^2] | Fixed[^3] |

[^1]: Special operations tend to be complicated. For example, if you want to output and browse intermediate files that you do not want to push to GitHub, you need to prepare separate storage for them or devise your git commands.
[^2]: Official documentation: Machine types | Compute Engine documentation | Google Cloud
[^3]: Standard_DS2_v2, 2 CPUs, 7 GB memory, 14 GB storage, MAC Stadium

My impression is that execution from a VM suits **tasks you want to combine with other cloud services, or long-running, complicated tasks**, while GitHub Actions suits **tasks you want triggered by git updates, or simple tasks with short execution times**.

Please use the one that suits your purpose.

References

- SSH key registration on GitHub
- SSH communication settings on GitHub
- Install AWS CLI (https://qiita.com/yuyj109/items/3163a84480da4c8f402c)
- Initial setting memo from AWS CLI installation
- Notes on how to use Github Actions
- Configure AWS Credentials Action for GitHub Actions
- Web scraping to get started easily with Python
- Crawling to get started easily in the cloud


