I often output data collected by scraping as a CSV file. To use that data efficiently, however, it has to end up in a database or a similar store. Even if data collection itself is automated, spending time maintaining the database is painful, isn't it? Let's solve that problem with a cloud database service!
In this post, I will show how to automatically reflect the contents of a CSV file in DynamoDB on AWS, so you can say goodbye to tedious manual database updates.
- AWS CLI settings
- DynamoDB table update code (Python)
- Method 1: Run periodically with cron from a virtual machine (VM)
- Method 2: GitHub Actions
- Comparison of the two methods
This article assumes you already have a DynamoDB table on AWS that you want to update automatically, and a CSV file obtained by scraping to update it with.
To connect to AWS, install the AWS CLI and run the initial setup, following the prompts to register each setting. Prepare an access key and secret access key beforehand by running "Create access key" on the AWS security credentials page. Reference: Initial setting memo from AWS CLI installation
```
$ sudo pip install awscli
$ aws configure
AWS Access Key ID [None]: {access key}
AWS Secret Access Key [None]: {secret access key}
Default region name [None]: {AWS region}
Default output format [None]: {output format: json, yaml, text, or table}
```
Check the settings with `aws configure list`.
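If you also want to confirm from Python that the credentials actually work, here is a quick check with boto3 (this assumes `pip install boto3` has been run; "ap-northeast-1" is only an example region, so use the region where your table lives):

```python
import boto3

# List up to ten DynamoDB tables visible with the configured credentials.
# "ap-northeast-1" is only an example region.
dynamodb = boto3.resource("dynamodb", region_name="ap-northeast-1")
print([table.name for table in dynamodb.tables.limit(10)])
```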
For example, suppose you have the following CSV file obtained by scraping.
Fried.csv
```
Ground Meat Cutlet,Bread crumbs,Minced meat,mother
Fried Shrimp,Bread crumbs,shrimp,mother
Shrimp heaven,Tempura powder,shrimp,mother
Imoten,Tempura powder,sweet potato,mother
Daifuku,Mochi,Anko,mother
```
Use it to update the following DynamoDB table.

| cuisine (primary key) | Clothing | contents | Cook |
|---|---|---|---|
| Ground Meat Cutlet | Bread crumbs | Minced meat | mother |
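If the table does not exist yet, it can also be created from boto3. Here is a minimal sketch, assuming the table is named "Fried", "cuisine" is the partition key, on-demand billing is acceptable, and a global secondary index on "Cook" (used by the sync code later) is wanted; adjust the names and region to your own setup:

```python
import boto3

dynamodb = boto3.resource("dynamodb", region_name="ap-northeast-1")  # example region

# Create the example table with "cuisine" as the partition key and a
# GSI on "Cook"; PAY_PER_REQUEST billing avoids capacity planning.
table = dynamodb.create_table(
    TableName="Fried",
    KeySchema=[{"AttributeName": "cuisine", "KeyType": "HASH"}],
    AttributeDefinitions=[
        {"AttributeName": "cuisine", "AttributeType": "S"},
        {"AttributeName": "Cook", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "Cook-index",
            "KeySchema": [{"AttributeName": "Cook", "KeyType": "HASH"}],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()
print(f"created {table.table_name}")
```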
To **add** the CSV contents to DynamoDB, execute the update code below.
addDynamoDB.py
```python
import csv

import boto3


def addDynamoDB():
    dynamodb = boto3.resource("dynamodb", region_name="Database region")
    table = dynamodb.Table("DynamoDB table name")  # Specify the table name and store it in a variable

    filepath = "Fried.csv"
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        # batch_writer() writes all the CSV rows, sending them in batches of up to 25 items
        with table.batch_writer() as batch:
            for row in reader:
                item = {
                    "cuisine": row[0],  # primary (partition) key
                    "Clothing": row[1],
                    "contents": row[2],
                    "Cook": row[3],
                }
                batch.put_item(Item=item)


if __name__ == "__main__":
    addDynamoDB()
```
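As a side note, the scraped CSV in this example has no header row. If yours does, `csv.DictReader` is a convenient variant. This is only a sketch; it assumes the header names match the DynamoDB attribute names exactly:

```python
import csv

import boto3

dynamodb = boto3.resource("dynamodb", region_name="Database region")
table = dynamodb.Table("DynamoDB table name")

with open("Fried.csv", "r", encoding="utf-8") as f:
    # DictReader treats the first line as the header and maps columns by name
    reader = csv.DictReader(f)
    with table.batch_writer() as batch:
        for row in reader:
            batch.put_item(Item=dict(row))
```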
**Database after executing addDynamoDB.py**
| cuisine (primary key) | Clothing | contents | Cook |
|---|---|---|---|
| Ground Meat Cutlet | Bread crumbs | Minced meat | mother |
| Fried Shrimp | Bread crumbs | shrimp | mother |
| Shrimp heaven | Tempura powder | shrimp | mother |
| Imoten | Tempura powder | sweet potato | mother |
| Daifuku | Mochi | Anko | mother |
You can now add items to DynamoDB.
But there is a problem here: Daifuku is not a fried food, is it?
`table.batch_writer()` can add items to DynamoDB, but it cannot delete them.
For example, even if you remove the Daifuku line from Fried.csv as follows, addDynamoDB.py leaves that item in DynamoDB.
Fried.csv
```
Ground Meat Cutlet,Bread crumbs,Minced meat,mother
Fried Shrimp,Bread crumbs,shrimp,mother
Shrimp heaven,Tempura powder,shrimp,mother
Imoten,Tempura powder,sweet potato,mother
```
To **synchronize** the contents of DynamoDB with the CSV, execute the update code below.
updateDynamoDB.py
```python
import csv

import boto3
from boto3.dynamodb.conditions import Key


def updateDynamoDB():
    dynamodb = boto3.resource("dynamodb", region_name="Database region")
    table = dynamodb.Table("DynamoDB table name")  # Specify the table name and store it in a variable

    # Query DynamoDB and get the existing items
    response = table.query(
        IndexName="DynamoDB index name",  # Specify the index name
        # Add any other options below
        # Example: to get only the items whose "Cook" is "mother", set the following
        KeyConditionExpression=Key("Cook").eq("mother")  # Example
    )
    dbItems = response["Items"]
    # If the response exceeds 1 MB, keep querying until LastEvaluatedKey is no longer included
    while "LastEvaluatedKey" in response:
        response = table.query(
            IndexName="DynamoDB index name",  # Specify the index name
            ExclusiveStartKey=response["LastEvaluatedKey"],
            # Add any other options here
            # Example: to get only the items whose "Cook" is "mother", set the following
            KeyConditionExpression=Key("Cook").eq("mother")  # Example
        )
        dbItems.extend(response["Items"])

    csvItems = []
    filepath = "Fried.csv"
    with open(filepath, "r", encoding="utf-8") as f:
        reader = csv.reader(f)
        # batch_writer() writes all the CSV rows, sending them in batches of up to 25 items
        with table.batch_writer() as batch:
            for row in reader:
                item = {
                    "cuisine": row[0],  # primary (partition) key
                    "Clothing": row[1],
                    "contents": row[2],
                    "Cook": row[3],
                }
                batch.put_item(Item=item)
                csvItems.append(row[0])
            # Rows that no longer exist in the CSV file are deleted from the table with batch.delete_item()
            for dbItem in dbItems:
                if dbItem["cuisine"] not in csvItems:
                    batch.delete_item(
                        Key={
                            "cuisine": dbItem["cuisine"],
                        }
                    )


if __name__ == "__main__":
    updateDynamoDB()
```
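Note that the query above assumes a global secondary index whose key condition (here, Cook = "mother") covers every item you manage. If your table has no such index, the existing items could be fetched with a paginated scan instead. A minimal sketch, reusing the placeholder table name from the example:

```python
import boto3

# Scan-based variant for fetching the existing items when no suitable
# global secondary index exists.
dynamodb = boto3.resource("dynamodb", region_name="Database region")
table = dynamodb.Table("DynamoDB table name")

dbItems = []
response = table.scan()
dbItems.extend(response["Items"])
# Scan results are also paginated at 1 MB, so follow LastEvaluatedKey
while "LastEvaluatedKey" in response:
    response = table.scan(ExclusiveStartKey=response["LastEvaluatedKey"])
    dbItems.extend(response["Items"])

print(f"{len(dbItems)} items currently in the table")
```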
**Database after executing updateDynamoDB.py**
| cuisine (primary key) | Clothing | contents | Cook |
|---|---|---|---|
| Ground Meat Cutlet | Bread crumbs | Minced meat | mother |
| Fried Shrimp | Bread crumbs | shrimp | mother |
| Shrimp heaven | Tempura powder | shrimp | mother |
| Imoten | Tempura powder | sweet potato | mother |
You can now update DynamoDB by running the code manually. So how do you automate this? There are several methods, but this time I would like to introduce two of them.
One method is to launch a VM instance in the cloud with the AWS CLI and related tools set up. For the VM instance, I use the Google Compute Engine (GCE) free tier, which I covered in a previous article. Past article: Crawling Easy to Start in the Cloud
Put the above code (see "DynamoDB table update code (Python)") on the VM instance and run it periodically with cron.
Set up cron as follows with the `crontab -u <username> -e` command.
```
SHELL=/bin/bash
CRON_TZ="Japan"
00 00 01 * * python3 /<Any absolute path>/updateDynamoDB.py
```
Change the `00 00 01 * *` part to whatever time and frequency you want DynamoDB to be updated.
If you have a crawler like the one in the previous article, it is convenient to have its results reflected in DynamoDB as soon as the crawler finishes running, as sketched below.
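For example, a small wrapper script could run the crawler first and only update DynamoDB if scraping succeeds; cron would then call the wrapper instead of updateDynamoDB.py directly. This is only a sketch, and crawler.py is a hypothetical name for your scraping script:

runPipeline.py
```python
# Hypothetical wrapper: scrape first, then sync DynamoDB.
import subprocess
import sys

# check=True raises CalledProcessError if the crawler fails,
# so a broken or missing CSV never reaches the database.
subprocess.run([sys.executable, "crawler.py"], check=True)
subprocess.run([sys.executable, "updateDynamoDB.py"], check=True)
```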
By managing the CSV file on GitHub, you can have DynamoDB updated automatically whenever you git push. Specifically, this is done with a GitHub Actions workflow.
Create a `.github/workflows/` directory in your Git repository and define the workflow there in YAML format. The YAML file is written as follows, for example.
dynamoDB.yml
```yaml
name: Any Action name
on:
  push:
    branches:
      # Run only on changes on the master branch
      - 'master'
    paths:
      # Run only on changes to these files
      - '**/Fried.csv'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
        with:
          fetch-depth: 25
      - name: Python 3.8 settings
        uses: actions/setup-python@v2
        with:
          python-version: 3.8
      - name: Python library installation
        run: |
          python -m pip install --upgrade pip
          pip install boto3
      - name: Setting AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: Any region
      - name: DynamoDB update
        run: |
          python /<Any absolute path>/updateDynamoDB.py
```
With this, updateDynamoDB.py runs whenever a push to the master branch changes Fried.csv. Note that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be registered as secrets in the repository beforehand.
Regarding the YAML syntax, the following article was especially helpful alongside the official GitHub documentation. Thank you very much. Reference: Notes on how to use Github Actions. You can learn more about the "Configure AWS Credentials" action on its official GitHub Marketplace page (https://github.com/marketplace/actions/configure-aws-credentials-action-for-github-actions).
| | Run from VM (GCE) | GitHub Actions |
|---|---|---|
| File timestamps | Yes | None |
| Environment | Requires a GCE VM instance | Requires a GitHub repository and git settings |
| Highly flexible operation[^1] | Relatively easy (cloud services are also available) | Difficult |
| Trigger | Basically only what cron can cover | Pushes to the remote and pull request creation can be used as triggers |
| Fee | Free for a single f1-micro instance in US regions; other regions, instance types, or plans incur fees (e2-medium: $0.01005/hour) | Free for public repositories; for private repositories, fees apply once Actions usage exceeds 2,000 minutes/month ($0.008/minute on Linux) |
| VM specs | Selectable[^2] | Fixed[^3] |

[^1]: Special operations tend to be complicated. For example, if you want to output and browse intermediate files that you do not want to push to GitHub, you need to prepare separate storage for them or devise git commands.
As for my impression, running from a VM is suited to **tasks that you want to combine with other cloud services, or long-running, complicated tasks**, while GitHub Actions is suited to **tasks you want triggered by git updates, or simple tasks with short execution times**.
Please use the one that suits your purpose.
- SSH key registration on GitHub
- SSH communication settings on GitHub
- Install AWS CLI (https://qiita.com/yuyj109/items/3163a84480da4c8f402c)
- Initial setting memo from AWS CLI installation
- Notes on how to use Github Actions
- Configure AWS Credentials Action for GitHub Actions
- Web scraping to get started easily with Python
- Crawling to get started easily in the cloud
[^2]: Official documentation: Machine types | Compute Engine documentation | Google Cloud
[^3]: Standard_DS2_v2 (2 CPUs, 7 GB memory, 14 GB storage); macOS runners are hosted by MacStadium.