Get and parse S3 Logging logs with Lambda event notification and load them into BigQuery

Overview

S3's Logging feature writes a log of the requests made to a bucket into a bucket you specify; when you use Static Website Hosting, it serves as the access log. This post sets an update notification (event notification) on the target bucket so that each new log object starts a Lambda function, which gets the log, parses it, and loads it into BigQuery for analysis.

Preparation

Lambda runtime environment

- Runtime: Python 2.7
- Required modules (versions actually used in parentheses):
  - boto3 (Lambda default)
  - pytz (2015.7)
  - gcloud (0.8.0)

S3 Logging log format and BigQuery schema

Log format

The log format is documented here: https://docs.aws.amazon.com/AmazonS3/latest/dev/LogFormat.html

Regular expression for parsing logs

Each named group corresponds directly to a column name in the schema that follows.

^(?P<owner>[^ ]+) (?P<bucket>[^ ]+) \[(?P<datetime>.+)\] (?P<remote_ip>[^ ]+) (?P<requester>[^ ]+) (?P<request_id>[^ ]+) (?P<operation>[^ ]+) (?P<key>[^ ]+) "(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)" (?P<status>[^ ]+) (?P<error>[^ ]+) (?P<bytes>[^ ]+) (?P<size>[^ ]+) (?P<total_time>[^ ]+) (?P<ta_time>[^ ]+) "(?P<referrer>.+)" "(?P<user_agent>.+)" (?P<version>.+)$
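
As a quick check of the expression, here is a minimal sketch that applies it to a single line. The sample line is made up for illustration (the owner ID, bucket name, and request ID are hypothetical), and the pattern is written with raw strings so that it behaves the same on Python 2.7 and Python 3.

```python
import re

# Same expression as above, split into one piece per log field.
LOG_PATTERN = re.compile(' '.join([
    r'^(?P<owner>[^ ]+)',
    r'(?P<bucket>[^ ]+)',
    r'\[(?P<datetime>.+)\]',
    r'(?P<remote_ip>[^ ]+)',
    r'(?P<requester>[^ ]+)',
    r'(?P<request_id>[^ ]+)',
    r'(?P<operation>[^ ]+)',
    r'(?P<key>[^ ]+)',
    r'"(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)"',
    r'(?P<status>[^ ]+)',
    r'(?P<error>[^ ]+)',
    r'(?P<bytes>[^ ]+)',
    r'(?P<size>[^ ]+)',
    r'(?P<total_time>[^ ]+)',
    r'(?P<ta_time>[^ ]+)',
    r'"(?P<referrer>.+)"',
    r'"(?P<user_agent>.+)"',
    r'(?P<version>.+)$',
]))

# Hypothetical log line, shaped like the documented format.
SAMPLE_LINE = (
    'OWNERID examplebucket [26/Nov/2015:01:02:03 +0000] 192.0.2.3 - '
    '3E57427F3EXAMPLE WEBSITE.GET.OBJECT index.html '
    '"GET /index.html HTTP/1.1" 200 - 1024 1024 15 14 '
    '"-" "curl/7.43.0" -'
)

match = LOG_PATTERN.match(SAMPLE_LINE)
# groupdict() maps each group name (and so each column name) to its value.
fields = match.groupdict() if match else {}
```

Fields with no data come back as the string `-`, which matters for the integer columns later on.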

Schema

The schema follows the log format. The datetime column would be better as a TIMESTAMP type, but I chose STRING this time for reasons explained later.

Column name Type
owner STRING
bucket STRING
datetime STRING
remote_ip STRING
requester STRING
request_id STRING
operation STRING
key STRING
method STRING
uri STRING
proto STRING
status STRING
error STRING
bytes INTEGER
size INTEGER
total_time INTEGER
ta_time INTEGER
referrer STRING
user_agent STRING
version STRING
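
If you prefer to create the table from a schema file, the table above can be generated rather than typed by hand. This is only a sketch: it assumes the JSON schema layout (a list of objects with `name`, `type`, and `mode`) accepted by the `bq` command-line tool, and the `NULLABLE` mode is my assumption, chosen because missing values are stored as NULL in the integer columns.

```python
import json

# Column names in log order; they match the regular expression's group names.
COLUMNS = [
    'owner', 'bucket', 'datetime', 'remote_ip', 'requester', 'request_id',
    'operation', 'key', 'method', 'uri', 'proto', 'status', 'error',
    'bytes', 'size', 'total_time', 'ta_time', 'referrer', 'user_agent',
    'version',
]
INTEGER_COLUMNS = {'bytes', 'size', 'total_time', 'ta_time'}

schema = [
    {
        'name': name,
        'type': 'INTEGER' if name in INTEGER_COLUMNS else 'STRING',
        'mode': 'NULLABLE',  # assumption: '-' values are stored as NULL
    }
    for name in COLUMNS
]

# Text that could be saved as a schema file.
schema_json = json.dumps(schema, indent=2)
```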

Source

Replace the <your-*> placeholders as appropriate.

import os
import json
import urllib
import boto3
import re
import datetime
import pytz
from gcloud import bigquery

BQ_PROJECT = '<your-project-id>'
BQ_DATASET = '<your-dataset-name>'
BQ_TABLE = '<your-table-name>'

s3 = boto3.client('s3')
bq = bigquery.Client.from_service_account_json(
    os.path.join(os.path.dirname(__file__), 'bq.json'),
    project=BQ_PROJECT)
dataset = bq.dataset(BQ_DATASET)
table = dataset.table(name=BQ_TABLE)
table.reload()

pattern = ' '.join([
    '^(?P<owner>[^ ]+)',
    '(?P<bucket>[^ ]+)',
    r'\[(?P<datetime>.+)\]',
    '(?P<remote_ip>[^ ]+)',
    '(?P<requester>[^ ]+)',
    '(?P<request_id>[^ ]+)',
    '(?P<operation>[^ ]+)',
    '(?P<key>[^ ]+)',
    '"(?P<method>[^ ]+) (?P<uri>[^ ]+) (?P<proto>.+)"',
    '(?P<status>[^ ]+)',
    '(?P<error>[^ ]+)',
    '(?P<bytes>[^ ]+)',
    '(?P<size>[^ ]+)',
    '(?P<total_time>[^ ]+)',
    '(?P<ta_time>[^ ]+)',
    '"(?P<referrer>.+)"',
    '"(?P<user_agent>.+)"',
    '(?P<version>.+)$'])
log_pattern = re.compile(pattern)

def to_int(val):
    try:
        ret = int(val)
    except ValueError:
        ret = None
    return ret

def lambda_handler(event, context):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key']).decode('utf8')
    res = s3.get_object(Bucket=bucket, Key=key)
    body = res['Body'].read()
    rows = []

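    for line in body.splitlines():
        matches = log_pattern.match(line)
        if matches is None:
            # Skip lines that do not fit the expected log format.
            continue
        # Drop the UTC offset: strptime in Python 2.7 cannot parse %z.
        dt_str = matches.group('datetime').split(' ')[0]
        timestamp = datetime.datetime.strptime(
            dt_str, '%d/%b/%Y:%H:%M:%S').replace(tzinfo=pytz.utc)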

        rows.append((
            matches.group('owner'),
            matches.group('bucket'),
            timestamp.strftime('%Y-%m-%d %H:%M:%S'),
            matches.group('remote_ip'),
            matches.group('requester'),
            matches.group('request_id'),
            matches.group('operation'),
            matches.group('key'),
            matches.group('method'),
            matches.group('uri'),
            matches.group('proto'),
            matches.group('status'),
            matches.group('error'),
            to_int(matches.group('bytes')),
            to_int(matches.group('size')),
            to_int(matches.group('total_time')),
            to_int(matches.group('ta_time')),
            matches.group('referrer'),
            matches.group('user_agent'),
            matches.group('version'),))
    print(table.insert_data(rows))
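
The handler above reads only the first record of the event. As a variation (not what the code above does), the sketch below collects the bucket and key of every record, which can be tried locally without AWS. The event dictionary is a hypothetical, heavily trimmed payload, and the import fallback is there only so the snippet runs on both Python 2.7 and Python 3.

```python
try:
    from urllib import unquote_plus        # Python 2.7, as in the article
except ImportError:
    from urllib.parse import unquote_plus  # Python 3


def extract_objects(event):
    """Return a (bucket, key) pair for every record in an S3 event."""
    pairs = []
    for record in event.get('Records', []):
        s3_info = record['s3']
        # Object keys arrive URL-encoded in the event payload.
        key = unquote_plus(s3_info['object']['key'])
        pairs.append((s3_info['bucket']['name'], key))
    return pairs


# Hypothetical event, reduced to the fields the handler actually reads.
sample_event = {
    'Records': [
        {'s3': {'bucket': {'name': 'my-log-bucket'},
                'object': {'key': 'logs/2015-11-26-01-02-03-ABC%2B123'}}},
    ]
}

pairs = extract_objects(sample_event)
```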

Notes and caveats

- Get JSON-formatted credentials from GCP's API Manager and include them in your Lambda function deployment package under the name bq.json. [^1]
- The datetime part should be parseable with %d/%b/%Y:%H:%M:%S %z, but because of a bug in Python < 3.2, using %z raises an error. Since %z is not available, the code takes an irregular approach involving pytz. [^2]
- Strictly speaking, the right approach is to pass a datetime object to a TIMESTAMP column, but, possibly because of a bug in gcloud (0.8.0), that did not work. BigQuery expects seconds, yet the value seems to be sent in milliseconds, which results in an impossible date. This is why the datetime column is STRING.
- Each field in the log is - when there is no data to output. Fields that you want to convert to int will raise an exception if converted as-is, as in int('-').
- The official documentation of the Google Cloud Client for Python (gcloud) is outdated in places, so I had to implement this while reading the source (GoogleCloudPlatform/gcloud-python) (as of November 26, 2015).
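
The datetime and integer caveats can be sketched with the standard library alone. Note that this is not the pytz approach used in the source above: it is a swapped-in alternative that reads the UTC offset by hand, and the function names `parse_s3_datetime` and `to_epoch_seconds` are hypothetical. Whether sending epoch seconds avoids the TIMESTAMP problem in gcloud 0.8.0 is untested here.

```python
import calendar
import datetime


def parse_s3_datetime(raw):
    """Parse e.g. '26/Nov/2015:01:02:03 +0000' into a naive UTC datetime."""
    stamp, offset = raw.split(' ')
    local = datetime.datetime.strptime(stamp, '%d/%b/%Y:%H:%M:%S')
    # Handle the offset manually because %z fails in Python 2.7's strptime.
    sign = -1 if offset[0] == '-' else 1
    delta = datetime.timedelta(hours=int(offset[1:3]),
                               minutes=int(offset[3:5]))
    return local - sign * delta


def to_epoch_seconds(utc_dt):
    """Convert a naive UTC datetime to whole seconds since the epoch."""
    return calendar.timegm(utc_dt.timetuple())


def to_int(val):
    """Return int(val), or None for placeholders such as '-'."""
    try:
        return int(val)
    except ValueError:
        return None


parsed = parse_s3_datetime('26/Nov/2015:01:02:03 +0000')
missing = to_int('-')       # None instead of a ValueError
present = to_int('1024')
```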

Incidentally, I deploy Python Lambda functions with the following tool, which I wrote myself. It is still under development, but it has become convenient enough for everyday use, so please try it if you like. https://github.com/marcy-terui/lamvery

[^1]: I am looking for a smarter way to pass confidential information to Lambda.
[^2]: I would like Python 3 support as soon as possible.
