I am currently working on app log analysis as an intern at EXIDEA Co., Ltd., which develops SEO writing tools, so I load a lot of log data into Pandas dataframes in Jupyter Notebook. However, I noticed there was no article that plainly explained how to do this. No matter how much you want to analyze, nothing starts until the log data is in pandas. So this time I will walk through the method using raw log data. Let's take a look!
・Collect the information you want into a text file with commands
・Load the text file into a data frame with pd.read_csv()
As a sample, we will use an Nginx access log.
172.17.x.xxx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 200 5032 "http://example.net/" "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/112.0.316532311 Mobile/15E148 Safari/604.1" "203.0.113.195"
172.17.x.xx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 304 0 "http://example.net/" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 YJApp-IOS jp.co.yahoo.ipn.appli/4.16.14" "203.0.113.195"
172.17.x.xxx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 304 0 "http://example.net/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36" "203.0.113.195"
This operation mainly uses the sed and awk commands, following the two steps above. First, the sed command, which substitutes text.
test.txt
WhiskyWhiskyWhisky
Basic grammar
$ sed 's/pattern/replacement/g' file
ex)
$ sed 's/Whisky/Beer/g' test.txt
>>> BeerBeerBeer
This substitution formats the log data by deleting unnecessary characters (replacing them with the empty string). In our case, the [] and "" will get in the way when loading into a data frame with Pandas, so we strip them out in advance.
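For example, running the same kind of substitution over the access log looks like this (a sketch that assumes access.log is in the current directory; the output line is truncated here):
ex)
$ sed -e 's/\[//g' -e 's/\]//g' -e 's/"//g' access.log | head -n 1
>>> 172.17.x.xxx - - 23/Jun/2020:06:25:18 +0900 GET /xxxxx.js HTTP/1.1 200 5032 http://example.net/ Mozilla/5.0 (iPhone; ...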
Next is the awk command, which extracts the fields (columns) you want.
test.txt
apple orange grape banana
Basic grammar
$ awk '{print $n}' #prints the n-th whitespace-separated field
ex) #I want the 1st and 3rd fields
$ awk '{print $1,$3}' test.txt
>>> apple grape
This time, I want the IP address, time, request method, path, status code, and referer, so I will extract fields 1, 4, 6, 7, 9, and 11.
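For reference, here is how the fields of the first sample line are numbered once the [] and "" are removed (awk numbers whitespace-separated fields as $1, $2, ...; everything from $12 onward is the user agent):
$1  172.17.x.xxx          <- IP address
$2  -
$3  -
$4  23/Jun/2020:06:25:18  <- time
$5  +0900
$6  GET                   <- request method
$7  /xxxxx.js             <- path
$8  HTTP/1.1
$9  200                   <- status code
$10 5032                  <- response size (bytes)
$11 http://example.net/   <- referer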
The following one-liner puts all of this command processing together.
cat access.log | sed -e 's/\[//g' -e 's/\]//g' -e 's/"//g' | awk '{print $1,$4,$6,$7,$9,$11}' > test.txt
-First, print access.log with the cat command. (Connecting commands with | passes the output of one to the next, so everything runs as a single pipeline.)
-Next, remove the [] and "" with the sed command. (Writing -e before each expression lets sed apply several substitutions in a row.)
-Then, extract the fields you want with the awk command.
-Finally, redirect the transformed output to test.txt with >. (You can check the result with head, as shown below.)
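Peeking at the first few lines of test.txt confirms the transformation (head is a standard command that prints the first lines of a file):
$ head -n 3 test.txt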
172.17.x.xxx 23/Jun/2020:06:25:18 GET /xxxxx.js 200 http://example.net/
172.17.x.xx 23/Jun/2020:06:25:18 GET /xxxxx.js 304 http://example.net/
172.17.x.xxx 23/Jun/2020:06:25:18 GET /xxxxx.js 304 http://example.net/
The processing so far has left us with a text file containing only the information we want from the log data. From here, one line finishes the job.
import pandas as pd
columns = ["IP", "Datetime", "method", "URI", "status", "referer"]
df = pd.read_csv('test.txt', delimiter=' ', names=columns) #the delimiter is a single space
The result: each log line becomes one row of the data frame, with the six column names above.
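If you want to check it in the notebook, plain pandas calls will do (nothing here is specific to this data):
print(df.head()) #the first rows of the data frame
print(df.dtypes) #status is read as int64; the other columns come in as object (strings)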
After this, a little more preprocessing lets you move on to things like time series analysis.
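For example, the Datetime column can be parsed into real timestamps with pd.to_datetime. Below is a minimal sketch: the format string matches the 23/Jun/2020:06:25:18 layout above, and the per-minute request count is just one illustration of what a DatetimeIndex enables.
import pandas as pd

columns = ["IP", "Datetime", "method", "URI", "status", "referer"]
df = pd.read_csv('test.txt', delimiter=' ', names=columns)

#Parse strings like 23/Jun/2020:06:25:18 into pandas timestamps
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%d/%b/%Y:%H:%M:%S")

#With a DatetimeIndex you can resample, e.g. count requests per minute
print(df.set_index("Datetime").resample("1min")["URI"].count())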
The method introduced in this article is simply the one I personally found easiest, so if you know of an easier way, I would appreciate it if you could let me know in the comments.