I am currently working on app log analysis as an intern at EXIDEA Co., Ltd., which develops SEO writing tools, so I load a lot of log data into Pandas dataframes in Jupyter Notebook. However, I noticed there was no article that plainly explained how to do this. No matter how much you want to analyze, nothing starts until the log data is in pandas. So this time I will walk through the method using raw log data. Let's take a look!
・Collect the information you want into a text file with commands
・Load the text file into a data frame with pd.read_csv()
As a sample, we will use an Nginx access log.
172.17.x.xxx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 200 5032 "http://example.net/" "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/112.0.316532311 Mobile/15E148 Safari/604.1" "203.0.113.195"
172.17.x.xx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 304 0 "http://example.net/" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 YJApp-IOS jp.co.yahoo.ipn.appli/4.16.14" "203.0.113.195"
172.17.x.xxx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 304 0 "http://example.net/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36" "203.0.113.195"
This operation mainly uses the sed and awk commands, following the two steps above. First, the sed command, which substitutes text.
test.txt
WhiskyWhiskyWhisky
Basic grammar
$ sed 's/pattern/replacement/g' file
ex)
$ sed 's/Whisky/Beer/g' test.txt
>>> BeerBeerBeer
This substitution formats the log data by deleting unnecessary characters (replacing them with the empty string). In our case, the [] and "" will get in the way when loading into a data frame with Pandas, so we strip them out in advance.
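For example, running the same kind of substitution over the access log looks like this (a sketch that assumes access.log is in the current directory; the output line is truncated here):
ex)
$ sed -e 's/\[//g' -e 's/\]//g' -e 's/"//g' access.log | head -n 1
>>> 172.17.x.xxx - - 23/Jun/2020:06:25:18 +0900 GET /xxxxx.js HTTP/1.1 200 5032 http://example.net/ Mozilla/5.0 (iPhone; ...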
Next is the awk command, which extracts the fields (columns) you want.
test.txt
apple orange grape banana
Basic grammar
$ awk '{print $n}' #prints the n-th whitespace-separated field
ex) #I want the 1st and 3rd fields
$ awk '{print $1,$3}' test.txt
>>> apple grape
This time, I want the IP address, time, request method, path, status code, and referer, so I will extract fields 1, 4, 6, 7, 9, and 11.
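For reference, here is how the fields of the first sample line are numbered once the [] and "" are removed (awk numbers whitespace-separated fields as $1, $2, ...; everything from $12 onward is the user agent):
$1  172.17.x.xxx          <- IP address
$2  -
$3  -
$4  23/Jun/2020:06:25:18  <- time
$5  +0900
$6  GET                   <- request method
$7  /xxxxx.js             <- path
$8  HTTP/1.1
$9  200                   <- status code
$10 5032                  <- response size (bytes)
$11 http://example.net/   <- referer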
The following one-liner puts all of this command processing together.
cat access.log | sed -e 's/\[//g' -e 's/\]//g' -e 's/"//g' | awk '{print $1,$4,$6,$7,$9,$11}' > test.txt
-First, print access.log with the cat command. (Connecting commands with | passes the output of one to the next, so everything runs as a single pipeline.)
-Next, remove the [] and "" with the sed command. (Writing -e before each expression lets sed apply several substitutions in a row.)
-Then, extract the fields you want with the awk command.
-Finally, redirect the transformed output to test.txt with >. (You can check the result with head, as shown below.)
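Peeking at the first few lines of test.txt confirms the transformation (head is a standard command that prints the first lines of a file):
$ head -n 3 test.txt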
172.17.x.xxx 23/Jun/2020:06:25:18 GET /xxxxx.js 200 http://example.net/
172.17.x.xx 23/Jun/2020:06:25:18 GET /xxxxx.js 304 http://example.net/
172.17.x.xxx 23/Jun/2020:06:25:18 GET /xxxxx.js 304 http://example.net/
The processing so far has left us with a text file containing only the information we want from the log data. From here, one line finishes the job.
import pandas as pd
columns = ["IP", "Datetime", "method", "URI", "status", "referer"]
df = pd.read_csv('test.txt', delimiter=' ', names=columns) #the delimiter is a single space
The result: each log line becomes one row of the data frame, with the six column names above.
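If you want to check it in the notebook, plain pandas calls will do (nothing here is specific to this data):
print(df.head()) #the first rows of the data frame
print(df.dtypes) #status is read as int64; the other columns come in as object (strings)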
After this, a little more preprocessing lets you move on to things like time series analysis.
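For example, the Datetime column can be parsed into real timestamps with pd.to_datetime. Below is a minimal sketch: the format string matches the 23/Jun/2020:06:25:18 layout above, and the per-minute request count is just one illustration of what a DatetimeIndex enables.
import pandas as pd

columns = ["IP", "Datetime", "method", "URI", "status", "referer"]
df = pd.read_csv('test.txt', delimiter=' ', names=columns)

#Parse strings like 23/Jun/2020:06:25:18 into pandas timestamps
df["Datetime"] = pd.to_datetime(df["Datetime"], format="%d/%b/%Y:%H:%M:%S")

#With a DatetimeIndex you can resample, e.g. count requests per minute
print(df.set_index("Datetime").resample("1min")["URI"].count())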
The method introduced in this article is simply the one I personally found easiest, so if you know of an easier way, I would appreciate it if you could let me know in the comments.