The first step to log analysis (how to format and put log data in Pandas)

Background

Currently, I am involved in app log analysis as an intern at EXIDEA Co., Ltd., which develops SEO writing tools. So I put a lot of log data into Pandas dataframes with jupyter notebook. However, I noticed that there was no article that simply wrote about the method. No matter how much you want to analyze, if you don't put the log data in pandas, nothing will start. So, this time, I will actually explain using raw log data. Let's take a look!

Method (2 steps)

・ Collect the information you want with commands into a text file -Store the text file in the data frame with pd.read_csv ()

Log data used this time

As a sample, we will use the Nginx access log.

172.17.x.xxx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 200 5032 "http://example.net/" "Mozilla/5.0 (iPhone; CPU iPhone OS 13_5 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) GSA/112.0.316532311 Mobile/15E148 Safari/604.1" "203.0.113.195"
172.17.x.xx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 304 0 "http://example.net/" "Mozilla/5.0 (iPhone; CPU iPhone OS 12_3_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148 YJApp-IOS jp.co.yahoo.ipn.appli/4.16.14" "203.0.113.195"
172.17.x.xxx - - [23/Jun/2020:06:25:18 +0900] "GET /xxxxx.js HTTP/1.1" 304 0 "http://example.net/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36" "203.0.113.195"

step 1

Collect the information you want with the command in a text file

This operation mainly uses the sed and awk commands. As a flow,

  1. Use the sed command to preprocess log data (replace)
  2. Extract the desired fields with the awk command

Handling of sed commands

test.txt


WhiskyWhiskyWhisky
Basic grammar
$ sed 's/Substitution source/After replacement/g' 
ex)
$sed 's/Whisky/Beer/g' test.txt
>>>BeerBeerBeer

This will format the log data by replacing unnecessary characters with whitespace characters. In this case, [] and "" will get in the way when you put them in a data frame with Pandas, so process them in advance.

awk command processing

test.txt


apple orange grape banana
Basic grammar
$ awk '{print desired field}' 
ex)#I want the 1st and 3rd rows
$ awk '{print $1,$3}' test.txt
>>> apple grape 

This time, I want the IP address, time, request, path, status code, referer. Extract columns 1, 4, 6, 7, and 11.

The actual command this time

The following code summarizes the command processing.

cat access.log | sed 's/\[//g' -e  's/\]//g' -e 's/"//g' | awk '{print $1,$4,$6,$7,$11}' > test.txt

flow

-First, open access.log with the cat command. (If you want to execute each command at once, connect with |.) -After that, remove [] and "" with the sed command. (The sed command can be replaced continuously by writing -e) -Next, extract the fields you want with the awk command -Finally, convert those transformed access.logs to test.txt

Command execution result

172.17.x.xxx 23/Jun/2020:06:25:18 GET /xxxxx.js 200 http://example.net/
172.17.x.xx 23/Jun/2020:06:25:18 GET /xxxxx.js 304 http://example.net/
172.17.x.xxx 23/Jun/2020:06:25:18 GET /xxxxx.js 304 http://example.net/

Step 2

Store the text file in the data frame with pd.read_csv ()

By the processing so far, it became a text file containing only the information for which you want log data. From here it ends in one shot.

import pandas as pd
columns=["IP","Datetime","method","URI","status","referer"]
df = pd.read_csv('test.txt',delimiter=' ',names=columns) #Delimiter is blank

The result is here. スクリーンショット 2020-07-26 17.49.09.jpg

After this, you can perform time series analysis by performing further preprocessing.

Finally

The method introduced in this article is the one that I personally found the easiest to do. So, if there is an easier way, I would appreciate it if you could let me know in the comments.

Recommended Posts

The first step to log analysis (how to format and put log data in Pandas)
How to return the data contained in django model in json format and map it on leaflet
I'm addicted to the difference in how Flask and Django receive JSON data
How to split machine learning training data into objective variables and others in Pandas
[Pandas] If the first row data is in the header in DataFrame
Put the lists together in pandas to make a DataFrame
How to log in automatically like 1Password from the CLI
How to get all the keys and values in the dictionary
How to get an overview of your data in Pandas
How to create dataframes and mess with elements in pandas
Data science companion in python, how to specify elements in pandas
How to log in to AtCoder with Python and submit automatically
The minimum methods to remember when aggregating data in Pandas
How to plot the distribution of bacterial composition from Qiime2 analysis data in a box plot
How to write soberly in pandas
Try to put data in MongoDB
The first step in Python Matplotlib
How to log in to Docker + NGINX
Determine the date and time format in Python and convert to Unixtime
How to give and what the constraints option in scipy.optimize.minimize is
[Python] How to get the first and last days of the month
How to format a table using Pandas apply, pivot and swaplevel
[ROS2] How to describe remap and parameter in python format launch
How to visualize where misclassification is occurring in data analysis classification
Format the Git log and get the committed file name in csv format
How to display bytes in the same way in Java and Python
[RHEL7 / CentOS7] I put in the log monitoring tool swatch and tried to notify by email
How to divide and process a data frame using the groupby function
How to write the correct shebang in Perl, Python and Ruby scripts
[Itertools.permutations] How to put permutations in Python
How to get the date and time difference in seconds with python
How to reassign index in pandas dataframe
How to put a half-width space before letters and numbers in Python.
Data analysis: Easily apply descriptive and inference statistics to CSV data in Python
How to convert DateTimeField format in Django
How to copy and paste the contents of a sheet in Google Spreadsheet in JSON format (using Google Colab)
How to read CSV files in Pandas
How to use is and == in Python
How to use pandas Timestamp and date_range
(Diary 1) How to create, reference, and register data in the SQL database of Microsoft Azure service with python
I want to visualize where and how many people are in the factory
Data analysis in Python Summary of sources to look at first for beginners
How to pass the path to the library built with pyenv and virtualenv in PyCharm
How to count the number of elements in Django and output to a template
Why Docker is so popular. What is Docker in the first place? How to use
How to replace with Pandas DataFrame, which is useful for data analysis (easy)
[Pandas] How to check duplicates and delete duplicates in a table (equivalent to deleting duplicates in Excel)
How to make the font width of jupyter notebook put in pyenv equal width
Scraping in Python-Introduction to Scrapy First 2nd Step
How to use the C library in Python
Log in to the remote server with SSH
How to use MkDocs for the first time
How to generate permutations in Python and C ++
[Python] How to change the date format (display format)
12. Save the first column in col1.txt and the second column in col2.txt
The first step in the constraint satisfaction problem in Python
[Python] How to read data from CIFAR-10 and CIFAR-100
How to use data analysis tools for beginners
[Introduction to Python] How to handle JSON format data
How to get the files in the [Python] folder
How to read time series data in PyTorch