When you try to analyze access logs from servers such as Apache or Tomcat, there are surprisingly few good tools. I'd like one that supports various output formats, filters down to just the necessary information, and visualizes it clearly, but such a tool is hard to find.
So I tried Pandas and Matplotlib, the de facto standards in the data analysis world, to see how easily Apache access logs can be analyzed and visualized on Jupyter Notebook.
First, import the bare minimum: Pandas and Matplotlib.
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
Next, configure the environment to your liking.
# Render graphs inside the notebook
%matplotlib inline
# Raise the maximum displayed column width from the default 50 to 150 characters
pd.set_option("display.max_colwidth", 150)
For how to load the access log, I referred to this blog entry. Before loading it, define the parsing functions needed for the field types.
from datetime import datetime
import pytz

# Strip the surrounding quotes (or brackets) from a field
def parse_str(x):
    return x[1:-1]

# Parse an Apache timestamp, including its timezone offset, into an aware datetime
def parse_datetime(x):
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    dt_tz = int(x[-6:-3]) * 60 + int(x[-3:-1])
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
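As a quick check, these functions would handle fields like the following (sample values I made up):

parse_str('"GET / HTTP/1.1"')                   # -> 'GET / HTTP/1.1'
parse_datetime('[24/Mar/2020:18:20:12 +0900]')  # -> 2020-03-24 18:20:12+09:00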
Then load the access log with pd.read_csv(). The output format of my access log was slightly different from the blog's, so I adjusted the call as follows. For simplicity, only some of the columns are extracted this time.
df = pd.read_csv(
    '/var/log/httpd/access_log',
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 4, 6, 7, 8, 9, 10],
    names=['ip', 'time', 'response_time', 'request', 'status', 'size', 'user_agent'],
    converters={'time': parse_datetime,
                'response_time': int,
                'request': parse_str,
                'status': int,
                'size': int,
                'user_agent': parse_str})
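Before going further, it's worth confirming that the converters did their job (a quick sanity check I added):

# Column dtypes and a peek at the parsed timestamps
df.info()
df['time'].head()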
After loading the access log, let's analyze it.
First, check the first and last 5 rows.
df
You can see that there were 114,004 requests in about 1 hour and 20 minutes from 18:20 on March 24th.
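These figures can be derived directly from the DataFrame (the exact values will of course depend on your own log):

# Total number of requests and the time span they cover
print('requests: {}'.format(len(df)))
print('from {} to {}'.format(df['time'].min(), df['time'].max()))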
Let's look at summary statistics for the response time, such as the mean and the maximum.
print('Response time summary (microseconds)\n')
print('Minimum: {}'.format(str(df['response_time'].min()).rjust(10)))
print('Mean:    {}'.format(str(round(df['response_time'].mean())).rjust(10)))
print('Median:  {}'.format(str(round(df['response_time'].median())).rjust(10)))
print('Maximum: {}'.format(str(df['response_time'].max()).rjust(10)))

print('\nWorst 15 response times')
df.sort_values('response_time', ascending=False).head(15)
The 15 worst response times were all requests to the OpenAM authentication service, and each took about 15 seconds before returning a response with HTTP status 406. Clearly there was a problem in the application working with OpenAM.
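Incidentally, Pandas can produce much the same summary in a single call:

# One-line alternative to the prints above
df['response_time'].describe()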
Looking at the worst 15 response times, `size` (response size) and `user_agent` (user agent) are `NaN`. Let's check how many missing values there are.
df.isnull().sum()
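To see shares instead of raw counts, the mean of the boolean mask works nicely (my addition):

# Percentage of missing values per column
df.isnull().mean() * 100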
A `NaN` in `size` means the response was a redirect. Also, about 30% of the `user_agent` values are missing. Since it's not only end users' browsers that access OpenAM, it's likely that many clients send no "User-Agent" header at all. Let's find out what user agents do appear.
# Tally requests per unique user-agent string
ua_df = DataFrame(df.groupby(['user_agent']).size().index)
ua_df['count'] = df.groupby(['user_agent']).size().values
ua_df
There were 490 distinct user agents. This is presumably because many applications, not just end users' browsers, talk to the service.
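As an aside, value_counts() builds much the same table more directly (an equivalent, slightly more idiomatic form):

# Equivalent to the groupby above, already sorted by count
df['user_agent'].value_counts().rename_axis('user_agent').reset_index(name='count')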
Nothing has been visualized so far, which isn't very interesting. Let's draw some pie charts. First, the breakdown of response status codes.
plt.figure(figsize=(5, 5))
labels = ['2xx', '3xx', '4xx', '5xx']
status_counts = df.groupby([df['status'] // 100]).size()
plt.pie(status_counts, autopct='%1.1f%%', labels=labels, shadow=True, startangle=90)
plt.axis('equal')
plt.show()
Hmm, the labels for the low-percentage status codes overlap and are hard to read...
So let's define functions that lump minor elements (1% or less) into an "others" bucket. Matplotlib may have a built-in way to do this, but I couldn't find one, so I wrote simple versions myself.
# For a DataFrame: keep elements above 1% and merge the rest into 'others'
def replace_df_minors_with_others(df_before, column_name):
    elm_num = 1
    for index, row in df_before.sort_values([column_name], ascending=False).iterrows():
        if (row[column_name] / df_before[column_name].sum()) > 0.01:
            elm_num = elm_num + 1
    df_after = df_before.sort_values([column_name], ascending=False).nlargest(elm_num, columns=column_name)
    df_after.loc[len(df_after)] = ['others', df_before.drop(df_after.index)[column_name].sum()]
    return df_after
# For a dictionary: keep entries above 1% and merge the rest into 'others'
def replace_dict_minors_with_others(dict_before):
    dict_after = {}
    others = 0
    total = sum(dict_before.values())
    for key in dict_before.keys():
        if (dict_before.get(key) / total) > 0.01:
            dict_after[key] = dict_before.get(key)
        else:
            others = others + dict_before.get(key)
    # Sort the survivors by value in descending order, then append the merged rest
    dict_after = {k: v for k, v in sorted(dict_after.items(), reverse=True, key=lambda item: item[1])}
    dict_after['others'] = others
    return dict_after
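A tiny example of what the dictionary version does (made-up numbers; 'c' and 'd' fall below the 1% threshold and get merged):

replace_dict_minors_with_others({'a': 600, 'b': 390, 'c': 6, 'd': 4})
# -> {'a': 600, 'b': 390, 'others': 10}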
Now let's use this function to display the user agent types in a pie chart.
plt.figure(figsize=(15, 10))
ua_df_with_others = replace_df_minors_with_others(ua_df, 'count')
plt.pie(ua_df_with_others['count'], labels=ua_df_with_others['user_agent'], autopct='%1.1f%%', shadow=True, startangle=90)
plt.title('User Agent')
plt.show()
Displaying the raw "User-Agent" header directly is hard to interpret. What on earth is "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"...
So let's convert it into a friendlier form using a library called Woothee. Install it with the following command.
$ pip install woothee
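Woothee parses a raw user-agent string into a dictionary with keys such as name, version, os, and os_version. Roughly like this (output abridged):

import woothee
woothee.parse('Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko')
# -> {'name': 'Internet Explorer', 'version': '11.0', 'os': 'Windows 7', ...}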
Let's use this to display the client OS and user agent in a pie chart.
import woothee

ua_counter = {}
os_counter = {}

# Parse each unique user-agent string and tally the parsed browser and OS names
for index, row in ua_df.sort_values(['count'], ascending=False).iterrows():
    ua = woothee.parse(row['user_agent'])
    uaKey = ua.get('name') + ' (' + ua.get('version') + ')'
    if not uaKey in ua_counter:
        ua_counter[uaKey] = 0
    ua_counter[uaKey] = ua_counter[uaKey] + 1
    osKey = ua.get('os') + ' (' + ua.get('os_version') + ')'
    if not osKey in os_counter:
        os_counter[osKey] = 0
    os_counter[osKey] = os_counter[osKey] + 1
plt.figure(figsize=(15, 10))
plt.subplot(1, 2, 1)
plt.title('Client OS')
os_counter_with_others = replace_dict_minors_with_others(os_counter)
plt.pie(os_counter_with_others.values(), labels=os_counter_with_others.keys(), autopct='%1.1f%%', shadow=True, startangle=90)
plt.subplot(1, 2, 2)
plt.title('User Agent')
ua_counter_with_others = replace_dict_minors_with_others(ua_counter)
plt.pie(ua_counter_with_others.values(), labels=ua_counter_with_others.keys(), autopct='%1.1f%%', shadow=True, startangle=90)
plt.show()
"UNKNOWN" has increased, but has it become easier to understand? Even so, many people still use IE on Windows ...
Next, let's look at the breakdown of the status codes (400 and above) of the error responses.
error_df = df[df['status'] >= 400]
plt.figure(figsize=(10, 10))
# Outer ring: 4xx vs. 5xx
labels = ['4xx', '5xx']
plt.pie(error_df.groupby([error_df['status'] // 100]).count().time, labels=labels, counterclock=False, startangle=90)
# Inner ring: individual status codes (hard-coded to the codes present in this log)
labels2 = ['400', '401', '403', '404', '406', '500', '501', '502']
plt.pie(error_df.groupby(['status']).count().time, labels=labels2, counterclock=False, startangle=90, radius=0.7)
# Punch a hole in the middle to make it a donut chart
centre_circle = plt.Circle((0, 0), 0.4, fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.title('Error Status Code')
plt.show()
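Note that labels2 is hard-coded, so it has to match the status codes that actually occur in the log, in ascending order. A quick way to double-check (my addition):

# The labels above must line up with these codes
error_df.groupby(['status']).size()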
What stands out in this access log is the large number of error responses with status code 406, a code you rarely see elsewhere.
Now let's draw a different kind of graph. First, check the load on the server.
plt.figure(figsize=(15, 5))
access = df['request']
access.index = df['time']
# Count requests per second
access = access.resample('S').count()
access.index.name = 'Time'
access.plot()
plt.title('Total Access')
plt.ylabel('Access')
plt.show()
There is constant traffic, at times exceeding 100 requests per second.
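If the per-second plot looks too spiky, the same series can be rolled up per minute (a variation I added):

# Aggregate the per-second counts into per-minute totals for a smoother curve
access.resample('T').sum().plot()
plt.show()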
Let's see whether there is any relationship between response size and response time.
plt.figure(figsize=(15, 5))
plt.title('size v.s. response_time')
# Convert size to KB and response time (microseconds) to seconds
plt.scatter(df['size'] / 1000, df['response_time'] / 1000000, marker='.')
plt.xlabel('Size (KB)')
plt.ylabel('Response Time (s)')
plt.grid()
plt.show()
There is no clear relationship; at least for responses with a non-zero size, it cannot be said that response time scales with size.
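To put a number on it, we can check the correlation coefficient (my addition):

# Pearson correlation between response size and response time
df[['size', 'response_time']].corr()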
Pandas and Matplotlib turned out to be quite capable of analyzing access logs. If you master these libraries, you should be able to put them to full use in troubleshooting.
It's still at the prototype stage, but I've also committed the notebook to GitHub.