How do you get a bird's-eye view of data that is simple but very large, such as a large volume of log data?
First, consider the following.
Scale levels were explained in "Understanding the types of data and the beginning of linear regression".
For example, time is a kind of continuous data, but by choosing an interval and taking one value per interval, such as every minute or every hour, you can visualize how a quantity changes over time. This requires converting the scale level. In addition, nondimensionalization converts quantities such as ratios and rates into numbers without units (dimensionless numbers), so that values measured against different criteria are easier to compare. Normalization means transforming and processing data according to a fixed rule. For example, by rescaling data so that its mean is 0 and its variance is 1, you can compare numbers on different axes.
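As a minimal sketch of these two ideas, the snippet below first converts 10-second readings into per-minute means (a scale-level conversion) and then standardizes two series measured in different units to mean 0 and variance 1. The sample values, the variable names, and the units are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical readings taken every 10 seconds for 5 minutes (values 0..29)
index = pd.date_range("2024-01-01 00:00:00", periods=30, freq="10s")
readings = pd.Series(np.arange(30, dtype=float), index=index)

# Scale-level conversion: keep one value (the mean) per 1-minute interval
per_minute = readings.resample("1min").mean()


def standardize(values):
    """Rescale to mean 0 and variance 1 (population variance, ddof=0)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()


# Two hypothetical measurements on different scales
height_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weight_kg = np.array([45.0, 55.0, 65.0, 75.0, 85.0])

z_height = standardize(height_cm)
z_weight = standardize(weight_kg)

print(per_minute)
print(z_height)
print(z_weight)
```

After standardization both series are unitless, so they can be compared directly even though the raw values were in centimeters and kilograms.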
Summary statistics were explained in "Statistics and Interval Estimates". The mean and the variance are typical examples. By quantifying, for example, which value sits in the middle of the data and how widely the data is scattered, you can summarize a large amount of data and get an overview.
For example, if you look at securities reports, you can see the average annual income at each company in the IT industry. In practice, however, the average alone is not enough. Without other summary statistics such as the variance and the median, you cannot tell whether the average hides a huge gap between the top and the bottom, with only a few people earning a high income, or whether most employees earn roughly that amount. Summary statistics are therefore indispensable whether you scan all of the data or [sample from the population to obtain the statistic](http://qiita.com/ynakayama/items/4362c439d9ea814cbe60). They are also needed when you check the goodness of fit of the distribution of the target data.
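The point about averages can be illustrated with a small sketch. The two companies and their salary figures below are entirely hypothetical (units of 10,000 yen); they are chosen only so that the means match while the median and the spread differ.

```python
import statistics

# Hypothetical annual incomes in units of 10,000 yen (not real data)
company_a = [550, 580, 600, 620, 650]   # everyone earns about the same
company_b = [300, 320, 350, 380, 1650]  # one person earns far more

for name, incomes in [("A", company_a), ("B", company_b)]:
    mean = statistics.mean(incomes)
    median = statistics.median(incomes)
    spread = statistics.pstdev(incomes)  # population standard deviation
    print(f"company {name}: mean={mean} median={median} stdev={spread:.1f}")
```

Both companies report the same mean, but the median and the standard deviation show that the typical employee at company B earns much less than the average suggests.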
"Hypothesis test and probability distribution" explained how to obtain basic statistics. In practice, however, I think it is hard to convey the results of a data analysis to a general audience by reporting summary statistics alone. A useful visualization method for this is the box plot.
First, let's generate random numbers that follow a Poisson distribution and compute their statistics. The procedure has been covered in the earlier articles, so a detailed explanation is omitted.
import numpy as np
import pandas as pd
s1 = pd.Series(np.random.poisson(5, 10000))
s1.describe()
#=>
# count 10000.000000
# mean 5.026600
# std 2.211421
# min 0.000000
# 25% 3.000000
# 50% 5.000000
# 75% 6.000000
# max 14.000000
#dtype: float64
s2 = pd.Series(np.random.poisson(5, 10000))
s3 = pd.Series(np.random.poisson(5, 10000))
The "box" in a box plot contains the middle half of the data (from the 25th to the 75th percentile). Because these values form the middle layer of the population, they can be regarded as "a collection of data representing the population". And if you were asked to pick the single most representative value inside the box, it would be the median, which stands for all of the data.
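To make the parts of the box concrete, here is a minimal sketch that computes the quartiles by hand with NumPy. The seed and the variable names are assumptions for illustration, and the 1.5 × IQR whisker limits follow matplotlib's default `whis=1.5` setting rather than anything stated in this article.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)
sample = rng.poisson(5, 10000)

# Edges of the box (25th and 75th percentiles) and the median line
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1  # interquartile range: the height of the box

# Whisker limits under the 1.5 * IQR rule; points outside are drawn as outliers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Share of the data that falls inside the box (at least half; ties in
# integer-valued data such as Poisson counts can push it above 0.5)
inside_box = np.mean((sample >= q1) & (sample <= q3))

print(q1, median, q3, iqr)
print(lower_limit, upper_limit)
print(inside_box)
```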
We have now generated three vectors of length 10000, each with a mean of about 5. You can plot them with the boxplot function.
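import matplotlib.pyplot as plt
# Create the figure and the axes that boxplot draws on
fig, ax = plt.subplots()
# Draw one box per series at x = 1, 2, 3
ax.boxplot([s1, s2, s3])
xticks = ['A', 'B', 'C']
plt.xticks([1, 2, 3], xticks)
plt.grid()
plt.ylabel('Length')
plt.xlabel('type')
# Save before show(): once the window is closed the figure may be empty
plt.savefig("image.png")
plt.show()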
This gives three box plots whose medians are all close to 5.
To make this clearer, let's overlay a scatter plot of the individual data points on the box plot.
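# A Series of zeros with the same length as s1. It is a placeholder at x=0
# so that s1, s2, s3 are drawn at x=1, 2, 3, the same positions as the boxes
s0 = pd.Series([0] * len(s1))
# Draw on the same ax as the box plot, before calling plt.show()
ax.plot([s0, s1, s2, s3], marker='.', linestyle='None')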
We found that summary statistics can be used to get a bird's eye view of the entire data, and box plots can be used to visualize it.