Statistics uses many terms, ranging from familiar ones such as the mean and the deviation to ones you may not have met before. Let's start by making sure we understand the basic terms correctly. (As a general rule, write the code and check the results on Google Colaboratory.)
import numpy as np
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/karaage0703/machine-learning-study/master/data/karaage_data.csv")
The pandas read_csv function reads the CSV file given as its argument (a file name or, as here, a URL) and stores the result in the variable df.
df.head()
The head method displays only the first 5 rows of the data stored in the variable df.
You can see that the data consists of two variables, x and y.
df.describe()
The pandas describe method returns a table of basic statistics.
A value that summarizes the data in this way is called a **statistic**. By looking at statistics, you can see the characteristics of the sample. Let's check the eight statistics that describe reports as basic statistics and the meaning of each term.
Statistic | Term | Variable x | Variable y | Meaning |
---|---|---|---|---|
count | Sample size | 6 | 6 | The number of values; here n = 6, that is, 6 rows of data in total |
mean | Mean | 14.33 | 3.33 | The arithmetic average, used as a representative value (a value that represents the sample) |
std | Standard deviation | 16.01 | 1.51 | std is short for standard deviation, a statistic that shows how much the data varies |
min | Minimum | 1.00 | 2.00 | The smallest value of the variable |
25% | First quartile | 2.75 | 2.25 | With the data sorted in ascending order, the value one quarter of the way up from the smallest |
50% | Second quartile (median) | 7.50 | 3.00 | With the data sorted in ascending order, the value two quarters (half) of the way up from the smallest |
75% | Third quartile | 23.50 | 3.75 | With the data sorted in ascending order, the value three quarters of the way up from the smallest |
max | Maximum | 40.00 | 6.00 | The largest value of the variable |
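
The quartile rows are the least obvious ones in the table, so here is a rough sketch of how a value such as 2.75 can arise. It assumes linear interpolation between neighbouring sorted values, which is the pandas default. The data frame below is a stand-in, not the real CSV: its values are an assumption, chosen only so that the statistics match the table, and the row order and x/y pairing are arbitrary. The helper quartile_by_hand is a hypothetical name introduced for this illustration.

```python
import pandas as pd

# Stand-in data (an assumption, not the contents of the real CSV file):
# values picked so that the summary statistics match the table above.
df = pd.DataFrame({"x": [1, 2, 5, 10, 28, 40], "y": [2, 2, 3, 3, 4, 6]})


def quartile_by_hand(values, q):
    """Quantile by linear interpolation between neighbouring sorted values."""
    s = sorted(float(v) for v in values)
    pos = q * (len(s) - 1)          # fractional position in the sorted data
    lo = int(pos)                   # index of the value just below that position
    hi = min(lo + 1, len(s) - 1)    # index of the value just above it
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


# For x, the position is 0.25 * 5 = 1.25, so the result lies a quarter of the
# way from the 2nd sorted value (2) to the 3rd (5): 2 + 0.25 * 3 = 2.75.
print(quartile_by_hand(df["x"], 0.25))   # 2.75
print(float(df["x"].quantile(0.25)))     # 2.75 (pandas, same interpolation)
```

Other interpolation rules exist and can give slightly different quartiles for small samples, so the figures in the table are specific to this convention.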
First, let's calculate the average.
df.describe().loc['mean']
Next, get the standard deviation and the first quartile in the same way, by specifying the row label of the statistic in loc['xxx'].
df.describe().loc['std']
df.describe().loc['25%']
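As a possible shortcut, loc also accepts a list of labels, and each statistic is available as its own method. The sketch below shows both. It uses a stand-in data frame rather than the real CSV: the values are an assumption, chosen so that the statistics match the table earlier in this article.

```python
import pandas as pd

# Stand-in data (an assumption, not the contents of the real CSV file).
df = pd.DataFrame({"x": [1, 2, 5, 10, 28, 40], "y": [2, 2, 3, 3, 4, 6]})

stats = df.describe()

# Several rows at once: pass a list of row labels to loc.
subset = stats.loc[["mean", "std", "25%"]]
print(subset.round(2))

# A single cell: row label first, then column label.
mean_x = stats.loc["mean", "x"]

# The same values computed directly, without building the whole summary.
direct_mean = df.mean()
direct_std = df.std()
direct_q1 = df.quantile(0.25)
```

Calling describe once and reusing the result, as stats does here, avoids recomputing the whole summary for every lookup.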
So far, we've used pandas to look at the basic statistics. Next, let's calculate various statistics using NumPy, and consider how each statistic is calculated and what its characteristics are.
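One thing worth knowing before switching libraries: NumPy's std divides by n by default, while pandas divides by n - 1, so the two can disagree on the same data. The sketch below illustrates this on a stand-in data frame. Its values are an assumption, chosen to match the statistics in the table, and are not read from the real CSV.

```python
import numpy as np
import pandas as pd

# Stand-in data (an assumption, not the contents of the real CSV file).
df = pd.DataFrame({"x": [1, 2, 5, 10, 28, 40], "y": [2, 2, 3, 3, 4, 6]})
x = df["x"].to_numpy()

pop_std = np.std(x)             # divides by n (ddof=0), about 14.61
sample_std = np.std(x, ddof=1)  # divides by n - 1, like pandas, about 16.01
q1 = np.percentile(x, 25)       # first quartile, linear interpolation by default

print(np.mean(x), pop_std, sample_std, q1)
```

Passing ddof=1 to np.std is what makes the NumPy result line up with the std row from describe.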