First, let's take a look at the usefulness of Pandas and Numpy.
- **Numpy handles multidimensional data**
  - Numpy targets numeric data in multidimensional arrays, and has few functions that handle other data types.
  - It is quite fast, and when code is converted with Cython (translated to C / C++ and compiled), it can run about as fast as C.
- **Pandas handles real-world data other than multidimensional arrays**
  - For real-world data other than multidimensional arrays, it is well suited to reading, writing, and processing data stored in CSV, SQL, and Excel.
  - It has functions for processing all kinds of data: not only numeric data but also time series data and character strings.
Pandas and Numpy are therefore often combined, each used at a different step of a single workflow.
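As a rough illustration of how the two libraries can hand data to each other, here is a minimal sketch. The CSV text and the column name value are made up for this example; a real workflow would read an actual file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = "value\n12\n3\n5\n2\n6\n7\n9\n6\n4\n11\n"

# Step 1: Pandas reads and holds the tabular data.
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: hand the numeric column to Numpy as an array.
values = df["value"].to_numpy()

# Step 3: Numpy does the numeric computation.
mean_value = np.mean(values)
print(mean_value)
```

Pandas takes care of input and cleanup, and Numpy takes over once the data is a plain numeric array.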
data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]
I created a standard Python list by writing the values, separated by commas, directly inside variable_name = [].
Below, we will calculate various statistics using Numpy.
import numpy as np  # imported once; used by all Numpy calls below
np.mean(data)
Numpy's mean function calculated an average of 6.5.
np.median(data)
Numpy's median function calculated a median of 6.0.
The median is the value located exactly in the center when the data is sorted by size, and it is equal to the second quartile. When the number of data points is even, it is the average of the two values closest to the center.
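The rule for odd and even counts can be sketched by hand and checked against Numpy. The helper median_by_hand is a name made up for this illustration, not part of any library.

```python
import numpy as np


def median_by_hand(values):
    """Median via sorting: the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits exactly in the center.
        return float(ordered[mid])
    # Even count: average the two values closest to the center.
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

even_result = median_by_hand(data)       # 10 values: average of the 5th and 6th
odd_result = median_by_hand(data[:-1])   # 9 values: the 5th value

print(even_result, np.median(data))
print(odd_result, np.median(data[:-1]))
print(np.percentile(data, 50))           # second quartile
```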
np.sum(data)
Numpy's sum function calculated a total value of 65.
np.std(data)
Numpy's std function gave the result 3.138470965295043.
However, this is the value of the **population standard deviation**.
Note that there are two types of standard deviation: the population standard deviation and the **unbiased standard deviation**.
The entire target of a survey or study is called the **population**, and the part extracted from the population is called the **sample**. Surveys that examine the entire population are called **complete surveys** or **exhaustive surveys**. Typical examples are probably the "Census", which requires everyone living in Japan to respond, and the "Census of Business Establishments / Companies", which could be called the national census of business establishments.

Surveys on that scale are rare. In other words, almost all data handled in the real world can be regarded as samples from sample surveys. However, what we want to know is not the characteristics and tendencies of the sample itself; we always try to estimate the characteristics and tendencies of the population by working with the sample.

Now, calculating the standard deviation requires two other statistics. First calculate the mean, use it to calculate the variance, and then take the square root of the variance to get the standard deviation. The mean, variance, and standard deviation calculated from the sample are prefixed with "sample" or "unbiased" and are called the **sample mean $\bar{X}$**, **unbiased variance $s^2$**, and **unbiased standard deviation $s$**. The mean, variance, and standard deviation of the population, which are estimated from the sample, are called the **population mean $\mu$**, **population variance $\sigma^2$**, and **population standard deviation $\sigma$** to distinguish them.
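Written out as formulas, the difference between the two comes down to the divisor. The notation $x_i$ for each data point and $n$ for the number of data points is introduced here for illustration. The population formulas apply when the data is the whole population.

```latex
\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i

% Population variance and standard deviation: divide by n
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu)^2,
\qquad \sigma = \sqrt{\sigma^2}

% Unbiased variance and standard deviation: divide by n - 1
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{X})^2,
\qquad s = \sqrt{s^2}
```

For the ten values used in this article, the mean is 6.5 and the sum of squared deviations is 98.5. Dividing by 10 gives 9.85, whose square root is the 3.138... shown above. Dividing by 9 gives about 10.944, whose square root is about 3.308.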
So, if you want to use Numpy to find the unbiased standard deviation, do the following:
np.std(data, ddof=1)
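To see what ddof changes, the two standard deviations can be computed by hand and compared with Numpy. This is only a sketch using the same data list; the variable names are chosen for this example.

```python
import math

import numpy as np

data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

n = len(data)
mean = sum(data) / n
# Sum of squared deviations from the mean.
squared_dev = sum((x - mean) ** 2 for x in data)

population_std = math.sqrt(squared_dev / n)       # divide by n (ddof=0)
unbiased_std = math.sqrt(squared_dev / (n - 1))   # divide by n - 1 (ddof=1)

print(population_std, np.std(data))
print(unbiased_std, np.std(data, ddof=1))
```

The ddof argument is subtracted from n in the divisor, so ddof=1 gives the n - 1 version.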
Let's compare it with the standard deviation calculated using Pandas earlier.
import pandas as pd
df = pd.DataFrame(data)  # Convert data to a Pandas DataFrame
df.describe().loc['std']
The standard deviation calculated by Pandas is the unbiased standard deviation, so it matches np.std(data, ddof=1) rather than np.std(data).
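The different defaults can be checked side by side. This is a minimal sketch with the same data; it uses a Series instead of a DataFrame only to keep the output to a single number.

```python
import math

import numpy as np
import pandas as pd

data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]
series = pd.Series(data)

pandas_default = series.std()            # Pandas default: ddof=1 (unbiased)
pandas_population = series.std(ddof=0)   # ask Pandas for the population value
numpy_default = np.std(data)             # Numpy default: ddof=0 (population)
numpy_unbiased = np.std(data, ddof=1)    # ask Numpy for the unbiased value

print(pandas_default, numpy_unbiased)
print(pandas_population, numpy_default)
```

When results from the two libraries are mixed, passing ddof explicitly on both sides avoids comparing different quantities.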
Next, let's calculate the basic statistics using statistics, a module in the Python standard library.
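As a minimal sketch of what that might look like for the same data list, the statistics module has separate functions for the population and unbiased standard deviations.

```python
import statistics

data = [12, 3, 5, 2, 6, 7, 9, 6, 4, 11]

mean_value = statistics.mean(data)        # arithmetic mean
median_value = statistics.median(data)    # average of the two middle values here
population_std = statistics.pstdev(data)  # population standard deviation
unbiased_std = statistics.stdev(data)     # sample (n - 1) standard deviation

print(mean_value, median_value)
print(population_std, unbiased_std)
```

Here pstdev corresponds to np.std(data) and stdev corresponds to np.std(data, ddof=1).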