Processing datasets with pandas (2)

Yesterday I explained dataset processing with pandas; this article continues from there.

Normalize the data

Normalization has already come up in passing in earlier articles, but I don't think I ever explained it properly.

**Normalization** in statistics means transforming data measured on different scales according to a common rule so that they are easier to work with.

For example, suppose you scored 90 points in Japanese and 70 points in math. Comparing the raw numbers, your Japanese grade looks better, but what if the average score is 85 in Japanese and 55 in math? Then you are only 5 points above average in Japanese but 15 points above average in math. The advantage of normalization is that it lets you compare data measured against different baselines in this way.

Generally, it means converting the values so that the mean is 0 and the variance (and therefore the standard deviation) is 1.

This can be calculated with the following formula.

\mathrm{Normalized}(A(n)) = \frac{A(n) - \mu(A)}{\sigma(A)}

That is, subtract the mean and divide by the standard deviation. This results in a mean of 0 and a standard deviation of 1.
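
The formula above can be sketched in pandas as follows. The scores and column names are made up for illustration; they are chosen so that the averages match the 85 and 55 from the example. Note that pandas computes the sample standard deviation (ddof=1) by default, so ddof=0 is passed here to get a standard deviation of exactly 1.

```python
import pandas as pd

# Hypothetical test scores; the first row is "you" (90 in Japanese, 70 in math).
scores = pd.DataFrame({
    "japanese": [90, 85, 80, 95, 75],   # mean is 85
    "math": [70, 55, 40, 65, 45],       # mean is 55
})

# Subtract each column's mean and divide by its standard deviation.
# ddof=0 selects the population standard deviation (pandas defaults to ddof=1).
normalized = (scores - scores.mean()) / scores.std(ddof=0)

print(normalized)
print(normalized.mean())        # approximately 0 for every column
print(normalized.std(ddof=0))   # approximately 1 for every column
```

In the normalized table the first row's math value is larger than its Japanese value, which is the comparison the raw scores hide.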

Visualize normalization

The best way to understand this is to try it yourself and look at the results. Let's do that with pandas.

First, divide each row of the data frame by its row total (the sum taken across the columns), so that every row sums to 1.

data.div(data.sum(axis=1), axis=0)
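
As a small check of what the two axis arguments do, here is a sketch on a tiny made-up frame (the values and labels are assumptions for illustration). It also shows the column-wise variant for comparison, which is not part of the original one-liner.

```python
import pandas as pd

# Hypothetical data: two rows, three columns.
data = pd.DataFrame(
    {"a": [1.0, 2.0], "b": [3.0, 2.0], "c": [4.0, 4.0]},
    index=["x", "y"],
)

# sum(axis=1) gives one total per row; div(..., axis=0) aligns those
# totals with the row index, so each row is divided by its own total.
row_share = data.div(data.sum(axis=1), axis=0)

# For comparison: divide each column by its column total instead.
col_share = data.div(data.sum(axis=0), axis=1)

print(row_share.sum(axis=1))  # every row sums to 1
print(col_share.sum(axis=0))  # every column sums to 1
```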

Normalize by the interquartile range

(data - data.quantile(0.5).values) / (data.quantile(0.75)-data.quantile(0.25)).values
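
The expression above subtracts each column's median and divides by its interquartile range (the 75th percentile minus the 25th percentile). A minimal sketch with made-up values, including one deliberate outlier in column a, spells out the steps:

```python
import pandas as pd

# Hypothetical data; column "a" contains an outlier (100.0).
data = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})

median = data.quantile(0.5)                       # per-column median
iqr = data.quantile(0.75) - data.quantile(0.25)   # per-column interquartile range

# Center on the median and scale by the IQR.
scaled = (data - median) / iqr

print(scaled)
print(scaled.median())  # 0 for every column
```

Because the median and the quartiles barely move when a single extreme value is present, the outlier does not distort the scaling of the other values the way it would with the mean and standard deviation.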

Logarithmic transformation

A logarithmic transformation takes the logarithm of a variable. If the variable follows a lognormal distribution, the transformed variable follows a normal distribution.

A logarithmic transformation also makes it easier to handle and display data whose values range from tiny fractions to huge numbers.

It may be easier to understand if it is expressed in code.

import numpy as np
data.apply(np.log)
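
To see the effect, here is a sketch that draws samples from a lognormal distribution and compares the skewness before and after taking the logarithm. The sample size, seed, and column name are arbitrary assumptions. Keep in mind that the logarithm is only defined for positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 draws from a lognormal distribution.
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.lognormal(mean=0.0, sigma=1.0, size=10_000)})

# Take the natural logarithm of every value, column by column.
logged = data.apply(np.log)

# Skewness is strongly positive before the transformation
# and close to 0 (symmetric, like a normal distribution) after it.
print(data["x"].skew())
print(logged["x"].skew())
```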

Find the rate of change

The rate of change (growth rate) is a number that indicates how much a value has changed relative to a reference value.

pct_change() converts the values of a data frame to rates of change relative to the previous row. The point to keep in mind is that the first row has no previous value, so its rate of change is NaN. The rate of change also came up in passing in a previous article.

data.T.pct_change().dropna(axis=0)
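
Here is a small sketch of the same call on made-up data, assuming (as the transpose suggests) that the time steps are stored as columns. The labels t0 to t2 and the values are hypothetical.

```python
import pandas as pd

# Hypothetical data: rows are items, columns are time steps.
data = pd.DataFrame(
    {"t0": [100.0, 50.0], "t1": [110.0, 45.0], "t2": [121.0, 54.0]},
    index=["a", "b"],
)

# Transpose so that time runs down the rows, then compute the change
# of each row relative to the row before it.
rates = data.T.pct_change()
print(rates)  # the t0 row is NaN because nothing precedes it

# Drop the rows that contain NaN (here, only t0).
rates = rates.dropna(axis=0)
print(rates)
```

For item a, the value goes from 100 to 110 to 121, so both remaining rows show a rate of about 0.1, that is, a 10% increase.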

As I introduced yesterday, dropping the missing values gives you a clean table. However, the result can be a little hard to read when plotted, because the first value becomes large.

Save IPython work history

This is not directly related to dataset processing, but it is handy to be able to write out what you tried in IPython to a file and keep it. If a session went well, you can use the saved history as a script as is, or extract code from it, which makes your work more reusable.

import readline
readline.write_history_file("history.py")

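This saves the history of the code you typed into IPython as history.py, which is very convenient. Note that this relies on IPython using readline; IPython 5 and later use prompt_toolkit instead, and there the %history -f history.py magic does the same job.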

Summary

This time, too, I summarized several operations that are often used when processing datasets.