Processing datasets with pandas (1)

Extraction of sample data

Data preprocessing is essential for statistical analysis. The data must first be read into a form the computer can handle, but when the data is large, the turnaround time of the computation often becomes a problem. There are several ways to address this:

Reduce the size of the data
Identify bottlenecks and reduce computational complexity
Improve the performance of the computer

The term "big data" has been around for a long time, but in practice a larger sample is not always necessary. Instead, extract a statistically meaningful sample using a sampling method.
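As a minimal sketch of simple random sampling, assuming the data is already held in a data frame: the frame below and its column names are hypothetical, and DataFrame.sample draws rows without replacement by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows with an age and a score column
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=10_000),
    "score": rng.normal(size=10_000),
})

# Draw 500 rows at random without replacement;
# random_state makes the draw reproducible
sample_n = data.sample(n=500, random_state=0)

# Alternatively, draw a fixed fraction (here 1%) of the rows
sample_frac = data.sample(frac=0.01, random_state=0)
```

Whether 500 rows or 1% is enough depends on the analysis; the numbers here are placeholders.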

I/O is the bottleneck in many data-intensive processes. In that case, consider reading only the data you need, or splitting the original data appropriately to reduce the input size itself.
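One way to reduce the input size might look like the following sketch. It assumes the data is in CSV form; the in-memory text stands in for a large file on disk, and the column names are hypothetical. The usecols and chunksize parameters of pd.read_csv limit which columns are parsed and how many rows are loaded at a time.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file on disk
csv_text = (
    "id,age,height,weight\n"
    "1,25,170,60\n"
    "2,31,165,55\n"
    "3,47,180,82\n"
    "4,52,175,77\n"
)

# Read only the columns that are actually needed
subset = pd.read_csv(io.StringIO(csv_text), usecols=["id", "age"])

# Or process the input in chunks of rows instead of loading it all at once
total_age = 0
for chunk in pd.read_csv(io.StringIO(csv_text), usecols=["age"], chunksize=2):
    total_age += chunk["age"].sum()
```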

Slice and aggregate sample data

Slicing

Slicing is easy when working with data in pandas.

# Extract rows up to and including the index label "30" (age 30 and under)
data_y = data.loc[:"30"]
# Extract rows from the index label "31" onward (age 31 and over)
data_o = data.loc["31":]
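Label-based slices include both endpoints, unlike positional slices. The following sketch illustrates this with a hypothetical frame whose index holds ages as string labels, which is an assumption about how the original data is indexed.

```python
import pandas as pd

# Hypothetical frame indexed by age, stored as string labels
data = pd.DataFrame(
    {"count": [10, 12, 9, 14, 11, 8]},
    index=["28", "29", "30", "31", "32", "33"],
)

# Label slices are inclusive: "30" itself is part of data_y
data_y = data.loc[:"30"]
data_o = data.loc["31":]
```

Because both slices are inclusive at their labelled end, the two pieces together cover every row exactly once.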

You can also merge datasets, as described here (http://qiita.com/ynakayama/items/358a7e043194cecf28f9).

Aggregate

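This is an example of aggregating monthly data into quarterly data using the period average.

data.resample('Q').mean()

In place of .mean(), the methods .sum(), .median(), .max(), .min(), .last() and .first() are also available. Older pandas versions took these names through a how argument, which has since been removed. pandas 2.2 and later also prefer the alias 'QE' over 'Q' for quarter end.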

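A small sketch of quarterly aggregation on hypothetical monthly data follows. It uses the "QS" (quarter start) alias rather than a quarter-end alias, an assumption made here only because "QS" is spelled the same way across pandas versions.

```python
import pandas as pd

# Hypothetical monthly data for one year, indexed by month start
idx = pd.date_range("2020-01-01", periods=12, freq="MS")
data = pd.DataFrame({"value": range(1, 13)}, index=idx)

# Quarterly mean; "QS" labels each quarter by its first day
quarterly_mean = data.resample("QS").mean()

# Several aggregations at once on a single column
quarterly_stats = data["value"].resample("QS").agg(["sum", "min", "max"])
```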
Handling of missing values

Data sets are not always neatly organized. pandas incorporates various idioms for handling missing values that practitioners in the field have developed.

Fill in the holes

data.fillna(0)

In the above example, missing values are replaced with 0. If you use data.fillna(data.mean()) instead, each missing value is filled with the mean of its column.
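The difference between the two fills can be seen on a tiny hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
data = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [np.nan, 4.0, 8.0]})

# Replace every missing value with 0
filled_zero = data.fillna(0)

# Replace each missing value with the mean of its own column
# (the column means here are A=2.0 and B=6.0)
filled_mean = data.fillna(data.mean())
```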

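Use a forward fill to fill each missing value with the value immediately preceding it. Older code writes this as data.fillna(method='ffill'), which is deprecated in recent pandas versions.

data.ffill()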
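As a sketch on a hypothetical series, forward filling carries the last valid value forward, while its counterpart, backward filling, takes the next valid value:

```python
import numpy as np
import pandas as pd

# Hypothetical series with gaps in the middle and at the end
s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: each gap takes the last valid value before it
forward = s.ffill()

# Backward fill: each gap takes the next valid value after it;
# the trailing NaN has no later value and stays missing
backward = s.bfill()
```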

It is also easy to fill a missing value by linear interpolation between the values before and after it.

data.interpolate()
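For instance, on a hypothetical series with two consecutive gaps, the default linear method spaces the filled values evenly between the neighbouring observations:

```python
import numpy as np
import pandas as pd

# Hypothetical series: two missing values between 0.0 and 3.0
s = pd.Series([0.0, np.nan, np.nan, 3.0])

# The default method is "linear", which treats the values as equally spaced
interpolated = s.interpolate()
```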

Often you will want to delete data that contains missing values. Remove it as follows.

data.dropna(axis=0)  # axis=0 drops rows, axis=1 drops columns
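The effect of the axis argument can be checked on a small hypothetical frame with a single missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: only column A, row 1 is missing
data = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [4.0, 5.0, 6.0],
})

# axis=0 drops every row that contains a missing value (row 1 here)
rows_dropped = data.dropna(axis=0)

# axis=1 drops every column that contains a missing value (column A here)
cols_dropped = data.dropna(axis=1)
```

Neither call modifies data itself; each returns a new data frame.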

Add and replace data

Add a new column, data['New'], filled with random numbers.

import numpy as np
data['New'] = np.random.rand(data.shape[0])  # one random value per row

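Next, add a row. Pass the existing data frame and a one-row data frame to pd.concat(). The .append() method that older code used for this was removed in pandas 2.0. The example assumes an integer index and columns A to E.

new_row = pd.DataFrame([[1,2,3,4,5]],columns=["A","B","C","D","E"],index=data.index[-1:]+1)
data = pd.concat([data, new_row])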
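A self-contained sketch of adding a row with pd.concat, assuming a hypothetical one-row frame with an integer index and columns A to E:

```python
import pandas as pd

# Hypothetical starting frame with a default integer index (0)
data = pd.DataFrame([[10, 20, 30, 40, 50]], columns=["A", "B", "C", "D", "E"])

# Build the new row as a one-row frame; the nested list makes it
# one row of five columns, and its index continues from the last label
new_row = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=["A", "B", "C", "D", "E"],
    index=data.index[-1:] + 1,
)

# Concatenate along the rows
data = pd.concat([data, new_row])
```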

You can overwrite values by assigning the replacement data through data.iloc. Since data.shape holds the number of rows and columns of the data frame, you can overwrite a row or column with random numbers by generating as many values as it has entries.

import numpy as np
# Overwrite the first row with random numbers (one per column)
data.iloc[0] = np.random.rand(data.shape[1])
# Overwrite the first column with random numbers (one per row)
data.iloc[:, 0] = np.random.rand(data.shape[0])
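The same overwrite can be tried on a small hypothetical frame. This sketch swaps in NumPy's seeded Generator (np.random.default_rng) for reproducibility, which is a choice made here rather than something the original code uses, and it starts from a float frame so that the assigned values match the column dtype.

```python
import numpy as np
import pandas as pd

# Seeded generator: an assumption made so the example is reproducible
rng = np.random.default_rng(0)

# Hypothetical 3x4 frame of zeros (float dtype)
data = pd.DataFrame(np.zeros((3, 4)), columns=["A", "B", "C", "D"])

# data.shape is (rows, columns): shape[1] values fill one row
data.iloc[0] = rng.random(data.shape[1])

# shape[0] values fill one column
data.iloc[:, 0] = rng.random(data.shape[0])
```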

To sort the data, pass a list of column names to the .sort_values() function. Older pandas versions used .sort(), which has been removed. In the following example, rows are sorted in ascending order by the first column, with ties broken by the second column. The result is returned as a new data frame; data itself is not modified.

data.sort_values(by=list(data.columns[0:2]),ascending=True)
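A short sketch of a two-key sort on a hypothetical frame, showing that the first column takes priority and the second breaks ties:

```python
import pandas as pd

# Hypothetical frame; column C only serves to identify the rows
data = pd.DataFrame({
    "A": [2, 1, 2, 1],
    "B": [1, 2, 0, 1],
    "C": ["w", "x", "y", "z"],
})

# Sort by A first, then by B within equal values of A
sorted_data = data.sort_values(by=list(data.columns[0:2]), ascending=True)
```

To keep the sorted result under the same name, assign it back, as in data = data.sort_values(...).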

Summary

This article summarized useful operations for processing datasets with pandas.