Processing datasets with pandas (1)

Extraction of sample data

Data preprocessing is essential for statistical analysis. The data must first be read into a form the computer can handle, but turnaround time often becomes a problem when the computation involves large data. There are several measures you can take in such cases.

The term "big data" has been around for a long time, but in practice it is often unnecessary to keep increasing the sample size. Instead, extract a statistically meaningful sample with a sampling method.
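
As a minimal sketch of drawing a sample, the example below uses DataFrame.sample for simple random sampling. The population data frame and its column names are made up for illustration; they do not come from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical population: 10,000 rows with an age column and a numeric score.
rng = np.random.default_rng(0)
population = pd.DataFrame({
    "age": rng.integers(18, 80, size=10_000),
    "score": rng.normal(loc=50, scale=10, size=10_000),
})

# Simple random sample of 500 rows without replacement.
# random_state makes the draw reproducible.
sample = population.sample(n=500, random_state=0)

# Alternatively, sample a fixed fraction of the rows (here 5%).
sample_frac = population.sample(frac=0.05, random_state=0)
```

Whether 500 rows is enough depends on the analysis; the number here is only a placeholder.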

I/O is the bottleneck in many data-intensive processes. In that case, consider reading only the data you need, or splitting the original data appropriately to reduce the input size itself.
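
One possible way to reduce the input size is to use the usecols, nrows, and chunksize parameters of pd.read_csv. The sketch below reads from an in-memory string so that it is self-contained; the CSV content and column names are hypothetical stand-ins for a large file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a large file.
csv_text = (
    "id,age,city,score\n"
    "1,25,Tokyo,70\n"
    "2,31,Osaka,55\n"
    "3,42,Nagoya,90\n"
    "4,28,Tokyo,65\n"
)

# Read only the columns that are actually needed.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["age", "score"])

# Read only the first two data rows.
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# Process the input two rows at a time and accumulate a running total,
# so the whole file never has to be held in memory at once.
total = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["score"].sum()
```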

Slice and aggregate sample data

Slicing

Slicing is easy when working with data in pandas.

# Extract rows up to and including the index label "30" (age 30)
data_y = data.loc[:"30"]
# Extract rows from the index label "31" (age 31) onward
data_o = data.loc["31":]
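
For a self-contained illustration of label-based slicing, the sketch below assumes a data frame whose sorted index holds ages as integers. The values and the column name are hypothetical.

```python
import pandas as pd

# Hypothetical data indexed by age (sorted integer index).
data = pd.DataFrame({"count": [5, 8, 13, 21, 34]}, index=[20, 25, 30, 31, 40])
data.index.name = "age"

# Label-based slices with .loc include both endpoints.
data_y = data.loc[:30]   # ages up to and including 30
data_o = data.loc[31:]   # ages 31 and over
```

Note that label slices include the end label, unlike positional slices in plain Python.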

You can also merge datasets, as described in http://qiita.com/ynakayama/items/358a7e043194cecf28f9.

Aggregate

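This is an example of aggregating monthly data into quarterly data by taking the mean over each period.

data.resample('Q').mean()

The how argument used in older pandas versions has been removed; call the aggregation method on the resampled object instead. sum, mean, median, max, min, last, and first are available. In pandas 2.2 and later the quarter-end alias is written 'QE' rather than 'Q'.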

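A minimal sketch of the same monthly-to-quarterly averaging follows. It groups on quarterly periods instead of calling resample, as an alternative that does not depend on version-specific resample arguments or frequency aliases. The monthly values are made up.

```python
import pandas as pd

# Hypothetical monthly data for 2020 with values 1..12.
monthly = pd.DataFrame(
    {"value": range(1, 13)},
    index=pd.date_range("2020-01-01", periods=12, freq="MS"),
)

# Map each month to its calendar quarter, then average within each quarter.
quarterly = monthly.groupby(monthly.index.to_period("Q")).mean()
```

Swapping .mean() for .sum(), .median(), .max(), and so on gives the other aggregations.
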
Handling of missing values

Datasets are not always neatly organized. pandas provides a range of idioms, refined by practitioners in the field, for handling missing values.

Fill in the holes

data.fillna(0)

In the above example, missing values are replaced with 0. If you use data.fillna(data.mean()) instead, each missing value is filled with the mean of its column.

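Use a forward fill to replace each missing value with the last valid value before it. This is what method="ffill" did in older pandas versions, where it was written data.fillna(method='ffill').

data.ffill()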
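
To compare the three fill strategies side by side, here is a small sketch on a made-up single-column data frame. It uses DataFrame.ffill for the forward fill.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values.
data = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Replace missing values with a constant.
filled_zero = data.fillna(0)

# Replace missing values with the column mean (mean of 1, 3, 5 is 3).
filled_mean = data.fillna(data.mean())

# Forward fill: carry the last valid value forward.
filled_ffill = data.ffill()
```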

Linear interpolation between the values before and after a missing value is also easy.

data.interpolate()
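
As a small worked example on made-up values, the default linear method of interpolate fills each gap with evenly spaced values between its neighbors:

```python
import numpy as np
import pandas as pd

# Hypothetical data: a gap of two values, then a gap of one value.
data = pd.DataFrame({"x": [1.0, np.nan, np.nan, 4.0, np.nan, 8.0]})

# Default method is linear: rows are treated as equally spaced.
interpolated = data.interpolate()
```

Here the first gap becomes 2 and 3, and the second gap becomes 6.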

Often you will want to delete data that contains missing values. Remove it as follows.

data.dropna(axis=0)  # axis=0 drops rows, axis=1 drops columns
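
The difference between the two axes can be seen in this sketch, which uses a hypothetical two-column data frame with one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data: column A has a missing value in the second row.
data = pd.DataFrame({"A": [1.0, np.nan, 3.0], "B": [4.0, 5.0, 6.0]})

# axis=0: drop every row that contains a missing value.
rows_dropped = data.dropna(axis=0)

# axis=1: drop every column that contains a missing value.
cols_dropped = data.dropna(axis=1)
```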

Add and replace data

Add a new column called data['New']. Here np is NumPy (import numpy as np).

data['New'] = np.random.rand(data.shape[0])

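Next, add a row. DataFrame.append() has been removed from recent pandas versions (2.0 and later), so pass the existing data frame and a new one-row data frame to pd.concat() instead. Note that the row must be given as a nested list so that it forms one row with five columns.

data = pd.concat([data, pd.DataFrame([[1,2,3,4,5]], columns=["A","B","C","D","E"], index=data[-1:].index+1)])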
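
A runnable sketch of adding a row, assuming pandas 2.0 or later where DataFrame.append is no longer available and pd.concat is used in its place. The starting data frame is hypothetical and has a default integer index.

```python
import pandas as pd

# Hypothetical starting data with columns A..E and index 0, 1.
data = pd.DataFrame(
    [[10, 20, 30, 40, 50], [11, 21, 31, 41, 51]],
    columns=["A", "B", "C", "D", "E"],
)

# Build a one-row data frame whose index continues from the last label.
new_row = pd.DataFrame(
    [[1, 2, 3, 4, 5]],
    columns=data.columns,
    index=[data.index[-1] + 1],
)

# Concatenate along rows to append the new row.
data = pd.concat([data, new_row])
```

Computing the new index as the last label plus one only makes sense for an integer index.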

You can overwrite values by assigning the replacement data through data.iloc. Because data.shape holds the dimensions of the data frame as (number of rows, number of columns), you can overwrite a row or a column with random numbers by generating as many values as that dimension requires.

# np is NumPy (import numpy as np)
# Overwrite the first row with random numbers
data.iloc[0] = np.random.rand(data.shape[1])
# Overwrite the first column with random numbers
data.iloc[:, 0] = np.random.rand(data.shape[0])
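
The following sketch shows the same overwrites on a small hypothetical data frame of zeros, which makes it easy to see which cells changed. It uses NumPy's Generator API for the random numbers, which is a substitution for the rand call above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical 3x4 data frame filled with zeros.
data = pd.DataFrame(np.zeros((3, 4)), columns=["A", "B", "C", "D"])
n_rows, n_cols = data.shape  # (3, 4)

# The first row needs one value per column.
data.iloc[0] = rng.random(n_cols)

# The first column needs one value per row.
data.iloc[:, 0] = rng.random(n_rows)
```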

To sort the data, pass a list of column names to .sort_values() (the older .sort() method has been removed from pandas). In the following example, the rows are sorted in ascending order by the first column, with ties broken by the second column. The sorted result is returned as a new data frame; the original is not modified.

data.sort_values(by=list(data.columns[0:2]), ascending=True)
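
A small sketch of sorting by the first two columns with sort_values, on made-up data, to show how ties in the first column are broken by the second:

```python
import pandas as pd

# Hypothetical data with ties in column A.
data = pd.DataFrame({
    "A": [2, 1, 2, 1],
    "B": [9, 8, 7, 6],
    "C": ["w", "x", "y", "z"],
})

# Sort by A first, then by B within equal values of A.
sorted_data = data.sort_values(by=list(data.columns[0:2]), ascending=True)
```

The original data keeps its row order; assign the result back if you want to replace it.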

Summary

This article summarized operations that are useful when processing datasets with pandas.
