This article summarizes the basics of data manipulation with Pandas, which is essential for data analysis in Python.
It covers everything from important syntax that is easy to forget to a few tips.
Recommended for people like this:
→ I want to try Pandas for the first time!
→ I want to try in Python what I do in R.
→ I can't remember Pandas syntax; it would be convenient if there was a list somewhere ...
→ How much data handling can be done with Python in the first place?
See also ◆ Data manipulation with Pandas: Use Pandas_ply http://qiita.com/hik0107/items/3dd260d9939a5e61c4f6
First, import Pandas and create some sample data as a data frame.
data_creation.py
import pandas as pd
df_sample =\
pd.DataFrame([["day1","day2","day1","day2","day1","day2"],
["A","B","A","B","C","C"],
[100,150,200,150,100,50],
[120,160,100,180,110,80]] ).T # Create some sample data (rows are transposed into columns)
df_sample.columns = ["day_no","class","score1","score2"] # Assign column names
df_sample.index = [11,12,13,14,15,16] # Assign index labels
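One side effect of building the frame from a transposed list of lists is that every column ends up with the generic object dtype, including the scores. The following is a minimal sketch of an alternative construction from a dict, which keeps the score columns numeric. The names `df_typed` and `df_object` are introduced here for illustration only and are not part of the original article.

```python
import pandas as pd

# Same values as df_sample, built column by column from a dict so that
# each column keeps its own dtype (an alternative, not the article's method).
df_typed = pd.DataFrame(
    {
        "day_no": ["day1", "day2", "day1", "day2", "day1", "day2"],
        "class": ["A", "B", "A", "B", "C", "C"],
        "score1": [100, 150, 200, 150, 100, 50],
        "score2": [120, 160, 100, 180, 110, 80],
    },
    index=[11, 12, 13, 14, 15, 16],
)

# A reduced version of the transposed construction, for comparison:
# each original column mixes strings and integers, so the result is object.
df_object = pd.DataFrame([["day1", "day2"], [100, 150]]).T
df_object.columns = ["day_no", "score1"]

print(df_typed.dtypes)   # score1 and score2 are integer columns
print(df_object.dtypes)  # score1 is an object column
```

With numeric score columns, methods such as describe() report numeric statistics directly.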
◆Column / Index Access: Access specific columns and index labels
col_index_access.py
df_sample.columns # Get the column names
df_sample.index # Get the index labels
df_sample.columns = ["day_no","class","score1","score2"] # Overwrite all column names by assigning a list of the same length
df_sample.index = [11,12,13,14,15,16] # Overwrite the index labels
# Use the rename method
df_sample.rename(columns={'score1': 'point1'}) # Pass the old-to-new mapping as a dictionary; this returns a new DataFrame and leaves df_sample unchanged
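Because rename returns a new DataFrame rather than modifying the original, it is easy to lose the change by accident. The sketch below illustrates this on a small stand-in frame; the frame `df` and the helper variable names are assumptions made for the example.

```python
import pandas as pd

# Small stand-in frame; the values are illustrative only
df = pd.DataFrame({"score1": [100, 150], "score2": [120, 160]}, index=[11, 12])

renamed = df.rename(columns={"score1": "point1"})  # returns a new DataFrame
names_after_call = list(df.columns)    # df itself still has score1
names_in_copy = list(renamed.columns)  # the returned frame has point1

print(names_after_call)
print(names_in_copy)

# Reassign the result to keep the new names
df = df.rename(columns={"score1": "point1"})
```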
Take a look at the data overview
datacheck.py
# Check the number of rows
len(df_sample)
# Check the dimensions
df_sample.shape # Returned as (number of rows, number of columns)
# List column information
df_sample.info() # Column names and their dtypes
# Check basic statistics for each column
# Equivalent to summary() in R
df_sample.describe() # For numeric columns: mean, standard deviation, quartiles, etc.
# head / tail
df_sample.head(10) # Check the first 10 rows
df_sample.tail(10) # Check the last 10 rows
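As a rough illustration of what these overview calls return, here is a minimal sketch on a small frame with one text column and one numeric column. The frame `df` and its values are made up for the example.

```python
import pandas as pd

# Illustrative frame: one text column, one numeric column
df = pd.DataFrame({"class": ["A", "B", "A"], "score1": [100, 150, 200]})

n_rows = len(df)   # number of rows
dims = df.shape    # (rows, columns)

# By default, describe() on a mixed frame summarizes only the numeric columns
stats = df.describe()

# include="all" also summarizes non-numeric columns (count, unique, top, freq)
stats_all = df.describe(include="all")

print(n_rows, dims)
print(stats)
print(stats_all)
```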
Select only specific columns from the data
datacheck.py
# Selection using the special method __getitem__ (square brackets)
df_sample["day_no"] # Specify the column name
df_sample[["day_no","score1"]] # Pass a list of names to select multiple columns
# Column selection using loc
# Syntax: loc[rows, columns]
# You can subset rows and columns at the same time
df_sample.loc[:,"day_no"] # Put ":" in the row position to select all rows
df_sample.loc[:,["day_no","score1"]] # Pass a list of names to select multiple columns
# Column selection using iloc
# Syntax: iloc[row positions, column positions]
df_sample.iloc[:,0] # Select by position
df_sample.iloc[:,0:2] # A slice selects consecutive positions. A list of positions also works
# Column selection using ix
# ix accepted both column names and column positions, but it was deprecated in pandas 0.20 and removed in 1.0, so use loc or iloc instead
df_sample.loc[:,"day_no"] # Selecting a single column returns a pandas.Series
df_sample.loc[:,["day_no","score1"]] # Selecting multiple columns returns a pandas.DataFrame
df_sample.loc[df_sample.index[0:4],"score1"] # Rows by position and a column by name
series_bool = [True,False,True,False]
df_sample.loc[:,series_bool] # A Boolean list with one entry per column also works
# Select by partial match of column name
# Like select(contains()) in R's dplyr, which is handy for partial-match selection of column names
# df_sample.filter(like="score") does this in one step; the same selection with a Boolean mask:
score_select = df_sample.columns.str.contains("score") # True for each column whose name contains "score"
df_sample.loc[:,score_select] # Column selection using the Boolean array
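To make the difference between label-based and position-based selection concrete, the following sketch selects the same cells both ways and also shows partial-match column selection. The frame `df` is a shortened stand-in for df_sample, and the result variable names are assumptions for this example.

```python
import pandas as pd

# Shortened stand-in for df_sample
df = pd.DataFrame(
    {
        "day_no": ["day1", "day2", "day1"],
        "class": ["A", "B", "C"],
        "score1": [100, 150, 200],
        "score2": [120, 160, 100],
    },
    index=[11, 12, 13],
)

# loc uses index labels and column names
by_label = df.loc[[11, 12], ["day_no", "score1"]]

# iloc uses integer positions for both rows and columns
by_position = df.iloc[0:2, [0, 2]]

# Partial match on column names, in one step
score_cols = df.filter(like="score")

# The same selection with an explicit Boolean mask
score_cols_mask = df.loc[:, df.columns.str.contains("score")]

print(by_label)
print(score_cols)
```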
◆Subsetting: Partial selection of data based on conditional statements
subsetting.py
## Standard bracket notation
## data_frame[Boolean array]
df_sample[df_sample.day_no == "day1"] # Select only the rows whose day_no column is "day1"
series_bool = [True,False,True,False,True,False]
df_sample[series_bool] # The condition does not have to come from the data frame's own columns
## Notation using the Pandas query method
df_sample.query("day_no == 'day1'")
# This is neater because you don't have to write the data frame name twice.
# Note that the conditional expression must be passed as a string
df_sample.query("day_no == 'day1'|day_no == 'day2'")
# For multiple conditions, join them with "|" for or and "&" for and
select_condition = "day1"
df_sample.query("day_no == select_condition") # ☓ doesn't work
# Because the expression is a string, a bare variable name is looked up as a column name, so this raises an error
df_sample.query("day_no == @select_condition") # ◯ it works
# To use a variable, prefix its name with "@"
## Subsetting using the index
df_sample.query("index == 11 ") # Writing "index" refers to the data frame's index
df_sample.query("index in [11,12] ") # "in" can also be used for an or condition
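The sketch below checks that a Boolean mask and the query method select the same rows, and combines two conditions. The frame `df` is a reduced stand-in for df_sample with numeric scores, and the result variable names are assumptions for this example.

```python
import pandas as pd

# Reduced stand-in for df_sample with a numeric score column
df = pd.DataFrame(
    {
        "day_no": ["day1", "day2", "day1", "day2", "day1", "day2"],
        "score1": [100, 150, 200, 150, 100, 50],
    },
    index=[11, 12, 13, 14, 15, 16],
)

select_condition = "day1"

# Boolean mask in square brackets
by_mask = df[df["day_no"] == select_condition]

# Same filter with query; "@" refers to the Python variable
by_query = df.query("day_no == @select_condition")

# Two conditions joined with "&" (parentheses make the grouping explicit)
combined = df.query("(day_no == 'day1') & (score1 > 100)")

# Filtering on the index
by_index = df.query("index in [11, 12]")

print(by_query)
print(combined)
```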
◆Sorting: Sorts the data.
sorting.py
# DataFrame.sort has been removed from pandas; sort_values replaces it
df_sample.sort_values("score1") # Sort by the value of score1 in ascending order
df_sample.sort_values(["score1","score2"]) # Sort by the values of score1, then score2, in ascending order
df_sample.sort_values("score1",ascending=False) # Sort by the value of score1 in descending order
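As a small worked example of sorting, the sketch below sorts a stand-in frame by one column and then by two columns with a different direction for each. The frame `df` and the result variable names are assumptions for this example.

```python
import pandas as pd

# Stand-in for the score columns of df_sample, stored as numbers
df = pd.DataFrame(
    {
        "score1": [100, 150, 200, 150, 100, 50],
        "score2": [120, 160, 100, 180, 110, 80],
    },
    index=[11, 12, 13, 14, 15, 16],
)

# Ascending sort on a single column
asc = df.sort_values("score1")

# score1 descending, ties broken by score2 ascending
mixed = df.sort_values(["score1", "score2"], ascending=[False, True])

print(asc)
print(mixed)
```

Like rename, sort_values returns a new DataFrame, so assign the result if you want to keep the sorted order.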
◆pandas.concat: Add records and columns by combining data.
concat.py
# Add a row
# Create the data you want to add. Think of it as combining two data frames.
# Suppose you want to add a record with index 17 to df_sample.
df_addition_row =\
pd.DataFrame([["day1","A",100,180]]) # Create a DF with the same column structure as df_sample
df_addition_row.columns =["day_no","class","score1","score2"] # Give it the same column names
df_addition_row.index =[17] # Assign the index label
pd.concat([df_sample,df_addition_row],axis=0) # Vertical concatenation (like rbind in R)
# First argument: the DFs to combine, given as a list [...].
# Second argument: axis=0 specifies vertical concatenation.
# Add a column
# Consider adding a score3 column in addition to score1 and score2.
# Create the data you want to add. Think of it as combining two data frames.
df_addition_col =\
pd.DataFrame([[120,160,100,180,110,80]]).T # Create a DF with the same number of rows as df_sample
df_addition_col.columns =["score3"] # The column name is kept as is after concatenation
df_addition_col.index = [11,12,13,14,15,16]
# Caution! pandas.concat will not give the expected result unless the frames being combined have the same index! (See below)
pd.concat([df_sample,df_addition_col],axis=1) # axis=1 specifies horizontal concatenation.
# About the index
# Rows are aligned by index label. If the index of the new data differs from the original, the non-matching labels become separate rows filled with NaN.
# Try the following
df_addition_col =\
pd.DataFrame([[120,160,100,180,110,80]]).T
df_addition_col.columns =["score3"]
df_addition_col.index = [11,12,13,21,22,23] # Some labels match the original data, but some do not
pd.concat([df_sample,df_addition_col],axis=1) # The result is ...
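To show what the index alignment described above does in practice, here is a minimal sketch on a three-row frame. The frames `df`, `new_row`, `aligned`, and `shifted` and their values are made up for the example. The `join="inner"` option is shown as one possible way to keep only the shared labels.

```python
import pandas as pd

# Small stand-in frame
df = pd.DataFrame({"score1": [100, 150, 200]}, index=[11, 12, 13])

# Vertical concatenation: append one row with a new index label
new_row = pd.DataFrame({"score1": [180]}, index=[17])
stacked = pd.concat([df, new_row], axis=0)

# Horizontal concatenation with a matching index: rows line up one to one
aligned = pd.DataFrame({"score3": [120, 160, 100]}, index=[11, 12, 13])
side_ok = pd.concat([df, aligned], axis=1)

# Horizontal concatenation with a partly different index:
# the result holds the union of labels, with NaN where a frame has no row
shifted = pd.DataFrame({"score3": [120, 160, 100]}, index=[11, 12, 21])
side_outer = pd.concat([df, shifted], axis=1)

# join="inner" keeps only the labels present in both frames
side_inner = pd.concat([df, shifted], axis=1, join="inner")

print(stacked)
print(side_outer)
print(side_inner)
```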
◆Joining: Combines two data sets based on a key.
join.py
## Under construction
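As a rough illustration of joining on a key, here is a minimal sketch using pd.merge. The two small tables, the `teacher` column, and all values are hypothetical and were made up for this example; they do not come from the original article.

```python
import pandas as pd

# Hypothetical score table; class "D" has no match in the second table
df_scores = pd.DataFrame(
    {"class": ["A", "B", "C", "D"], "score1": [100, 150, 100, 70]}
)

# Hypothetical lookup table keyed by class
df_teachers = pd.DataFrame(
    {"class": ["A", "B", "C"], "teacher": ["Sato", "Suzuki", "Tanaka"]}
)

# Inner join: keep only the keys present in both tables
inner = pd.merge(df_scores, df_teachers, on="class", how="inner")

# Left join: keep every row of df_scores, NaN where no teacher is found
left = pd.merge(df_scores, df_teachers, on="class", how="left")

print(inner)
print(left)
```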
◆ Basic summary of data manipulation in Python Pandas - Second half: Data aggregation http://qiita.com/hik0107/items/0ae69131e5317b62c3b7