Pandas basic summary
About Series and DataFrame
Series
What is a Series? A one-dimensional array of values, each with an index label.
When a dict is passed to Series, its keys become the index.
import pandas as pd
data = {
    "Name": "John",
    "Sex": "male",
    "Age": 22
}
pd.Series(data)
>
Name    John
Sex     male
Age       22
dtype: object
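As a small sketch of the key-to-index behavior described above (the record and the variable names here are made up for illustration), the dict keys can be read back from the index and used for lookups:

```python
import pandas as pd

# Hypothetical record used only for illustration
person = {"Name": "John", "Sex": "male", "Age": 22}
person_series = pd.Series(person)

# The dict keys have become the index labels
index_labels = list(person_series.index)

# Values can be looked up by label, much like a dict
name = person_series["Name"]
print(index_labels, name)
```

Because the values mix strings and an integer, pandas stores them with a generic dtype rather than a numeric one.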
Create a Series from a NumPy array
import numpy as np
array = np.array([22, 31, 42, 23])
age_series = pd.Series(array)
age_series
Specify an index for the array and access values by index label
array = np.array(['John', 'male', 22])
john_series = pd.Series(array, index=['Name', 'Sex', 'Age'])
john_series["Name"]
>John
john_series
>
Name    John
Sex     male
Age       22
dtype: object
Get the original NumPy array
age_series.values
>array([22, 31, 42, 23])
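A minimal sketch of getting the array back out of a Series. It also uses `to_numpy()`, which is not part of the original notes but is the accessor the pandas documentation recommends over `.values`:

```python
import numpy as np
import pandas as pd

ages = np.array([22, 31, 42, 23])
age_series = pd.Series(ages)

# Without an explicit index, pandas assigns 0, 1, 2, ...
default_index = list(age_series.index)

# Both accessors return a NumPy ndarray with the same contents
raw_values = age_series.values
raw_numpy = age_series.to_numpy()
print(type(raw_values).__name__, raw_numpy)
```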
DataFrame
Conceptually, a DataFrame is a table (a matrix) made up of Series: each column is a Series, and each row can be handled as a Series too.
The figure above shows only the column Series, but rows are handled as Series as well.
Create a DataFrame from a NumPy array
ndarray = np.arange(10).reshape(2,5)
ndarray
>
array([[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9]])
pd.DataFrame(ndarray,index = ["index1",'index2'] ,columns = ['a','b','c','d','e' ])
>
|        | a | b | c | d | e |
| index1 | 0 | 1 | 2 | 3 | 4 |
| index2 | 5 | 6 | 7 | 8 | 9 |
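To make the "column Series and row Series" idea concrete, here is a short sketch that rebuilds the same 2x5 table and pulls out one column and one row. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(10).reshape(2, 5),
    index=["index1", "index2"],
    columns=["a", "b", "c", "d", "e"],
)

# A column is a Series indexed by the row labels
col_b = df["b"]

# A row is a Series indexed by the column labels
row_1 = df.loc["index1"]

print(col_b.tolist(), row_1.tolist())
```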
Basic flow
1. Read the data with read_csv
2. Analyze basic information about the data
df = pd.read_csv("dataset/tmdb_5000_movies.csv")
# Check the number of rows with len()
len(df)
To display the full output without truncation
# Remove the limit on displayed columns
pd.set_option('display.max_columns', None)
# Remove the limit on displayed rows (note: output becomes heavy for large data)
pd.set_option('display.max_rows', None)
df.describe()
type(df.describe())  # The result of describe() is itself a DataFrame
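The flow above can be tried without the dataset file. This sketch reads a tiny inline CSV through `io.StringIO`. The three rows and their values are invented stand-ins, not taken from tmdb_5000_movies.csv:

```python
import io
import pandas as pd

# Hypothetical three-row CSV standing in for the real dataset file
csv_text = """title,budget,revenue
A,100,250
B,0,40
C,300,0
"""
movies = pd.read_csv(io.StringIO(csv_text))

n_rows = len(movies)            # number of rows
summary = movies.describe()     # statistics for the numeric columns only

# Because the summary is a DataFrame, it can be indexed like one
mean_budget = summary.loc["mean", "budget"]
print(n_rows, type(summary).__name__, mean_budget)
```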
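Selecting a single column returns a Series
df["column_name"]  ○ Recommended
df.column_name  ▲ Not recommended (fails for names containing spaces or names that clash with DataFrame attributes)
Selecting with a list of column names returns a DataFrame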
df[["revenue"]]
# Multiple columns can be selected at once
df[["revenue","original_title","budget"]]
# Retrieve specific rows by integer position
df.iloc[10:13]
# Retrieve specific rows by integer position, then pick one column
df.iloc[10:13]["original_title"]
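A compact sketch of the selection forms above on a made-up four-row frame. The `.loc` line is an addition not in the original notes; it selects rows and a column in one step:

```python
import pandas as pd

# Hypothetical data; only the column names mirror the movie dataset
movies = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "revenue": [10, 20, 30, 40],
    "budget": [1, 2, 3, 4],
})

as_series = movies["revenue"]       # single brackets: Series
as_frame = movies[["revenue"]]      # list of names: DataFrame
titles = movies.iloc[1:3]["original_title"]   # positions 1 and 2 (end excluded)

# .loc slices by label and includes the end label
titles_loc = movies.loc[1:2, "original_title"]
print(titles.tolist(), titles_loc.tolist())
```

Note the difference in slicing: `iloc[1:3]` stops before position 3, while `loc[1:2]` includes label 2.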
Delete rows / columns
drop() # Returns a new DataFrame; the original DataFrame remains unchanged
Pass inplace=True to change the original DataFrame instead
<Delete the specified rows: axis=0 (the default)>
df.drop(5, axis=0)  # axis=0 can be omitted; add inplace=True to modify df itself
<Delete the specified column: axis=1>
df.drop('id', axis=1)  # add inplace=True to modify df itself
df = df.drop(5) # Reassigning the result to the same variable is more common than inplace=True
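The following sketch, on an invented three-row frame, shows that `drop()` leaves the original untouched unless the result is reassigned. The `columns=` keyword is an addition to the notes; it is an equivalent spelling of `axis=1`:

```python
import pandas as pd

# Hypothetical data for illustration
movies = pd.DataFrame({"id": [11, 12, 13], "title": ["A", "B", "C"]})

without_row = movies.drop(1)               # drop the row labeled 1 (axis=0 is the default)
without_col = movies.drop("id", axis=1)    # drop the "id" column
same_result = movies.drop(columns="id")    # equivalent keyword form

# The original frame still has all rows and columns
original_shape = movies.shape

# Reassigning is how the change is usually kept
movies = movies.drop(1)
print(original_shape, movies.shape)
```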
dropna(): delete rows that contain missing values
np.isnan(): check whether values are NaN (missing)
fillna(): fill in missing values
Example: df["runtime"].fillna(df["runtime"].mean())  # fill missing runtimes with the column mean
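A small sketch of the three missing-value helpers listed above, using an invented frame with one missing runtime:

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second runtime is missing
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "runtime": [90.0, np.nan, 110.0],
})

is_missing = np.isnan(movies["runtime"])   # element-wise check, returns a boolean Series
dropped = movies.dropna()                  # rows containing any NaN are removed

# mean() skips NaN, so the fill value is the mean of 90 and 110
filled = movies["runtime"].fillna(movies["runtime"].mean())
print(is_missing.tolist(), len(dropped), filled.tolist())
```

Like `drop()`, both `dropna()` and `fillna()` return new objects, so the result has to be assigned back if it should be kept.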
Filter
How to filter
# Example: select only Japanese-language movies
j_movie = df[df['original_language'] == 'ja']  # This boolean-indexing pattern is the one used most often
Combine multiple conditions with (...) & (...) or (...) | (...)
# Example: select only Japanese-language movies with a rating of 8 or higher
j_movie = df[(df['original_language'] == 'ja') & (df["vote_average"] >= 8 ) ]
df[(df['budget'] == 0) | (df['revenue'] == 0)]
→ Filter: "budget or revenue is 0"
df[~((df['budget'] == 0) | (df['revenue'] == 0))]
→ Filter: "neither budget nor revenue is 0" (NOT operator ~)
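The filters above can be checked on a tiny invented frame. The last line is an addition: it rewrites the `~(... | ...)` filter as an equivalent `&` of two `!=` conditions:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the movie dataset
movies = pd.DataFrame({
    "original_language": ["ja", "en", "ja", "en"],
    "vote_average": [8.5, 7.0, 6.0, 9.0],
    "budget": [0, 100, 50, 0],
    "revenue": [10, 0, 70, 0],
})

# AND: each condition needs its own parentheses
ja_high = movies[(movies["original_language"] == "ja") & (movies["vote_average"] >= 8)]

# OR, then NOT of the same mask
zero_any = (movies["budget"] == 0) | (movies["revenue"] == 0)
both_nonzero = movies[~zero_any]

# Equivalent form: neither value is 0
same_rows = movies[(movies["budget"] != 0) & (movies["revenue"] != 0)]
print(ja_high.index.tolist(), both_nonzero.index.tolist())
```

The parentheses matter because `&` and `|` bind more tightly than `==` in Python.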
Options for the how argument
df1 = pd.DataFrame({'key':["k0","k1","k2"],
'A':["a0","a1","a2"],
'B':["b0","b1","b2"]})
df2 = pd.DataFrame({'key':["k0","k1","k2"],
'C':["c0","c1","c2"],
'D':["d0","d1","d2"]})
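The notes stop after defining the two frames, so what follows is only a sketch under an assumption: that `how` refers to the `how` parameter of `pd.merge`. The `left` and `right` frames below are hypothetical variants of df1 and df2 whose keys overlap only partially, because with identical keys every `how` option gives the same result:

```python
import pandas as pd

# Hypothetical frames with partially overlapping keys
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "A": ["a0", "a1", "a2"]})
right = pd.DataFrame({"key": ["k1", "k2", "k3"], "C": ["c1", "c2", "c3"]})

inner = pd.merge(left, right, on="key", how="inner")       # keys present in both
left_join = pd.merge(left, right, on="key", how="left")    # all keys from left
right_join = pd.merge(left, right, on="key", how="right")  # all keys from right
outer = pd.merge(left, right, on="key", how="outer")       # keys from either side

for name, result in [("inner", inner), ("left", left_join),
                     ("right", right_join), ("outer", outer)]:
    print(name, sorted(result["key"]))
```

Rows without a match on the other side get NaN in the columns that come from that side.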