In this article I summarize how to use pandas.
Many people have already written summaries of pandas, so there may be nothing new here, but I would appreciate it if you read along.
My previous article summarized how to use numpy, so please have a look if you are interested.
I tried to summarize python numpy
You can generate a Series as follows. A Series is a one-dimensional array with an index attached.
import numpy as np
import pandas as pd
series = pd.Series(data=[1, 2, 3, 4, 5], index=['A', 'B', 'C', 'D', 'E'])
A    1
B    2
C    3
D    4
E    5
dtype: int64
It can also be generated in combination with numpy.
series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
A    0
B    1
C    2
D    3
E    4
dtype: int64
You can retrieve data from a Series by specifying an index label, much like looking up a key in a dictionary.
series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
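The lookup itself does not appear in the listing above. As a minimal sketch, assuming the label of interest is 'C' (chosen here only for illustration), label-based access reads like a dictionary lookup:

```python
import numpy as np
import pandas as pd

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])

# Look up a single value by its index label, like a dictionary key.
value_c = series['C']
print(value_c)

# Membership tests check the index labels, again like a dictionary.
print('C' in series)
```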
You can also use slices.
series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
A    0
B    1
C    2
D    3
dtype: int64
With an ordinary Python slice you would expect data only up to C, the element before D, but a label slice on a Series includes the end label.
When retrieving data by index label like this, it is customary to use loc.
series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
A    0
B    1
C    2
D    3
dtype: int64
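To make the difference between the two slicing conventions concrete, here is a small sketch. It assumes the slice in question is 'A':'D', which is consistent with the output above, and it uses iloc (positional access) only for comparison:

```python
import numpy as np
import pandas as pd

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])

# Label-based slice: both endpoints are included.
by_label = series.loc['A':'D']

# Position-based slice: the stop position is excluded, as in plain Python.
by_position = series.iloc[0:3]

print(list(by_label.index))
print(list(by_position.index))
```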
Instead of a slice, you can also pass a list of labels to pick out specific entries.
series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
print(series.loc[['A', 'D']])
A    0
D    3
dtype: int64
Instead of the Series index labels, you can also retrieve data by integer position, counted from 0, using iloc.
series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])
A    0
dtype: int64
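The retrieval call is not shown above. The printed result is a one-element Series, which is what iloc returns when given a list of positions, so the sketch below assumes that form; a bare integer returns a scalar instead:

```python
import numpy as np
import pandas as pd

series = pd.Series(data=np.arange(5), index=['A', 'B', 'C', 'D', 'E'])

# A list of positions returns a Series (labels are kept).
first_as_series = series.iloc[[0]]

# A single integer position returns the scalar value itself.
first_as_scalar = series.iloc[0]

print(first_as_series)
print(first_as_scalar)
```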
You can generate a DataFrame by doing the following.
df = pd.DataFrame(data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]], index=['A', 'B', 'C'], columns=['A1', 'A2', 'A3'])
   A1  A2  A3
A   1   2   3
B   4   5   6
C   7   8   9
In this way, a DataFrame is two-dimensional data with index and columns specified.
When used in machine learning, each index entry (row) usually represents one sample, and the columns represent the features of the data.
Reading files is a very common task, since pandas is mostly used on data loaded from files.
Here we load a small CSV file, train.csv, that I created for this example.
df = pd.read_csv('train.csv')
The following is the execution result.
The output was hard to read when copied and pasted, so I took a screenshot.
I wanted takash and kenta to be the index, but by default they are not treated as an index.
When reading data that has an index column, you must specify it with index_col. In this example the leftmost column is the index, so set index_col=0.
df = pd.read_csv('train.csv', index_col=0)
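The effect of index_col can be checked without a file on disk. The sketch below feeds read_csv an in-memory stand-in for train.csv; the scores and the header name of the first column are made up for illustration, and only the row and column labels follow the output shown later in this article:

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv; the scores are made up.
csv_text = """name,math,Engrish,society
takash,80,70,60
kenta,65,90,75
yoko,90,85,95
"""

# Default: a fresh integer index is created, names stay an ordinary column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the leftmost column becomes the index.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.index.tolist())
print(df_indexed.index.tolist())
```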
Also, by default the first line is treated as the header. If the first line of your file is not a header, specify header=None.
df = pd.read_csv('train.csv', header=None)
Let's check the shape of the data. As with numpy, the shape attribute stores the dimensions.
df = pd.read_csv('train.csv', index_col=0)
(3, 3)
You can check the statistics of the data using the describe method.
df = pd.read_csv('train.csv', index_col=0)
In this way, you can get the count, mean, standard deviation, minimum, maximum, and quartiles for each column.
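As a self-contained sketch of shape and describe, the frame below is built in memory instead of reading train.csv. The scores are made up so that the statistics are easy to verify by hand; only the labels follow this article's data:

```python
import pandas as pd

# Made-up scores; only the labels mirror the article's train.csv.
df = pd.DataFrame({'math': [60, 70, 80],
                   'Engrish': [50, 60, 70],
                   'society': [90, 80, 70]},
                  index=['takash', 'kenta', 'yoko'])

print(df.shape)          # (number of rows, number of columns)

stats = df.describe()    # one column of statistics per numeric column
print(stats.index.tolist())
print(stats.loc['mean', 'math'], stats.loc['std', 'math'])
```

Note that std here is the sample standard deviation (divided by n - 1), which is the pandas default.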
You can check the data type and the number of non-null values of each column with the info method.
df = pd.read_csv('train.csv', index_col=0)
print(df.info())
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, takash to yoko
Data columns (total 3 columns):
math       3 non-null int64
Engrish    3 non-null int64
society    3 non-null int64
dtypes: int64(3)
memory usage: 96.0+ bytes
None
You can check the data like this; no further explanation should be needed.
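The trailing None in the output above is what you get when the call is wrapped in print: info writes its report itself and returns None. A minimal sketch with made-up scores, capturing the report in a string buffer so it can be inspected:

```python
import io
import pandas as pd

# Made-up scores for illustration.
df = pd.DataFrame({'math': [60, 70, 80], 'Engrish': [50, 60, 70]},
                  index=['takash', 'kenta', 'yoko'])

buffer = io.StringIO()
result = df.info(buf=buffer)   # the report is written to the buffer
report = buffer.getvalue()

print(result)                  # info() itself returns None
print(report)
```

In practice, calling df.info() on its own line avoids the extra None.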
If you use the nunique method, you can check the number of unique values in each column.
df = pd.read_csv('train.csv', index_col=0)
print(df.nunique())
math       3
Engrish    3
society    3
dtype: int64
Since there are no duplicate values this time, every column shows 3.
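To see nunique report something other than the row count, here is a small sketch with made-up data in which one column contains a repeated value:

```python
import pandas as pd

# Made-up scores: 'math' contains the value 60 twice.
df = pd.DataFrame({'math': [60, 60, 80],
                   'Engrish': [50, 60, 70]})

unique_counts = df.nunique()   # number of distinct values per column
print(unique_counts)
```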
The index attribute stores the row labels and the columns attribute stores the column names. Let's check.
df = pd.read_csv('train.csv', index_col=0)
Index(['takash', 'kenta', 'yoko'], dtype='object')
Index(['math', 'Engrish', 'society'], dtype='object')
You can see the location of the missing values in each column with the code below.
df = pd.read_csv('train.csv', index_col=0)
         math  Engrish  society
takash  False    False    False
kenta   False    False    False
yoko    False    False    False
Since none of the values is missing, False is returned everywhere.
Now let's count the missing values in each column with the following code.
df = pd.read_csv('train.csv', index_col=0)
math       0
Engrish    0
society    0
dtype: int64
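The calls that produce these two outputs are not shown above. A minimal sketch, assuming they are isnull() and isnull().sum(), on made-up data that actually contains missing values:

```python
import numpy as np
import pandas as pd

# Made-up scores with some values missing.
df = pd.DataFrame({'math': [60, np.nan, 80],
                   'Engrish': [np.nan, np.nan, 70],
                   'society': [90, 80, 70]})

mask = df.isnull()               # True where a value is missing
per_column = mask.sum()          # True counts as 1, so this counts NaN per column
total = int(per_column.sum())    # missing values in the whole frame

print(per_column)
print(total)
```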
Now let's extract the data from the DataFrame. For the time being, I generated the following DataFrame.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
np.random.rand generates uniform random numbers between 0 and 1.
Calling np.random.seed(0) beforehand fixes the random numbers that np.random.rand generates. However, since I rerun the code each time without fixing the seed, the random numbers in this article change from one example to the next.
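As a quick check of what fixing the seed means, the sketch below resets the seed to the same value twice and confirms that np.random.rand then produces identical numbers:

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(3)

np.random.seed(0)             # reset to the same seed
second = np.random.rand(3)    # the same sequence is generated again

print(first)
print(np.array_equal(first, second))
```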
Let's select and extract a column with the code below.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
A    0.165899
B    0.144862
C    0.974517
D    0.144633
E    0.806085
Name: A1, dtype: float64
In this way, we were able to extract the columns.
You can use loc to specify an index label and extract a row.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
A1    0.687867
A2    0.243104
A3    0.568371
A4    0.125892
A5    0.749777
Name: A, dtype: float64
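The selection calls are not shown in the two listings above. Judging from the Name: fields in the outputs, they are probably df['A1'] and df.loc['A']; that is an assumption. The sketch uses consecutive integers instead of random numbers so the results are predictable:

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame: values 0..24.
df = pd.DataFrame(data=np.arange(25).reshape(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])

column_a1 = df['A1']     # select a column by name; the result is a Series
row_a = df.loc['A']      # select a row by index label; also a Series

print(column_a1.tolist())
print(row_a.tolist())
```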
I wrote that loc extracts data by specifying the index, but indexing a DataFrame with loc is quite similar to indexing a two-dimensional numpy array.
You can specify loc[row, column].
Let's see how to use it below.
print(df.loc[:, 'A1'])
A    0.108650
B    0.819086
C    0.250341
D    0.950634
E    0.852035
Name: A1, dtype: float64
Since : is specified for the rows, all rows are selected, and since A1 is specified for the columns, column A1 is extracted.
print(df.loc['C', ['A2', 'A4']])
A2    0.129296
A4    0.367573
Name: C, dtype: float64
In this way, you can extract the values of columns A2 and A4 in row C.
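The loc[row, column] form accepts single labels, lists of labels, and label slices on either axis. A short sketch on a deterministic frame (integers instead of random numbers, so the values can be checked by eye):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.arange(25).reshape(5, 5),
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])

cell = df.loc['C', 'A2']              # single row, single column: a scalar
pair = df.loc['C', ['A2', 'A4']]      # single row, list of columns: a Series
block = df.loc['B':'D', 'A1':'A3']    # label slices include both endpoints

print(cell)
print(pair.tolist())
print(block.shape)
```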
Let's extract data from the DataFrame by condition. First, let's check the behavior of df > 0.5 with the following code.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
print(df > 0.5)
In this way, True is stored where the value in the DataFrame satisfies the condition, and False where it does not.
Using this, you can mask the values that do not meet the condition, as shown below; they are replaced with NaN.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
print(df > 0.5)
print(df[df > 0.5])
You can also extract only the rows in which a specific column satisfies the condition, as follows.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
print(df[df['A3'] > 0.5])
You can also add conditions using &, as shown below.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
print(df[(df['A3'] > 0.2) & (df['A3'] < 0.6)])
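Because the random values above change on every run, here is a deterministic sketch of combining conditions. The column values are made up; besides &, it also shows | (or) and ~ (not), which work the same way:

```python
import pandas as pd

# Made-up values so the selected rows are predictable.
df = pd.DataFrame({'A3': [0.1, 0.3, 0.5, 0.7, 0.9]},
                  index=['A', 'B', 'C', 'D', 'E'])

# Each comparison needs its own parentheses, because & and | bind
# more tightly than > and <.
both = df[(df['A3'] > 0.2) & (df['A3'] < 0.6)]      # AND
either = df[(df['A3'] < 0.2) | (df['A3'] > 0.8)]    # OR
negated = df[~(df['A3'] > 0.5)]                     # NOT

print(both.index.tolist())
print(either.index.tolist())
print(negated.index.tolist())
```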
You can add columns by doing the following.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
df['new'] = np.arange(5)
You can delete a column by specifying the column name.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
df = df.drop(columns=['A1', 'A3'])
You can delete a row by specifying its index label.
df = pd.DataFrame(data=np.random.rand(5, 5),
index=['A', 'B', 'C', 'D', 'E'],
columns=('A1', 'A2', 'A3', 'A4', 'A5'))
df = df.drop(index=['A', 'D'])
Let's prepare the data as follows.
df = pd.DataFrame([[1, 2, 3, np.nan, 5],
[np.nan, 7, 8, 9, 10],
[11, np.nan, 13, 14, 15],
[16, 17, np.nan, 19, 20],
[21, 22, 23, 24, np.nan]],
index=['A', 'B', 'C', 'D', 'E'],
columns=['A1', 'A2', 'A3', 'A4', 'A5'])
You can use the dropna method to drop the rows that contain missing values.
df = df.dropna()
Empty DataFrame
Columns: [A1, A2, A3, A4, A5]
Index: []
This time every row has a missing value, so all of them disappeared. If you apply such a strict rule, very little data may remain.
You can remove only the rows that have a missing value in a particular column by doing the following:
df = pd.DataFrame([[1, 2, 3, np.nan, 5],
[np.nan, 7, 8, 9, 10],
[11, np.nan, 13, 14, 15],
[16, 17, np.nan, 19, 20],
[21, 22, 23, 24, np.nan]],
index=['A', 'B', 'C', 'D', 'E'],
columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df[df['A3'].isnull() == False]
isnull returns True where the data is NaN and False where it is not. Keeping only the rows where it is False therefore deletes just the rows that have a missing value in A3, as shown above.
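The same filtering can also be written with notnull or with the subset argument of dropna. The sketch below, using the same data as above, checks that the two forms give the same result; which one to prefer is a matter of taste:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, 17, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])

# Keep rows whose A3 value is present, using a boolean mask.
by_mask = df[df['A3'].notnull()]

# The same thing expressed through dropna: only A3 is inspected.
by_dropna = df.dropna(subset=['A3'])

print(by_dropna.index.tolist())
print(by_mask.equals(by_dropna))
```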
Arguments of dropna
With the thresh argument, only the rows that have at least the specified number of non-missing values are kept, and all other rows are deleted.
For example, thresh=4 deletes the rows that have fewer than 4 non-missing values.
df = pd.DataFrame([[1, 2, 3, np.nan, 5],
[np.nan, 7, 8, 9, 10],
[11, np.nan, 13, 14, 15],
[16, np.nan, np.nan, 19, 20],
[21, 22, 23, 24, np.nan]],
index=['A', 'B', 'C', 'D', 'E'],
columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df.dropna(thresh=4)
You can do the same for columns by setting axis=1.
df = pd.DataFrame([[1, 2, 3, np.nan, 5],
[np.nan, 7, 8, 9, 10],
[11, np.nan, 13, 14, 15],
[16, np.nan, np.nan, 19, 20],
[21, 22, 23, 24, np.nan]],
index=['A', 'B', 'C', 'D', 'E'],
columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df.dropna(thresh=4, axis=1)
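To see which rows and columns survive thresh=4 on this data, the sketch below first counts the non-missing values and then applies dropna along each axis:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, np.nan, 5],
                   [np.nan, 7, 8, 9, 10],
                   [11, np.nan, 13, 14, 15],
                   [16, np.nan, np.nan, 19, 20],
                   [21, 22, 23, 24, np.nan]],
                  index=['A', 'B', 'C', 'D', 'E'],
                  columns=['A1', 'A2', 'A3', 'A4', 'A5'])

# Non-missing values per row and per column.
print(df.notnull().sum(axis=1).tolist())   # rows A..E
print(df.notnull().sum(axis=0).tolist())   # columns A1..A5

rows_kept = df.dropna(thresh=4)            # row D has only 3 values
cols_kept = df.dropna(thresh=4, axis=1)    # column A2 has only 3 values

print(rows_kept.index.tolist())
print(cols_kept.columns.tolist())
```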
For a particular column, you can fill its missing values with the mean of that column:
df = pd.DataFrame([[1, 2, 3, np.nan, 5],
[np.nan, 7, 8, 9, 10],
[11, np.nan, 13, 14, 15],
[16, np.nan, np.nan, 19, 20],
[21, 22, 23, 24, np.nan]],
index=['A', 'B', 'C', 'D', 'E'],
columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df['A3'] = df['A3'].fillna(df['A3'].mean())
You can fill the missing values in every column with that column's mean by doing the following.
df = pd.DataFrame([[1, 2, 3, np.nan, 5],
[np.nan, 7, 8, 9, 10],
[11, np.nan, 13, 14, 15],
[16, np.nan, np.nan, 19, 20],
[21, 22, 23, 24, np.nan]],
index=['A', 'B', 'C', 'D', 'E'],
columns=['A1', 'A2', 'A3', 'A4', 'A5'])
df = df.fillna(df.mean())
Let's create the following DataFrame.
df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
'A2': [1, 2, 3, 4, 5, 6, 7],
'A3': [8, 9, 10, 11, 12, 13, 14]})
Let's check the categories and the number of rows in each with the code below.
B    3
A    2
C    1
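The counting call is not shown above. The output is what value_counts produces, so the sketch below assumes that method. Note that missing values are left out unless dropna=False is passed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})

counts = df['A1'].value_counts()                      # NaN is excluded by default
counts_with_nan = df['A1'].value_counts(dropna=False) # NaN gets its own entry

print(counts)
print(len(counts_with_nan))
```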
You can retrieve only the data of a specific category with the following code.
print(df[df['A1'] == 'B'])
  A1  A2  A3
2  B   3  10
3  B   4  11
4  B   5  12
You can fill in the missing values in categorical data with the following code. Since mode()[0] returns the most frequent value, the mode is assigned to the missing values.
df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
'A2': [1, 2, 3, 4, 5, 6, 7],
'A3': [8, 9, 10, 11, 12, 13, 14]})
df['A1'] = df['A1'].fillna(df['A1'].mode()[0])
Let's calculate the percentage of categorical data with the code below.
df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
'A2': [1, 2, 3, 4, 5, 6, 7],
'A3': [8, 9, 10, 11, 12, 13, 14]})
print(round(df['A1'].value_counts() / len(df), 3))
B    0.429
A    0.286
C    0.143
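The proportions above are relative to all 7 rows, including the row whose category is missing. As a point of comparison, value_counts(normalize=True) divides by the number of non-missing values instead; the sketch shows both so the difference is visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})

# Share of all rows (the NaN row is part of the denominator).
share_of_rows = df['A1'].value_counts() / len(df)

# Share of non-missing values only.
share_of_known = df['A1'].value_counts(normalize=True)

print(round(share_of_rows['B'], 3))
print(round(share_of_known['B'], 3))
```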
You can use the code below to group categorical data and calculate statistics.
df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
'A2': [1, 2, 3, 4, 5, 6, 7],
'A3': [8, 9, 10, 11, 12, 13, 14]})
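The grouping call itself is not shown above. A minimal sketch, assuming the statistic of interest is the per-category mean (the choice of mean is an assumption; sum, max and others are called the same way):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A1': ['A', 'A', 'B', 'B', 'B', 'C', np.nan],
                   'A2': [1, 2, 3, 4, 5, 6, 7],
                   'A3': [8, 9, 10, 11, 12, 13, 14]})

# Group rows by the category in A1, then average the other columns.
# Rows whose key is NaN are left out by default.
grouped_mean = df.groupby('A1').mean()

print(grouped_mean)
```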
You can combine DataFrames with pd.concat. By default axis=0, so they are combined vertically.
df1 = pd.DataFrame(data=np.random.rand(3, 3),
index=['A', 'B', 'C'],
columns=['A1', 'A2', 'A3'])
df2 = pd.DataFrame(data=np.random.rand(3, 3),
index=['D', 'E', 'F'],
columns=['A1', 'A2', 'A3'])
df3 = pd.concat([df1, df2])
If you specify axis=1 as shown below, you can combine them horizontally.
The columns need to match when joining vertically, and the index needs to match when joining horizontally.
df1 = pd.DataFrame(data=np.random.rand(3, 3),
index=['A', 'B', 'C'],
columns=['A1', 'A2', 'A3'])
df2 = pd.DataFrame(data=np.random.rand(3, 3),
index=['A', 'B', 'C'],
columns=['A4', 'A5', 'A6'])
df3 = pd.concat([df1, df2], axis=1)
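To illustrate why the labels need to match, the sketch below concatenates two small made-up frames whose columns only partly overlap. With the default outer join the gaps are filled with NaN, while join='inner' keeps only the shared column:

```python
import pandas as pd

# Made-up frames: only column A2 is common to both.
df1 = pd.DataFrame({'A1': [1, 2], 'A2': [3, 4]}, index=['A', 'B'])
df2 = pd.DataFrame({'A2': [5, 6], 'A3': [7, 8]}, index=['C', 'D'])

outer = pd.concat([df1, df2])                 # union of columns, NaN where absent
inner = pd.concat([df1, df2], join='inner')   # intersection of columns

print(outer)
print(inner.columns.tolist())
```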
You can apply a function to specific data by using apply.
df = pd.DataFrame(data=np.random.rand(3, 3),
index=['A', 'B', 'C'],
columns=['A1', 'A2', 'A3'])
df['A1'] = df['A1'].apply(lambda x: x ** 2)
When applying a function that needs values from several columns, it is convenient to define a function that takes one row as its argument and pass it to apply with axis=1.
df = pd.DataFrame(data=np.random.rand(3, 3),
index=['A', 'B', 'C'],
columns=['A1', 'A2', 'A3'])
def matmul(df):
    # With axis=1, df here is a single row (a Series), not the whole frame.
    return df['A1'] * df['A2']
df['A4'] = df.apply(matmul, axis=1)
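For a simple element-wise product like this one, the row-wise apply can be compared with plain column arithmetic. The sketch below uses made-up values and a hypothetical helper name, multiply_row, and checks that both routes agree; the vectorized form is usually faster on large frames:

```python
import pandas as pd

# Made-up values so the products are easy to verify.
df = pd.DataFrame({'A1': [1.0, 2.0, 3.0],
                   'A2': [4.0, 5.0, 6.0]},
                  index=['A', 'B', 'C'])

def multiply_row(row):
    # Called once per row; row is a Series indexed by column name.
    return row['A1'] * row['A2']

by_apply = df.apply(multiply_row, axis=1)
vectorized = df['A1'] * df['A2']     # same result without a Python-level loop

print(by_apply.tolist())
print(by_apply.equals(vectorized))
```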
If you have multiple return values, you can receive them by doing the following.
df = pd.DataFrame(data=np.random.rand(3, 3),
index=['A', 'B', 'C'],
columns=['A1', 'A2', 'A3'])
def square_and_twice(x):
    # Returning a Series makes apply produce one column per returned value.
    return pd.Series([x**2, x*2])
df[['square', 'twice']] = df['A3'].apply(square_and_twice)
That is all for this article.
Thank you for reading.