Processing summary 2 often done in Pandas (data reference, editing operation)

Continuing from Last time, this time I would like to summarize the processes that are often used when referencing and editing data.

Pandas version 0.25.1 is used for operation verification.

First, run the code below to create a sample dataframe.

import pandas as pd

df = pd.DataFrame({
    'name' : ['userA', 'userB', 'userC', 'userD', 'userE'],
    'age' : [20, 25, 41, 33, 22],
    'sex' : ['male', 'female', 'female', 'male', 'male'],
    'price' : 10000
})

	name	age	sex	price
0	userA	20	male	10000
1	userB	25	female	10000
2	userC	41	female	10000
3	userD	33	male	10000
4	userE	22	male	10000

To get a specific column

If you specify by column name (column name), the data of the corresponding column will be You can get it in Series format.

df['name']

	name
0	userA
1	userB
2	userC
3	userD
4	userE

When retrieving multiple columns

If you specify in the list of column names (column names), the data of the corresponding column (plural) will be displayed. It can be obtained in the form of DataFrame.

target_columns_list = ['name', 'age']
df[target_columns_list]

	name	age
0	userA	20
1	userB	25
2	userC	41
3	userD	33
4	userE	22

When adding a new column

If you specify a new column name and perform an assign operation, the column will be added.

If you specify an existing column name, that column will be overwritten.

df['new'] = df['age'] * 10

	name	age	sex	price	new
0	userA	20	male	10000	200
1	userB	25	female	10000	250
2	userC	41	female	10000	410
3	userD	33	male	10000	330
4	userE	22	male	10000	220

When retrieving only specific row data

If you want to get only specific row data, specify the index in the loc method. You can get the data of the specified row in the Series format.

df.loc[0]

	name	age	sex	price
0	userA	20	male	10000

When retrieving multiple lines of data

If you want to get multiple rows of data, specify a list of indexes in the loc method. You can get all the data in the specified row in the DataFrame format.

target_index_list = [0, 1, 2]
df.loc[target_index_list]

	name	age	sex	price
0	userA	20	male	10000
1	userB	25	female	10000
2	userC	41	female	10000

When retrieving data by specifying rows and columns

It's not a very frequent operation, but if you specify the index and column name in the loc method, You can get only the specified element.


df.loc[1, 'sex']
# female

#If you assign to the element obtained above, it will be updated.
df.loc[1, 'sex'] = 'updated'

	name	age	sex	price
0	userA	20	male	10000
1	userB	25	updated	10000
2	userC	41	female	10000
3	userD	33	male	10000
4	userE	22	male	10000

When extracting rows by specifying conditions

#For example, when extracting rows with age 25 or more
df[(df['age'] >= 25)]

#At the beginning~If you add, it becomes NOT
df[~(df['age'] >= 25)]

#When extracting rows with AND
df[(df['age'] >= 25) & (df['sex'] == 'female')]

#When extracting rows with OR
df[(df['age'] < 25) | (df['age'] > 40)]

The following is an example of extracting rows with AND.

	name	age	sex	price
1	userB	25	female	10000
2	userC	41	female	10000

When calculating statistics with groupby

#This is convenient when you want to add the result of calculating statistics for each category to the original data frame.
#The following calculates the average age for each gender
#Other statistics can be calculated by changing the argument specified for transform.
df['average_age_by_sex'] = df.groupby('sex')['age'].transform('mean')

#If you just want to see the average on jupyter notebook etc., the following is OK
# df.groupby('sex')['age'].mean()

	name	age	sex	price	new	average_age_by_sex
0	userA	20	male	10000	200	25
1	userB	25	female	10000	250	33
2	userC	41	female	10000	410	33
3	userD	33	male	10000	330	25
4	userE	22	male	10000	220	25

When implementing the simplest loop processing in Pandas

With iterrows, the index and the data in that row are in Series format, row by row. You can get it at. However, if the number of lines in the data frame increases, the processing will slow down, so I think that the data frame is about 100 lines, and it is limited to use when you want to skip a little.

for index, row in df.iterrows():
    print(index)
    print(type(row))

When creating a new column by referring to a specific column

In the case like the title, and when you want to use the if statement, you will want to process line by line using iterrows, but in most cases it is better to stop from the viewpoint of processing speed. is. In that case, you can speed up the process by using the apply method as shown below.

df['age_boundary'] = df['age'].apply(lambda x: '25 years and over' if x >=25 else 'Under 25 years old')

#Even if you define a function and pass it to the apply method, the result is the same
# def get_age_boundary(age):
#     if age >= 25:
#         return '25 years and over'
#     else:
#         return 'Under 25 years old'

# df['age_boundary'] = df['age'].apply(get_age_boundary)

	name	age	sex	price	age_boundary
0	userA	20	male	10000	Under 25 years old
1	userB	25	female	10000	25 years and over
2	userC	41	female	10000	25 years and over
3	userD	33	male	10000	25 years and over
4	userE	22	male	10000	Under 25 years old

When creating a new column by referring to multiple columns

In some cases, you may need to refer to multiple columns to create a new column. In that case, you can refer to the value of each column for each row by specifying the argument axis = 1 of the apply method.

This writing method should be avoided as much as possible in terms of processing speed.

def get_price_with_discount_rate(row):

    age = row['age']
    price = row['price']
    
    discount_rate = 1.0
    
    if age >= 40:
        discount_rate = 0.5
            
    return int(price * discount_rate)

df['price_with_discount_rate'] = df.apply(get_price_with_discount_rate, axis=1)

	name	age	sex	price	price_with_discount_rate
0	userA	20	male	10000	10000
1	userB	25	female	10000	10000
2	userC	41	female	10000	5000
3	userD	33	male	10000	10000
4	userE	22	male	10000	10000

When using np.vectorize

In the above case of creating a new column by referring to multiple columns, we took the method of specifying the argument axis = 1 of the apply method, but this method has a problem in terms of processing speed. As an alternative to the above, I haven't used it in practice, but you might want to consider using Numpy's vectorize method as well.

import numpy as np

def get_price_with_discount_rate(age, price):
        
    discount_rate = 1.0
    
    if age >= 40:
        discount_rate = 0.5
            
    return int(price * discount_rate)

vectorized_func=np.vectorize(get_price_with_discount_rate)
df['price_with_discount_rate'] = vectorized_func(df['age'], df['price'])

	name	age	sex	price	price_with_discount_rate
0	userA	20	male	10000	10000
1	userB	25	female	10000	10000
2	userC	41	female	10000	5000
3	userD	33	male	10000	10000
4	userE	22	male	10000	10000

Mutual conversion of Series, ndarray, list

How to convert between Pandas Series, NumPy ndarray, and Python standard list I often forget it, so I will summarize it here.

Other patterns can be realized in combination.


import numpy as np

# Series → ndarray
df['name'].values

# ndarray → list
df['name'].values.tolist()

# list → Series
pd.Series([1, 2, 3, 4, 5])

# list → ndarray
np.array([1, 2, 3, 4, 5])

In writing this article, I referred to the following sites. http://sinhrks.hatenablog.com/entry/2015/07/11/223124 https://qiita.com/3x8tacorice/items/3cc5399e18a7e3f9db86 https://note.nkmk.me/python-pandas-numpy-conversion/