Continuing from Last time, this time I would like to summarize the processes that are often used when referencing and editing data.
First, run the code below to create a sample dataframe.
import pandas as pd
df = pd.DataFrame({
'name' : ['userA', 'userB', 'userC', 'userD', 'userE'],
'age' : [20, 25, 41, 33, 22],
'sex' : ['male', 'female', 'female', 'male', 'male'],
'price' : 10000
})
name | age | sex | price | |
---|---|---|---|---|
0 | userA | 20 | male | 10000 |
1 | userB | 25 | female | 10000 |
2 | userC | 41 | female | 10000 |
3 | userD | 33 | male | 10000 |
4 | userE | 22 | male | 10000 |
If you specify by column name (column name), the data of the corresponding column will be You can get it in Series format.
df['name']
name | |
---|---|
0 | userA |
1 | userB |
2 | userC |
3 | userD |
4 | userE |
If you specify in the list of column names (column names), the data of the corresponding column (plural) will be displayed. It can be obtained in the form of DataFrame.
target_columns_list = ['name', 'age']
df[target_columns_list]
name | age | |
---|---|---|
0 | userA | 20 |
1 | userB | 25 |
2 | userC | 41 |
3 | userD | 33 |
4 | userE | 22 |
If you specify a new column name and perform an assign operation, the column will be added.
df['new'] = df['age'] * 10
name | age | sex | price | new | |
---|---|---|---|---|---|
0 | userA | 20 | male | 10000 | 200 |
1 | userB | 25 | female | 10000 | 250 |
2 | userC | 41 | female | 10000 | 410 |
3 | userD | 33 | male | 10000 | 330 |
4 | userE | 22 | male | 10000 | 220 |
If you want to get only specific row data, specify the index in the loc method. You can get the data of the specified row in the Series format.
df.loc[0]
name | age | sex | price | |
---|---|---|---|---|
0 | userA | 20 | male | 10000 |
If you want to get multiple rows of data, specify a list of indexes in the loc method. You can get all the data in the specified row in the DataFrame format.
target_index_list = [0, 1, 2]
df.loc[target_index_list]
name | age | sex | price | |
---|---|---|---|---|
0 | userA | 20 | male | 10000 |
1 | userB | 25 | female | 10000 |
2 | userC | 41 | female | 10000 |
It's not a very frequent operation, but if you specify the index and column name in the loc method, You can get only the specified element.
df.loc[1, 'sex']
# female
#If you assign to the element obtained above, it will be updated.
df.loc[1, 'sex'] = 'updated'
name | age | sex | price | |
---|---|---|---|---|
0 | userA | 20 | male | 10000 |
1 | userB | 25 | updated | 10000 |
2 | userC | 41 | female | 10000 |
3 | userD | 33 | male | 10000 |
4 | userE | 22 | male | 10000 |
#For example, when extracting rows with age 25 or more
df[(df['age'] >= 25)]
#At the beginning~If you add, it becomes NOT
df[~(df['age'] >= 25)]
#When extracting rows with AND
df[(df['age'] >= 25) & (df['sex'] == 'female')]
#When extracting rows with OR
df[(df['age'] < 25) | (df['age'] > 40)]
The following is an example of extracting rows with AND.
name | age | sex | price | |
---|---|---|---|---|
1 | userB | 25 | female | 10000 |
2 | userC | 41 | female | 10000 |
#This is convenient when you want to add the result of calculating statistics for each category to the original data frame.
#The following calculates the average age for each gender
#Other statistics can be calculated by changing the argument specified for transform.
df['average_age_by_sex'] = df.groupby('sex')['age'].transform('mean')
#If you just want to see the average on jupyter notebook etc., the following is OK
# df.groupby('sex')['age'].mean()
name | age | sex | price | new | average_age_by_sex | |
---|---|---|---|---|---|---|
0 | userA | 20 | male | 10000 | 200 | 25 |
1 | userB | 25 | female | 10000 | 250 | 33 |
2 | userC | 41 | female | 10000 | 410 | 33 |
3 | userD | 33 | male | 10000 | 330 | 25 |
4 | userE | 22 | male | 10000 | 220 | 25 |
With iterrows, the index and the data in that row are in Series format, row by row. You can get it at. However, if the number of lines in the data frame increases, the processing will slow down, so I think that the data frame is about 100 lines, and it is limited to use when you want to skip a little.
for index, row in df.iterrows():
print(index)
print(type(row))
In the case like the title, and when you want to use the if statement, you will want to process line by line using iterrows, but in most cases it is better to stop from the viewpoint of processing speed. is. In that case, you can speed up the process by using the apply method as shown below.
df['age_boundary'] = df['age'].apply(lambda x: '25 years and over' if x >=25 else 'Under 25 years old')
#Even if you define a function and pass it to the apply method, the result is the same
# def get_age_boundary(age):
# if age >= 25:
# return '25 years and over'
# else:
# return 'Under 25 years old'
# df['age_boundary'] = df['age'].apply(get_age_boundary)
name | age | sex | price | age_boundary | |
---|---|---|---|---|---|
0 | userA | 20 | male | 10000 | Under 25 years old |
1 | userB | 25 | female | 10000 | 25 years and over |
2 | userC | 41 | female | 10000 | 25 years and over |
3 | userD | 33 | male | 10000 | 25 years and over |
4 | userE | 22 | male | 10000 | Under 25 years old |
In some cases, you may need to refer to multiple columns to create a new column. In that case, you can refer to the value of each column for each row by specifying the argument axis = 1 of the apply method.
def get_price_with_discount_rate(row):
age = row['age']
price = row['price']
discount_rate = 1.0
if age >= 40:
discount_rate = 0.5
return int(price * discount_rate)
df['price_with_discount_rate'] = df.apply(get_price_with_discount_rate, axis=1)
name | age | sex | price | price_with_discount_rate | |
---|---|---|---|---|---|
0 | userA | 20 | male | 10000 | 10000 |
1 | userB | 25 | female | 10000 | 10000 |
2 | userC | 41 | female | 10000 | 5000 |
3 | userD | 33 | male | 10000 | 10000 |
4 | userE | 22 | male | 10000 | 10000 |
In the above case of creating a new column by referring to multiple columns, we took the method of specifying the argument axis = 1 of the apply method, but this method has a problem in terms of processing speed. As an alternative to the above, I haven't used it in practice, but you might want to consider using Numpy's vectorize method as well.
import numpy as np
def get_price_with_discount_rate(age, price):
discount_rate = 1.0
if age >= 40:
discount_rate = 0.5
return int(price * discount_rate)
vectorized_func=np.vectorize(get_price_with_discount_rate)
df['price_with_discount_rate'] = vectorized_func(df['age'], df['price'])
name | age | sex | price | price_with_discount_rate | |
---|---|---|---|---|---|
0 | userA | 20 | male | 10000 | 10000 |
1 | userB | 25 | female | 10000 | 10000 |
2 | userC | 41 | female | 10000 | 5000 |
3 | userD | 33 | male | 10000 | 10000 |
4 | userE | 22 | male | 10000 | 10000 |
How to convert between Pandas Series, NumPy ndarray, and Python standard list I often forget it, so I will summarize it here.
import numpy as np
# Series → ndarray
df['name'].values
# ndarray → list
df['name'].values.tolist()
# list → Series
pd.Series([1, 2, 3, 4, 5])
# list → ndarray
np.array([1, 2, 3, 4, 5])
In writing this article, I referred to the following sites. http://sinhrks.hatenablog.com/entry/2015/07/11/223124 https://qiita.com/3x8tacorice/items/3cc5399e18a7e3f9db86 https://note.nkmk.me/python-pandas-numpy-conversion/
Recommended Posts