With pandas, you can perform data operations much like SQL. This article records the correspondence between SQL data manipulation and the equivalent pandas operations, listed roughly in descending order of how often I use them.
limit -> head Used to extract the first n rows.
select *
from table
limit 10
↓pandas
table.head(10)
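A minimal runnable sketch of the correspondence above, using a small hypothetical DataFrame (the name `table` and the column names are illustrative):

```python
import pandas as pd

# Hypothetical sample data
table = pd.DataFrame({'col1': list('abcde'), 'col2': range(5)})

# SELECT * FROM table LIMIT 3
top3 = table.head(3)
```

`head` with no argument returns the first 5 rows by default.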
order by -> sort_values Sorting
select *
from table
order by col1
↓pandas
table.sort_values(by='col1')
Both order by and sort_values sort in ascending order by default. In SQL you append desc to the column; in pandas you pass ascending=False (True means ascending).
table.sort_values(by='col1', ascending=False)
You can also sort by multiple columns:
select *
from table
order by col1, col2 desc
↓pandas
table.sort_values(by=['col1', 'col2'], ascending=[True, False])
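A runnable sketch of the multi-column sort, with illustrative sample data:

```python
import pandas as pd

table = pd.DataFrame({'col1': [2, 1, 1], 'col2': [5, 3, 9]})

# ORDER BY col1, col2 DESC
result = table.sort_values(by=['col1', 'col2'], ascending=[True, False])
```

Each entry in the `ascending` list pairs positionally with the column at the same index in `by`.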
where -> []
select *
from table
where col1 = 'a'
↓pandas
table[table.col1 == 'a']
in clause
select *
from table
where col1 in ('a', 'b')
↓pandas
table[table.col1.isin(['a', 'b'])]
Multiple conditions
select *
from table
where col1 = 'a' and col2 = 1
↓pandas
table[(table.col1 == 'a')&(table.col2 == 1)]
Enclose each condition in parentheses and connect them with &. For or instead of and, use |:
table[(table.col1 == 'a')|(table.col2 == 1)]
These condition expressions can also be used in the delete and update operations described below.
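The filtering patterns above, collected into one runnable sketch (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'a'], 'col2': [1, 1, 2]})

and_rows = table[(table.col1 == 'a') & (table.col2 == 1)]  # WHERE col1 = 'a' AND col2 = 1
or_rows = table[(table.col1 == 'a') | (table.col2 == 1)]   # WHERE col1 = 'a' OR col2 = 1
in_rows = table[table.col1.isin(['a', 'b'])]               # WHERE col1 IN ('a', 'b')
```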
Aggregate functions -> agg
select
sum(col1),
avg(col1),
max(col1),
min(col2),
count(col2),
count(distinct col2)
from table
↓pandas
table.agg({'col1': ['sum', 'mean', 'max'], 'col2': ['min', 'count', 'nunique']})
nunique corresponds to count(distinct).
distinct itself looks like this:
select
distinct col1
from table
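A runnable sketch of agg with a dict of function lists (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['x', 'x', 'y']})

# sum/avg/max over col1, count and count(distinct) over col2
result = table.agg({'col1': ['sum', 'mean', 'max'],
                    'col2': ['count', 'nunique']})
```

The result is a DataFrame whose index is the function names; cells for functions not requested on a column are NaN.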
↓pandas
table.agg({'col1': 'unique'})
In SQL, aggregate functions and distinct can't be combined freely in one query, but in pandas they can:
table.agg({'col1': ['sum', 'mean', 'max', 'min', 'count', 'nunique', 'unique']})
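As a runnable sketch, the most direct pandas equivalent of distinct is `Series.unique` on the column itself (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'a']})

# SELECT DISTINCT col1 FROM table
result = table.col1.unique()
```

`unique` returns the distinct values in order of first appearance.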
group by -> groupby
select
col1,
sum(col2)
from table
group by col1
↓pandas
table.groupby('col1').agg({'col2': 'sum'})
The column specified in groupby becomes the index of the resulting data frame. Passing as_index=False keeps it as a regular column, which is easier to work with:
table.groupby('col1', as_index=False).agg({'col2': 'sum'})
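A runnable sketch of the grouped sum (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 10]})

# SELECT col1, sum(col2) FROM table GROUP BY col1
result = table.groupby('col1', as_index=False).agg({'col2': 'sum'})
```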
When grouping by multiple columns:
select
col1,
col2,
sum(col3)
from table
group by col1, col2
↓pandas
table.groupby(['col1', 'col2'], as_index=False).agg({'col3': 'sum'})
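A runnable sketch of grouping by multiple columns (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a', 'b'],
                      'col2': ['x', 'x', 'y'],
                      'col3': [1, 2, 10]})

# GROUP BY col1, col2
result = table.groupby(['col1', 'col2'], as_index=False).agg({'col3': 'sum'})
```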
join -> merge
select
*
from table t
left join table2 t2
on table.col1 = table2.col1
↓pandas
pd.merge(table, table2, on='col1', how='left')
how also accepts inner, outer, and right.
When using multiple columns with on
select
*
from table t
left join table2 t2
on table.col1 = table2.col1
and table.col2 = table2.col2
↓pandas
pd.merge(table, table2, on=['col1', 'col2'], how='left')
Specify a list for on.
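A runnable sketch of a left join on two key columns; the data and the extra `val`/`other` columns are illustrative:

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'val': [10, 20]})
table2 = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 9], 'other': [100, 200]})

# LEFT JOIN on col1 AND col2; unmatched left rows get NaN for right-side columns
result = pd.merge(table, table2, on=['col1', 'col2'], how='left')
```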
update -> loc[]
update table
set col1 = 'b'
where col2 = 1
↓pandas
table.loc[table.col2 == 1, 'col1'] = 'b'
Specify the where condition as the first argument of .loc and the column to update as the second.
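A runnable sketch of the update (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a'], 'col2': [1, 2]})

# UPDATE table SET col1 = 'b' WHERE col2 = 1
table.loc[table.col2 == 1, 'col1'] = 'b'
```

Unlike the other examples, .loc assignment modifies the data frame in place.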
delete -> []
delete
from table
where col1 = 'a'
↓pandas
table = table[table.col1 != 'a']
Only the rows that do not match the condition are kept, which effectively deletes the matching rows.
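A runnable sketch of the delete (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'a']})

# DELETE FROM table WHERE col1 = 'a'
table = table[table.col1 != 'a']
```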
as -> rename
select
col1 as new_col1
from table
↓pandas
table.rename(columns={'col1': 'new_col1'})
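A runnable sketch of the rename (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2]})

# SELECT col1 AS new_col1 FROM table
result = table.rename(columns={'col1': 'new_col1'})
```

rename returns a new data frame; the original `table` keeps its column name unless you reassign.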
union -> concat
select
*
from table
union
select
*
from table2
↓pandas
pd.concat([table, table2], ignore_index=True)
Since the first argument of concat is a list, you can union three or more data frames at once. If ignore_index is not specified, each data frame keeps its original index, so index values may be duplicated after the UNION; it is usually better to specify it. Strictly speaking, concat corresponds to UNION ALL; to mimic UNION's deduplication, chain .drop_duplicates() onto the result.
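A runnable sketch of the union (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2]})
table2 = pd.DataFrame({'col1': [3]})

# UNION ALL with a fresh 0..n-1 index
result = pd.concat([table, table2], ignore_index=True)
```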
insert -> append
insert into
table(
col1,
col2
)
values(
'a',
'2'
)
↓pandas
table = table.append({'col1': 'a', 'col2': 2}, ignore_index=True)
ignore_index works the same as for union. Note that DataFrame.append was removed in pandas 2.0; use pd.concat([table, pd.DataFrame([{'col1': 'a', 'col2': 2}])], ignore_index=True) instead.
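A runnable sketch of the insert that works on current pandas versions, using concat instead of the removed append (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['x'], 'col2': [1]})

# INSERT INTO table(col1, col2) VALUES('a', 2)
new_row = pd.DataFrame([{'col1': 'a', 'col2': 2}])
table = pd.concat([table, new_row], ignore_index=True)
```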
case when -> apply
select
case when col1 = 'a' then 1
when col1 = 'b' then 2
else 3 end as case_col1
from table
↓pandas
def case_when(x):
if x.col1 == 'a':
return 1
elif x.col1 == 'b':
return 2
else:
return 3
table.apply(case_when, axis=1)
axis=1 makes apply pass the data row by row; with axis=0, case_when would receive column data in x instead. Printing x inside the function makes this easy to see. x is a pandas Series.
It is also possible to add the column inside the function and return the whole row. The method below produces the same result as table['case_col1'] = table.apply(case_when, axis=1) using the function above.
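A runnable sketch of the case when translation (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'c']})

def case_when(x):
    # x is one row (a Series) because apply is called with axis=1
    if x.col1 == 'a':
        return 1
    elif x.col1 == 'b':
        return 2
    else:
        return 3

result = table.apply(case_when, axis=1)
```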
select
*,
case when col1 = 'a' then 1
when col1 = 'b' then 2
else 3 end as case_col1
from table
↓pandas
def case_when(x):
if x.col1 == 'a':
ret = 1
elif x.col1 == 'b':
ret = 2
else:
ret = 3
x['case_col1'] = ret
return x
table.apply(case_when, axis=1)
group by case when -> groupby.apply
select
col1,
sum(case when col2 < 10 then col3 else 0 end) as sum_under10_col3,
sum(case when col2 >= 10 then col3 else 0 end) as sum_over10_col3
from table
group by col1
↓pandas
def case_when(df):
sum_under10_col3 = df[df.col2 < 10].col3.sum()
sum_over10_col3 = df[df.col2 >= 10].col3.sum()
return pd.Series({'sum_under10_col3': sum_under10_col3, 'sum_over10_col3': sum_over10_col3})
table.groupby('col1').apply(case_when).reset_index()
It's fairly involved, but it achieves the same thing. Each group's data frame is passed to case_when, which aggregates it and returns the result as a Series. reset_index() moves col1 out of the index. With groupby(...).apply, as_index=False does not automatically add col1 back as a column, so you would otherwise have to include col1 in the Series returned by case_when. That's a hassle, so reset_index() handles it.
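A runnable sketch of the grouped conditional sums (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a', 'b'],
                      'col2': [5, 15, 20],
                      'col3': [1, 2, 4]})

def case_when(df):
    # df is the sub-DataFrame for one col1 group
    sum_under10_col3 = df[df.col2 < 10].col3.sum()
    sum_over10_col3 = df[df.col2 >= 10].col3.sum()
    return pd.Series({'sum_under10_col3': sum_under10_col3,
                      'sum_over10_col3': sum_over10_col3})

result = table.groupby('col1').apply(case_when).reset_index()
```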