With pandas, you can perform data operations much like SQL. This article records the correspondence between SQL data manipulation and the equivalent pandas operations, listed roughly in descending order of how often I use them.
limit -> head Used to extract the first n rows.
select *
from table
limit 10
↓pandas
table.head(10)
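A minimal runnable sketch of the correspondence above, using a small hypothetical DataFrame (the name `table` and the column names are illustrative):

```python
import pandas as pd

# Hypothetical sample data
table = pd.DataFrame({'col1': list('abcde'), 'col2': range(5)})

# SELECT * FROM table LIMIT 3
top3 = table.head(3)
```

`head` with no argument returns the first 5 rows by default.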
order by -> sort_values Sorting
select *
from table
order by col1
↓pandas
table.sort_values(by='col1')
Both order by and sort_values sort in ascending order by default. In SQL you append desc to the column; in pandas you pass ascending=False (True means ascending).
table.sort_values(by='col1', ascending=False)
You can also sort by multiple columns:
select *
from table
order by col1, col2 desc
↓pandas
table.sort_values(by=['col1', 'col2'], ascending=[True, False])
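A runnable sketch of the multi-column sort, with illustrative sample data:

```python
import pandas as pd

table = pd.DataFrame({'col1': [2, 1, 1], 'col2': [5, 3, 9]})

# ORDER BY col1, col2 DESC
result = table.sort_values(by=['col1', 'col2'], ascending=[True, False])
```

Each entry in the `ascending` list pairs positionally with the column at the same index in `by`.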
where -> []
select *
from table
where col1 = 'a'
↓pandas
table[table.col1 == 'a']
in clause
select *
from table
where col1 in ('a', 'b')
↓pandas
table[table.col1.isin(['a', 'b'])]
Multiple conditions
select *
from table
where col1 = 'a' and col2 = 1
↓pandas
table[(table.col1 == 'a')&(table.col2 == 1)]
Enclose each condition in parentheses and connect them with &. For or instead of and, use |:
table[(table.col1 == 'a')|(table.col2 == 1)]
These condition expressions can also be used in the delete and update operations described below.
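The filtering patterns above, collected into one runnable sketch (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'a'], 'col2': [1, 1, 2]})

and_rows = table[(table.col1 == 'a') & (table.col2 == 1)]  # WHERE col1 = 'a' AND col2 = 1
or_rows = table[(table.col1 == 'a') | (table.col2 == 1)]   # WHERE col1 = 'a' OR col2 = 1
in_rows = table[table.col1.isin(['a', 'b'])]               # WHERE col1 IN ('a', 'b')
```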
Aggregate functions -> agg
select
sum(col1),
avg(col1),
max(col1),
min(col2),
count(col2),
count(distinct col2)
from table
↓pandas
table.agg({'col1': ['sum', 'mean', 'max'], 'col2': ['min', 'count', 'nunique']})
nunique corresponds to count(distinct).
distinct itself looks like this:
select
distinct col1
from table
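A runnable sketch of agg with a dict of function lists (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['x', 'x', 'y']})

# sum/avg/max over col1, count and count(distinct) over col2
result = table.agg({'col1': ['sum', 'mean', 'max'],
                    'col2': ['count', 'nunique']})
```

The result is a DataFrame whose index is the function names; cells for functions not requested on a column are NaN.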
↓pandas
table.agg({'col1': 'unique'})
In SQL, aggregate functions and distinct can't be combined freely in one query, but in pandas they can:
table.agg({'col1': ['sum', 'mean', 'max', 'min', 'count', 'nunique', 'unique']})
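As a runnable sketch, the most direct pandas equivalent of distinct is `Series.unique` on the column itself (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'a']})

# SELECT DISTINCT col1 FROM table
result = table.col1.unique()
```

`unique` returns the distinct values in order of first appearance.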
group by -> groupby
select
col1,
sum(col2)
from table
group by col1
↓pandas
table.groupby('col1').agg({'col2': 'sum'})
The column specified in groupby becomes the index of the resulting data frame. Passing as_index=False keeps it as a regular column, which is easier to work with:
table.groupby('col1', as_index=False).agg({'col2': 'sum'})
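A runnable sketch of the grouped sum (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 10]})

# SELECT col1, sum(col2) FROM table GROUP BY col1
result = table.groupby('col1', as_index=False).agg({'col2': 'sum'})
```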
When grouping by multiple columns:
select
col1,
col2,
sum(col3)
from table
group by col1, col2
↓pandas
table.groupby(['col1', 'col2'], as_index=False).agg({'col3': 'sum'})
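A runnable sketch of grouping by multiple columns (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a', 'b'],
                      'col2': ['x', 'x', 'y'],
                      'col3': [1, 2, 10]})

# GROUP BY col1, col2
result = table.groupby(['col1', 'col2'], as_index=False).agg({'col3': 'sum'})
```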
join -> merge
select
*
from table t
left join table2 t2
on table.col1 = table2.col1
↓pandas
pd.merge(table, table2, on='col1', how='left')
how also accepts inner, outer, and right.
When using multiple columns with on
select
*
from table t
left join table2 t2
on table.col1 = table2.col1
and table.col2 = table2.col2
↓pandas
pd.merge(table, table2, on=['col1', 'col2'], how='left')
Specify a list for on.
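A runnable sketch of a left join on two key columns; the data and the extra `val`/`other` columns are illustrative:

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2], 'val': [10, 20]})
table2 = pd.DataFrame({'col1': ['a', 'b'], 'col2': [1, 9], 'other': [100, 200]})

# LEFT JOIN on col1 AND col2; unmatched left rows get NaN for right-side columns
result = pd.merge(table, table2, on=['col1', 'col2'], how='left')
```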
update -> loc[]
update table
set col1 = 'b'
where col2 = 1
↓pandas
table.loc[table.col2 == 1, 'col1'] = 'b'
Specify the where condition as the first argument of .loc and the column to update as the second.
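A runnable sketch of the update (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a'], 'col2': [1, 2]})

# UPDATE table SET col1 = 'b' WHERE col2 = 1
table.loc[table.col2 == 1, 'col1'] = 'b'
```

Unlike the other examples, .loc assignment modifies the data frame in place.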
delete -> []
delete
from table
where col1 = 'a'
↓pandas
table = table[table.col1 != 'a']
Only the rows that do not match the condition are kept, which effectively deletes the matching rows.
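A runnable sketch of the delete (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'a']})

# DELETE FROM table WHERE col1 = 'a'
table = table[table.col1 != 'a']
```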
as -> rename
select
col1 as new_col1
from table
↓pandas
table.rename(columns={'col1': 'new_col1'})
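A runnable sketch of the rename (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2]})

# SELECT col1 AS new_col1 FROM table
result = table.rename(columns={'col1': 'new_col1'})
```

rename returns a new data frame; the original `table` keeps its column name unless you reassign.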
union -> concat
select
*
from table
union
select
*
from table2
↓pandas
pd.concat([table, table2], ignore_index=True)
Since the first argument of concat is a list, you can union three or more data frames at once. If ignore_index is not specified, each data frame keeps its original index, so index values may be duplicated after the UNION; it is usually better to specify it. Strictly speaking, concat corresponds to UNION ALL; to mimic UNION's deduplication, chain .drop_duplicates() onto the result.
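A runnable sketch of the union (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': [1, 2]})
table2 = pd.DataFrame({'col1': [3]})

# UNION ALL with a fresh 0..n-1 index
result = pd.concat([table, table2], ignore_index=True)
```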
insert -> append
insert into
table(
col1,
col2
)
values(
'a',
'2'
)
↓pandas
table = table.append({'col1': 'a', 'col2': 2}, ignore_index=True)
ignore_index works the same as for union. Note that DataFrame.append was removed in pandas 2.0; use pd.concat([table, pd.DataFrame([{'col1': 'a', 'col2': 2}])], ignore_index=True) instead.
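A runnable sketch of the insert that works on current pandas versions, using concat instead of the removed append (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['x'], 'col2': [1]})

# INSERT INTO table(col1, col2) VALUES('a', 2)
new_row = pd.DataFrame([{'col1': 'a', 'col2': 2}])
table = pd.concat([table, new_row], ignore_index=True)
```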
case when -> apply
select
case when col1 = 'a' then 1
when col1 = 'b' then 2
else 3 end as case_col1
from table
↓pandas
def case_when(x):
if x.col1 == 'a':
return 1
elif x.col1 == 'b':
return 2
else:
return 3
table.apply(case_when, axis=1)
axis=1 makes apply pass the data row by row; with axis=0, case_when would receive column data in x instead. Printing x inside the function makes this easy to see. x is a pandas Series.
It is also possible to add the column inside the function and return the whole row. The method below produces the same result as table['case_col1'] = table.apply(case_when, axis=1) using the function above.
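A runnable sketch of the case when translation (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'b', 'c']})

def case_when(x):
    # x is one row (a Series) because apply is called with axis=1
    if x.col1 == 'a':
        return 1
    elif x.col1 == 'b':
        return 2
    else:
        return 3

result = table.apply(case_when, axis=1)
```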
select
*,
case when col1 = 'a' then 1
when col1 = 'b' then 2
else 3 end as case_col1
from table
↓pandas
def case_when(x):
if x.col1 == 'a':
ret = 1
elif x.col1 == 'b':
ret = 2
else:
ret = 3
x['case_col1'] = ret
return x
table.apply(case_when, axis=1)
group by case when -> groupby.apply
select
col1,
sum(case when col2 < 10 then col3 else 0 end) as sum_under10_col3,
sum(case when col2 >= 10 then col3 else 0 end) as sum_over10_col3
from table
group by col1
↓pandas
def case_when(df):
sum_under10_col3 = df[df.col2 < 10].col3.sum()
sum_over10_col3 = df[df.col2 >= 10].col3.sum()
return pd.Series({'sum_under10_col3': sum_under10_col3, 'sum_over10_col3': sum_over10_col3})
table.groupby('col1').apply(case_when).reset_index()
It's fairly involved, but it achieves the same thing. Each group's data frame is passed to case_when, which aggregates it and returns the result as a Series. reset_index() moves col1 out of the index. With groupby(...).apply, as_index=False does not automatically add col1 back as a column, so you would otherwise have to include col1 in the Series returned by case_when. That's a hassle, so reset_index() handles it.
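A runnable sketch of the grouped conditional sums (sample data is illustrative):

```python
import pandas as pd

table = pd.DataFrame({'col1': ['a', 'a', 'b'],
                      'col2': [5, 15, 20],
                      'col3': [1, 2, 4]})

def case_when(df):
    # df is the sub-DataFrame for one col1 group
    sum_under10_col3 = df[df.col2 < 10].col3.sum()
    sum_over10_col3 = df[df.col2 >= 10].col3.sum()
    return pd.Series({'sum_under10_col3': sum_under10_col3,
                      'sum_over10_col3': sum_over10_col3})

result = table.groupby('col1').apply(case_when).reset_index()
```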