Even if the ntile function that was often used in SQL during decyl analysis is searched, It didn't come out, so I created a function.
Dedicated to the older brothers who felt "ntile is the best. Why isn't it Python .." in SQL.
Well, as a usage, Have df = pandas dataframe inserted, col = Decyl Enter the column name of the target you want to analyze. n = 10 is set, but if you want to make 100 groups, set n = 100 etc. If n_tile = True, return the dataframe given as columns for the number of decades, If set to False, a dataframe with various aggregate functions will be returned. agg = Aggregate function you want to use when n_tile False above, For those who say "just sum", use agg = ["sum"] as an argument.
def ntile(df, col, n=10, n_tile=True, ascending=False, aggs=["sum", "count", "min", "max", "median", "mean", "std"]):
# check pandas dataframe type
if type(df) == pd.core.frame.DataFrame:
# check int or float
if type(df[f"{col}"].max()) == np.int64 or type(df[f"{col}"].max()) == np.float64:
ntile_df = pd.DataFrame()
length = len(df)
df = df.sort_values(col, ascending=ascending).reset_index(drop=True)
div_n = math.ceil(len(df) / n)
# mark N group
for idx, num in enumerate(range(0, length, div_n)):
n_df = df[num: num+div_n]
n_df["decile"] = idx + 1
ntile_df = ntile_df.append(n_df)
# return ntile_df if need only decile data
if n_tile:
return ntile_df
# otherwise aggregate col
agg_df = ntile_df.groupby("decile").agg({f"{col}": aggs}).reset_index()
# clean columns
agg_df.columns = ["".join(c) for c in agg_df.columns]
# merge two data
ntile_df = ntile_df.merge(agg_df, on="decile").copy()
return ntile_df
raise TypeError(f"`{col}` must be int or float type")
raise TypeError(f"`{type(df)}` must be pandas.core.frame.DataFrame type")
import pandas as pd
import random
#Create a random price guy
price = random.sample(range(10000, 900000), k=10000)
#Create a dataframe using the above
data = pd.DataFrame(price, columns=["price"])
Contents is like this.
#Put data in the function. n=Leave the default of 10
data = ntile(data, "price")
#Group by with decile and apply various aggregate functions.
data.groupby("decile").agg({"price": ["sum", "count", "min", "max", "median", "mean"]}).reset_index()
n=10
n=100
data = ntile(data, "price", n=5 , n_tile=False)
data.pricemin.value_counts().sort_index().reset_index().rename(columns={"index": "price_segment", "pricemin": "count"})
And look at this.
Have you ever done something like the following? And "Isn't there a function that creates a class with n% of all classes in the target column?" Have you ever felt that?
I'm a beginner, so I've thought about the above several times. I looked at ntile every time and corrected it by hand, but from now on, it is possible to change it just by changing the part of n = 10. After all, python has a smaller turn than SQL.
I don't know if the search is insufficient, but it is strange that there is no ntile function in SQL like python, Well, I tried to make it as a practice, so if you have changed jobs from SQLer to pythoner, please use it.
Recommended Posts