Getting the top nth values in Pandas | Self-consideration Journey I found.
When you process data with a pandas DataFrame, you can easily get the maximum and minimum values of each column using methods such as max and min. However, at the moment (pandas ver 1.1.2), there is no function to get the second largest and smallest values, the third largest and smallest values, and so on. [...] Therefore, in this article, I will introduce a script that obtains the top n values of each column of a DataFrame in as few lines as possible, as shown below.
In other words, given the following data frame `df`:
import numpy as np
import pandas as pd
np.random.seed(0)
df = pd.DataFrame(np.random.permutation(50).reshape(10, 5))
df
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | 28 | 11 | 10 | 41 | 2 |
| 1 | 27 | 38 | 31 | 22 | 4 |
| 2 | 33 | 35 | 26 | 34 | 18 |
| 3 | 7 | 14 | 45 | 48 | 29 |
| 4 | 15 | 30 | 32 | 16 | 42 |
| 5 | 20 | 43 | 8 | 13 | 25 |
| 6 | 5 | 17 | 40 | 49 | 1 |
| 7 | 12 | 37 | 24 | 6 | 23 |
| 8 | 36 | 21 | 19 | 9 | 39 |
| 9 | 46 | 3 | 0 | 47 | 44 |
The claim is that pandas has no function or method that extracts the top (or bottom) 3 values of each column and returns the following data frame.
| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 1 | 46 | 43 | 45 | 49 | 44 |
| 2 | 36 | 38 | 40 | 48 | 42 |
| 3 | 33 | 37 | 32 | 47 | 39 |
In the data frame `df` given in this example, all columns have the same data type, but since real data often has a different data type for each column, consider using `df.apply()` to process one column at a time.
In other words, given the following series `s`,
| | 0 |
|---|---|
| 0 | 28 |
| 1 | 27 |
| 2 | 33 |
| 3 | 7 |
| 4 | 15 |
| 5 | 20 |
| 6 | 5 |
| 7 | 12 |
| 8 | 36 |
| 9 | 46 |
We will create a function that returns the following series.
| | 0 |
|---|---|
| 1 | 46 |
| 2 | 36 |
| 3 | 33 |
The quoted article presents the following function. Its optional arguments are as follows.
- `topnum`: Number of items to get. The default is `3`.
- `getmin`: If set to `True`, the values are taken in ascending order. The default is descending order.
- `getindex`: If set to `True`, the index is returned instead of the values.
def getmax(series, topnum=3, getmin=False, getindex=False):
    if getindex is False:
        series = (series.sort_values(ascending=getmin).head(topnum)
                  .reset_index(drop=True))
        series.index += 1
        return series
    else:
        return series.sort_values(ascending=getmin).head(topnum).index
This method first [sorts the entire series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_values.html), then takes the top entries, then [resets the index](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.reset_index.html) and adds 1 so that the index starts from 1.
But **this is not a good way to do it**.
To get the top n items from a series in pandas, the [pd.Series.nlargest()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nlargest.html) method (and the [pd.Series.nsmallest()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.nsmallest.html) method) is the optimal solution.
%timeit df[0].sort_values(ascending=False).head(3)
%timeit df[0].nlargest(3)
| | 0 |
|---|---|
| 9 | 46 |
| 8 | 36 |
| 2 | 33 |
299 µs ± 9.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
153 µs ± 1.58 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
It is also about twice as fast (153 µs versus 299 µs in the measurement above). You could also consider using NumPy for speed, but that makes the code a little less concise.
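As a rough illustration of the NumPy route mentioned above, here is a minimal sketch based on `np.partition`. The helper name `top_n_numpy` is hypothetical and not part of the quoted article or of pandas. It returns only the values and discards the index labels, which is part of why it is less convenient than `nlargest()`.

```python
import numpy as np
import pandas as pd

# Column 0 of the example data frame, written out literally
s = pd.Series([28, 27, 33, 7, 15, 20, 5, 12, 36, 46])


def top_n_numpy(values, n=3):
    """Hypothetical helper: return the n largest values in descending order."""
    arr = np.asarray(values)
    n = min(n, arr.size)
    if n == 0:
        return arr[:0]
    # np.partition moves the n largest values into the last n slots (unordered)
    top = np.partition(arr, -n)[-n:]
    # Sort only those n values, then reverse to get descending order
    return np.sort(top)[::-1]


result = top_n_numpy(s)
print(result)
```

Whether this is actually faster than `nlargest()` depends on the data size, so it is worth timing on your own data before adopting it.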
The way the serial index starting from 1 is built is also quite roundabout: the index is first reset so that it starts from 0 and then 1 is added, even though a serial index starting from 1 can be set directly. You might think of using `np.arange()` or `range()`, but use [pd.RangeIndex()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.RangeIndex.html).
test_s = pd._testing.makeStringSeries(10000)
%timeit s2 = test_s.reset_index(drop=True); s2.index += 1
%timeit s2 = test_s.set_axis(range(1, len(test_s)+1))
%timeit s2 = test_s.set_axis(np.arange(1, len(test_s)+1))
%timeit s2 = test_s.set_axis(pd.RangeIndex(1, len(test_s)+1))
109 µs ± 2.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
40.2 µs ± 1.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
64.7 µs ± 1.49 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
39.8 µs ± 931 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
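To make the relabeling concrete, the following small sketch applies `set_axis()` with a `pd.RangeIndex` to the result of `nlargest()`. The variable names `top` and `ranked` are chosen here for illustration only.

```python
import pandas as pd

# Column 0 of the example data frame, written out literally
s = pd.Series([28, 27, 33, 7, 15, 20, 5, 12, 36, 46])

# nlargest keeps the original row labels (9, 8, 2)
top = s.nlargest(3)

# Replace the labels with 1, 2, 3 in one step; no reset_index and no "+= 1"
ranked = top.set_axis(pd.RangeIndex(1, len(top) + 1))

print(top.index.tolist())
print(ranked)
```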
Putting these points together, the function can be rewritten like this:
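def getmax_rev(series, topnum=3, getmin=False, getindex=False):
    # nsmallest / nlargest avoid sorting the whole series
    out = series.nsmallest(topnum) if getmin else series.nlargest(topnum)
    # len(out) rather than topnum, so a series with fewer than topnum values still works
    return out.index if getindex else out.set_axis(pd.RangeIndex(1, len(out) + 1))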
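As a usage sketch, the function can be passed to `df.apply()` to reproduce the top-3 table shown at the beginning. The function is restated here so that the snippet runs on its own (taking the index length from the result is an assumption of this sketch), and the data frame is written out literally instead of being generated from the random seed.

```python
import pandas as pd


def getmax_rev(series, topnum=3, getmin=False, getindex=False):
    out = series.nsmallest(topnum) if getmin else series.nlargest(topnum)
    # Assumption of this sketch: label with len(out) so short series also work
    return out.index if getindex else out.set_axis(pd.RangeIndex(1, len(out) + 1))


# The example data frame from the beginning of the article
df = pd.DataFrame([
    [28, 11, 10, 41, 2],
    [27, 38, 31, 22, 4],
    [33, 35, 26, 34, 18],
    [7, 14, 45, 48, 29],
    [15, 30, 32, 16, 42],
    [20, 43, 8, 13, 25],
    [5, 17, 40, 49, 1],
    [12, 37, 24, 6, 23],
    [36, 21, 19, 9, 39],
    [46, 3, 0, 47, 44],
])

top3 = df.apply(getmax_rev)                  # top 3 values of each column
bottom3 = df.apply(getmax_rev, getmin=True)  # bottom 3 values of each column
top3_labels = getmax_rev(df[0], getindex=True)  # row labels of the top 3 in column 0

print(top3)
print(bottom3)
print(list(top3_labels))
```

Extra keyword arguments given to `df.apply()` are forwarded to the function, which is why `getmin=True` can be passed directly.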