This note looks at how to split a data frame partway through a calculation and apply a function to each of the pieces. I am writing it down because the pattern seems to come up often.
The example is taken from here: group mtcars by cyl ⇒ fit a regression to each split data.frame ⇒ take the summary of each fit ⇒ extract each R². That is the flow.
library(purrr)
mtcars %>%
split(.$cyl) %>% # from base R
map(~ lm(mpg ~ wt, data = .)) %>%
map(summary) %>%
map_dbl("r.squared")
#> 4 6 8
#> 0.5086326 0.4645102 0.4229655
Concise!
Next, the same thing in Python, written with reference to the answer here.
import pandas as pd
data = pd.read_csv('https://gist.githubusercontent.com/ZeccaLehn/4e06d2575eb9589dbe8c365d61cb056c/raw/64f1660f38ef523b2a1a13be77b002b98665cdfe/mtcars.csv')
data.rename(columns={'Unnamed: 0':'brand'}, inplace=True)
d = dict(tuple(data.groupby("cyl")))  # group by the string "cyl"; with ["cyl"], pandas 2.x yields tuple keys such as (4,)
print(d)
brand mpg cyl disp hp ... qsec vs am gear carb
2 Datsun 710 22.8 4 108.0 93 ... 18.61 1 1 4 1
7 Merc 240D 24.4 4 146.7 62 ... 20.00 1 0 4 2
8 Merc 230 22.8 4 140.8 95 ... 22.90 1 0 4 2
17 Fiat 128 32.4 4 78.7 66 ... 19.47 1 1 4 1
18 Honda Civic 30.4 4 75.7 52 ... 18.52 1 1 4 2
19 Toyota Corolla 33.9 4 71.1 65 ... 19.90 1 1 4 1
20 Toyota Corona 21.5 4 120.1 97 ... 20.01 1 0 3 1
25 Fiat X1-9 27.3 4 79.0 66 ... 18.90 1 1 4 1
26 Porsche 914-2 26.0 4 120.3 91 ... 16.70 0 1 5 2
27 Lotus Europa 30.4 4 95.1 113 ... 16.90 1 1 5 2
31 Volvo 142E 21.4 4 121.0 109 ... 18.60 1 1 4 2
[11 rows x 12 columns]
brand mpg cyl disp hp ... qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 ... 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 ... 17.02 0 1 4 4
3 Hornet 4 Drive 21.4 6 258.0 110 ... 19.44 1 0 3 1
5 Valiant 18.1 6 225.0 105 ... 20.22 1 0 3 1
9 Merc 280 19.2 6 167.6 123 ... 18.30 1 0 4 4
10 Merc 280C 17.8 6 167.6 123 ... 18.90 1 0 4 4
29 Ferrari Dino 19.7 6 145.0 175 ... 15.50 0 1 5 6
[7 rows x 12 columns]
brand mpg cyl disp hp ... qsec vs am gear carb
4 Hornet Sportabout 18.7 8 360.0 175 ... 17.02 0 0 3 2
6 Duster 360 14.3 8 360.0 245 ... 15.84 0 0 3 4
11 Merc 450SE 16.4 8 275.8 180 ... 17.40 0 0 3 3
12 Merc 450SL 17.3 8 275.8 180 ... 17.60 0 0 3 3
13 Merc 450SLC 15.2 8 275.8 180 ... 18.00 0 0 3 3
14 Cadillac Fleetwood 10.4 8 472.0 205 ... 17.98 0 0 3 4
15 Lincoln Continental 10.4 8 460.0 215 ... 17.82 0 0 3 4
16 Chrysler Imperial 14.7 8 440.0 230 ... 17.42 0 0 3 4
21 Dodge Challenger 15.5 8 318.0 150 ... 16.87 0 0 3 2
22 AMC Javelin 15.2 8 304.0 150 ... 17.30 0 0 3 2
23 Camaro Z28 13.3 8 350.0 245 ... 15.41 0 0 3 4
24 Pontiac Firebird 19.2 8 400.0 175 ... 17.05 0 0 3 2
28 Ford Pantera L 15.8 8 351.0 264 ... 14.50 0 1 5 4
30 Maserati Bora 15.0 8 301.0 335 ... 14.60 0 1 5 8
[14 rows x 12 columns]
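Printing the whole dictionary is verbose. As a lighter check, the sketch below lists only the keys and the shape of each group. It uses a small made-up frame rather than the real mtcars values, and the names toy, groups, keys and shapes are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for mtcars (made-up values, not the real dataset).
toy = pd.DataFrame({
    "cyl": [4, 6, 4, 8, 6, 8, 8],
    "mpg": [30.0, 21.0, 28.0, 15.0, 19.0, 14.0, 16.0],
    "wt": [2.0, 2.8, 2.2, 3.6, 3.2, 3.8, 3.5],
})

# Same split as in the article: one DataFrame per distinct cyl value.
groups = dict(tuple(toy.groupby("cyl")))

# Inspect the keys and (rows, columns) instead of printing every frame.
keys = list(groups)
shapes = {key: frame.shape for key, frame in groups.items()}
print(keys)
print(shapes)
```

Each shape still counts the cyl column, because iterating over a groupby returns the full sub-frame for each key.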
This confirms that each key is a unique value of cyl, each value is the DataFrame for that group, and the split DataFrames are stored in the dictionary. Next, loop over the keys, apply the function (lm) to each DataFrame, and store each result in a dictionary (summary). That is the flow.
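import statsmodels.api as sm

def lm(y_train, X_train):
    # Ordinary least squares; add_constant adds the intercept term
    model = sm.OLS(y_train, sm.add_constant(X_train))
    result = model.fit()
    return result

# Split into one DataFrame per cyl value (scalar keys: 4, 6, 8)
d = dict(tuple(data.groupby("cyl")))
summary = {}
for key in d:
    y_train = d[key]["mpg"]
    X_train = d[key]["wt"]
    # Fit mpg ~ wt on this group and keep the fitted result
    summary[key] = lm(y_train, X_train)
    print("#cyl{}:{}".format(key, summary[key].rsquared))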
#cyl4:0.5086325963231395
#cyl6:0.4645101505505491
#cyl8:0.42296553649611224
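For comparison with the purrr chain, the loop can also be written as a single dict comprehension over the groupby object. The sketch below is one possible shape, not the article's method. It swaps statsmodels for the squared Pearson correlation (Series.corr), which equals R² only for a simple regression with an intercept and one predictor. The data are made-up toy values, and r_squared_simple is a hypothetical helper name.

```python
import pandas as pd

# Toy data (made-up values, not the real mtcars) with two cyl groups.
toy = pd.DataFrame({
    "cyl": [4, 4, 4, 6, 6, 6, 6],
    "wt": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 4.0],
    "mpg": [30.0, 28.0, 26.0, 24.0, 22.0, 23.0, 21.0],
})

def r_squared_simple(frame):
    """R^2 of mpg ~ wt for one group, as the squared Pearson correlation.

    Assumption: a single predictor plus an intercept, where R^2 == r^2.
    """
    return frame["mpg"].corr(frame["wt"]) ** 2

# Split by cyl and map the helper over each group, like split() + map().
r_squared = {key: r_squared_simple(frame) for key, frame in toy.groupby("cyl")}

for key, value in r_squared.items():
    print("#cyl{}:{}".format(key, value))
```

For models with more than one predictor this shortcut no longer applies, and the statsmodels fit from the article is the appropriate tool.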
Both are fairly easy to do. This time I felt the R code was better, because purrr's map makes even applying a regression to each group very concise. purrr is deep.