These quantities could be computed immediately with existing library functions, but here we implement them from scratch in order to study how they work.
A: 0, 3, 3, 5, 5, 5, 5, 7, 7, 10
B: 0, 1, 2, 3, 5, 5, 7, 8, 9, 10
C: 3, 4, 4, 5, 5, 5, 5, 6, 6, 7
Calculate the mean difference and the Gini coefficient for the data above.
Here, the mean difference and the Gini coefficient are defined by the following formulas, respectively.
\frac{\sum_i \sum_j |x_i - x_j|}{n^2}
\frac{\sum_i \sum_j |x_i - x_j|}{2n^2 \bar{x}}
Therefore, I wrote the program as follows.
import numpy as np
A = np.array([0,3,3,5,5,5,5,7,7,10])
B = np.array([0,1,2,3,5,5,7,8,9,10])
C = np.array([3,4,4,5,5,5,5,6,6,7])
# Mean difference
def ave_diff(x):
    n = len(x)**2
    result = [np.abs(x[i] - x[j])/n for i in range(len(x)) for j in range(len(x))]
    return sum(result)
"""
print(ave_diff(A))
print(ave_diff(B))
print(ave_diff(C))
2.76
3.7599999999999976
1.2000000000000008
"""
# Gini coefficient
def get_gini(x):
    n = len(x)**2
    x_bar = x.mean()
    result = [np.abs(x[i] - x[j])/(2*n*x_bar) for i in range(len(x)) for j in range(len(x))]
    return sum(result)
"""
print(get_gini(A))
print(get_gini(B))
print(get_gini(C))
0.2760000000000002
0.3760000000000002
0.12000000000000008
"""
When p_i = f_i / n, the quantity

H(p_1, p_2, \dots, p_n) = -\sum_i p_i \log p_i

is called entropy. The larger H is, the more uniform the distribution; the smaller H is, the more concentrated it is.
Example: 100 students were asked where they came from, both this year and 10 years ago, with the following results. Compare the two distributions of place of origin in terms of concentration.
Area | A | B | C | D | E | Total
---|---|---|---|---|---|---
This year | 32 | 19 | 10 | 24 | 15 | 100 |
10 years ago | 28 | 13 | 18 | 29 | 12 | 100 |
import numpy as np
a=np.array([32, 19, 10, 24, 15])
b=np.array([28,13,18,29,12])
def entropy(x):
    n = sum(x)
    H = [x[i]/n * np.log10(x[i]/n) for i in range(len(x))]
    # Equivalent loop form:
    # for i in range(len(x)):
    #     p = x[i]/n
    #     H.append(p*np.log10(p))
    return -sum(H)
"""
print(entropy(a))
print(entropy(b))
0.667724435887455
0.6704368955892825
"""
Calculate the standard score and the deviation score for data B.
Standard score (standardization):

z_i = \frac{x_i - \bar{x}}{S_x}

So
def standard_score(x):
    x_bar = x.mean()
    s = np.sqrt(x.var())
    z = [(x[i] - x_bar)/s for i in range(len(x))]
    return z
"""
standard_score(B)
[-1.5214515486254614,
-1.217161238900369,
-0.9128709291752768,
-0.6085806194501845,
0.0,
0.0,
0.6085806194501845,
0.9128709291752768,
1.217161238900369,
1.5214515486254614]
"""
For the deviation score,

T_i = 10z_i + 50

so I changed the above function slightly:
def dev_val(x):
    x_bar = x.mean()
    s = np.sqrt(x.var())
    T = [(x[i] - x_bar)/s*10 + 50 for i in range(len(x))]
    return T
"""
dev_val(B)
[34.78548451374539,
37.82838761099631,
40.87129070824723,
43.91419380549816,
50.0,
50.0,
56.08580619450184,
59.12870929175277,
62.17161238900369,
65.21451548625461]
"""
These are the deviation scores for data B.
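The deviation score is just the affine transform T_i = 10z_i + 50 of the standard score, so it too can be written in one vectorized line (a sketch; `dev_val_vec` is an illustrative name):

```python
import numpy as np

B = np.array([0, 1, 2, 3, 5, 5, 7, 8, 9, 10])

def dev_val_vec(x):
    # T_i = 10 * z_i + 50, applied elementwise
    return (x - x.mean()) / x.std() * 10 + 50

print(dev_val_vec(B))
```

By construction the deviation scores have mean 50 and standard deviation 10.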