Past entrance exams for junior high schools, high schools, universities, and so on usually disclose the average score of examinees, the average score of successful applicants, and the competition ratio (number of examinees / number of successful applicants), but the minimum passing score is often not disclosed. Here, we consider predicting the minimum passing score.
I used the entrance examination data (2013-2019) for M junior high school published on the following site: Gorogoro junior high school entrance exam, https://www.goro-goro.net/2019-musashi
For each year's data, the following operations were performed, writing $\mu_a$ for the average score of examinees, $\mu_p$ for the average score of successful applicants, and $r$ for the competition ratio. (Each operation is explained in detail in the next section.)
After that, the predicted values of the minimum passing score were plotted against the actual values, and the accuracy was compared with a simple baseline (*). (*) The prediction that minimum passing score = (average score of examinees + average score of successful applicants) / 2.
**1. Fix a value of $\sigma$ and assume that each examinee's score follows a normal distribution with mean $\mu_a$ and standard deviation $\sigma$.**
This assumption should be reasonable when the number of examinees is large enough and no examinee scores full marks or zero.
Of course, the true score distribution is discrete, but we treat it here as a continuous distribution.
The initial value of $\sigma$ can be chosen arbitrarily; it is updated in step 4 below to approach the optimal value.
If you read the graph of this distribution as the probability distribution of a single examinee's score, the $y$ axis is a probability density (integrating it gives a probability); if you read it as the score distribution of all examinees, the $y$ axis is probability density × number of examinees (integrating it gives a number of people).
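As a concrete illustration, here is a minimal sketch of these two readings of the $y$ axis, using the 2013 figures ($\mu_a = 138.5$, 433 examinees) and a placeholder $\sigma = 25$ (the actual $\sigma$ is found iteratively below):

```python
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt

mu_a, sigma, n = 138.5, 25.0, 433  # 2013 figures; sigma is a placeholder guess
x = np.linspace(0, 320, 500)
pdf = norm.pdf(x, mu_a, sigma)  # density of one examinee's score

fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(10, 4))
ax0.plot(x, pdf)
ax0.set_title("Density (integrates to 1)")
ax1.plot(x, n * pdf)
ax1.set_title("Density x examinees (integrates to 433)")
fig.tight_layout()
plt.show()
```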
**2. In this normal distribution, find the $b$ such that the area of the region $x > b$ is $\frac{1}{r}$; this $b$ is the provisional minimum passing score.**
Expressed as a formula, this is equivalent to finding the $b$ that satisfies

\int_{-\infty}^{b} \frac{1}{\sqrt{2\pi{\sigma}^2}}\exp{\left(-\frac{(x-\mu_a)^2}{2\sigma^2}\right)} \mathrm{d}x = 1-\frac{1}{r}

The left side is the cumulative distribution function of this normal distribution evaluated at $b$.
This equation cannot be solved analytically, so I solved it numerically with Python (the code is given later).
Intuitively, the vertical line $x = b$ is the cutoff that separates successful from unsuccessful examinees in the score distribution.
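With scipy, this step reduces to the inverse survival function, which the full code below also uses; a minimal sketch with the 2013 figures and a placeholder $\sigma$:

```python
from scipy.stats import norm

r = 433 / 177              # 2013 competition ratio
mu_a, sigma = 138.5, 25.0  # sigma is a placeholder guess
b = norm.isf(1 / r, mu_a, sigma)  # the x whose upper-tail area is 1/r
print(b)  # provisional minimum passing score
```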
**3. Compute the mean of the $x > b$ part of the distribution; this is the provisional average score of successful applicants ${\mu_p}^{\prime}$.**
This is equivalent to finding the mean (expected value) of a truncated normal distribution:
{\mu_p}^{\prime} = \mu_a + \sigma\frac{\phi\left(\frac{b - \mu_a}{\sigma}\right)}{1-\Phi\left(\frac{b - \mu_a}{\sigma}\right)}
(Here $\phi(\cdot)$ is the probability density function of the standard normal distribution and $\Phi(\cdot)$ is its cumulative distribution function.)
Reference: https://bellcurve.jp/statistics/blog/18075.html
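As a sanity check, the formula can be compared against scipy's built-in truncated normal; the numbers here ($b = 160$, $\sigma = 25$) are placeholders for illustration:

```python
import numpy as np
from scipy.stats import norm, truncnorm

mu_a, sigma, b = 138.5, 25.0, 160.0  # placeholder values
alpha = (b - mu_a) / sigma

# mean of the normal distribution truncated to x > b, via the formula above
mu_p_prime = mu_a + sigma * norm.pdf(alpha) / (1 - norm.cdf(alpha))

# the same quantity from scipy's truncated normal
check = truncnorm.mean(alpha, np.inf, loc=mu_a, scale=sigma)

print(mu_p_prime, check)  # the two values should agree
```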
**4. Update the standard deviation $\sigma$ according to whether ${\mu_p}^{\prime}$ is smaller or larger than $\mu_p$, and return to step 2. Repeat.**
Since $b = \mu_a + \sigma z$ for a fixed standardized quantile $z$ determined by $r$, the truncated mean ${\mu_p}^{\prime}$ increases monotonically with $\sigma$: if ${\mu_p}^{\prime} < \mu_p$, the current $\sigma$ is too small, otherwise it is too large, so a bisection search converges to the optimal $\sigma$.
The above "prediction using the normal distribution" was carried out, and its accuracy was compared with the baseline prediction "minimum passing score = (average score of examinees + average score of successful applicants) / 2".
import numpy as np
from scipy.stats import norm
import matplotlib.pyplot as plt
from sklearn.metrics import r2_score, mean_absolute_error
# 2013-2019 data from https://www.goro-goro.net/2019-musashi
jukensya = [433, 556, 519, 590, 577, 541, 569]
goukakusya = [177, 177, 185, 183, 187, 185, 186]
juken_heikin = [138.5, 173.3, 172.9, 167.0, 165.5, 186.9, 170.5]
goukaku_heikin = [166.5, 210.7, 210.3, 202.0, 197.0, 221.5, 205.2]
goukaku_saitei = [146, 192, 188, 184, 180, 201, 185]
goukaku_saitei_pred = []
for i in range(7):  # analyze the 7 exams from 2013 to 2019
    r = jukensya[i] / goukakusya[i]  # competition ratio (examinees / successful applicants)
    mu_a = juken_heikin[i]  # average score of examinees
    mu_p = goukaku_heikin[i]  # average score of successful applicants
    sigma_l = 0.1
    sigma_r = 1000
    sigma = (sigma_l + sigma_r) / 2  # standard deviation; search in the range 0.1 to 1000
    for _ in range(100):  # repeat the bisection 100 times
        b = norm.isf(1 / r, mu_a, sigma)  # provisional minimum passing score
        mu_p_prime = mu_a + sigma * norm.pdf((b - mu_a) / sigma) \
            / (1 - norm.cdf((b - mu_a) / sigma))  # provisional average score of successful applicants
        if mu_p_prime < mu_p:
            sigma_l = sigma
        else:
            sigma_r = sigma
        sigma = (sigma_l + sigma_r) / 2
    goukaku_saitei_pred.append(b)
# baseline: predict the minimum passing score as the midpoint of the examinee average and the successful-applicant average
goukaku_saitei_pred_rough = [(goukaku_heikin[i] + juken_heikin[i]) / 2 for i in range(7)]
## Check R^2 and MAE ##
print("Pred 1:Results of prediction using normal distribution")
print("R^2: {:.4f}".format(r2_score(goukaku_saitei, goukaku_saitei_pred)))
print("MAE: {:.4f}".format(mean_absolute_error(goukaku_saitei, goukaku_saitei_pred)))
print("")
print("Pred 2: (Candidate average score+Average score of successful applicants) /The result of the prediction that it is 2")
print("R^2: {:.4f}".format(r2_score(goukaku_saitei, goukaku_saitei_pred_rough)))
print("MAE: {:.4f}".format(mean_absolute_error(goukaku_saitei, goukaku_saitei_pred_rough)))
## Plot of measured vs. predicted values ##
fig = plt.figure(figsize=(10, 5))
lim = [140, 220] #Range of both axes
s = 17 #font size
# Pred 1: prediction using the normal distribution
ax0 = fig.add_subplot(1,2,1)
ax0.plot(goukaku_saitei, goukaku_saitei_pred, "o", markersize=8)
ax0.plot(lim, lim, "k-")
ax0.set_title('Pred 1', fontsize=s)
ax0.set_xlabel('True', fontsize=s)
ax0.set_ylabel('Predicted', fontsize=s)
ax0.tick_params(labelsize=s)
ax0.set_xlim(lim)
ax0.set_ylim(lim)
ax0.set_aspect('equal')
# Pred 2: prediction that (examinee average + successful-applicant average) / 2 is the minimum passing score
ax1 = fig.add_subplot(1,2,2)
ax1.plot(goukaku_saitei, goukaku_saitei_pred_rough, "o", markersize=8)
ax1.plot(lim, lim, "k-")
ax1.set_title('Pred 2', fontsize=s)
ax1.set_xlabel('True', fontsize=s)
ax1.set_ylabel('Predicted', fontsize=s)
ax1.tick_params(labelsize=s)
ax1.set_xlim(lim)
ax1.set_ylim(lim)
ax1.set_aspect('equal')
#Save as PNG
fig.tight_layout()
fig.savefig("plot.png ")
Output (the R^2 value is the coefficient of determination, so the closer it is to 1, the better the accuracy; MAE is the mean of the absolute prediction errors, so the closer it is to 0, the better):
Pred 1: prediction using the normal distribution
R^2: 0.9892
MAE: 1.4670
Pred 2: prediction that (examinee average + successful-applicant average) / 2 is the minimum passing score
R^2: 0.9583
MAE: 2.5571
It can be said that Pred 1, the prediction using the normal distribution, is more accurate than the simple prediction Pred 2. Its mean prediction error is about 1.5 points, which seems accurate enough to serve as a guide for actual study. Looking at the Pred 2 plot, the prediction error is large for the data point near 145 in the lower left. This is probably because Pred 2 takes the midpoint of the examinee average and the successful-applicant average regardless of the competition ratio, so it cannot respond to changes in the ratio: in that year only, the ratio was as low as about 2.4, versus 2.8 to 3.2 in the other years. Pred 1, by contrast, keeps the error for that point at or below the level of the other data, so the method using the normal distribution can be said to cope with changes in the competition ratio.
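The competition ratios behind this observation can be recomputed directly from the data above:

```python
# competition ratio (examinees / successful applicants) for 2013-2019
print([round(j / g, 2) for j, g in zip(jukensya, goukakusya)])
# [2.45, 3.14, 2.81, 3.22, 3.09, 2.92, 3.06]
```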
In Pred 1, many data points deviate in the same direction: the predicted value is about 1 to 2 points below the actual value. Given sufficient data, accuracy might therefore improve further by estimating the constant behind this "about 1 to 2 points" and adding it to the predicted values as a correction.
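As a rough sketch of this idea (not part of the original analysis): estimate the offset as the mean residual and add it back. Note that with only seven data points this is evaluated in-sample, so the resulting MAE is optimistic; with more data the offset should be estimated on held-out years.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# mean residual of Pred 1 (actual minus predicted), used as a constant correction
offset = np.mean(np.array(goukaku_saitei) - np.array(goukaku_saitei_pred))
corrected = np.array(goukaku_saitei_pred) + offset
print("offset: {:.2f}".format(offset))
print("MAE after correction: {:.4f}".format(mean_absolute_error(goukaku_saitei, corrected)))
```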