I currently live in Osaka, and the day before yesterday I watched the vote count for the Osaka Metropolis Plan referendum glued to the TV.
Watching the count come in, I thought that **the difference in results between wards was remarkable**, and I felt that data suitable for analysis could be obtained, so I analyzed it in detail.
The scripts and cleansed data used are uploaded to GitHub.
Following the Qiita guidelines, I avoid deep political discussion so as not to go beyond the scope of a technical article, and I list only the facts obtained.
Also, I do not have advanced analytical skills such as causal inference, so **if you have any advice on methods along the lines of "doing this would give deeper insight", I would be very grateful if you could comment!**
The procedure to reach the conclusion will be described in the following chapters, but the following conclusions were obtained.
・**"The higher the percentage of people in their 30s to 50s, the higher the approval rate"**
・**"The smaller the average household size (the fewer family members per household), the higher the approval rate"**
・**"The higher the latitude (the farther north), the higher the approval rate"**
・**"The higher the male ratio, the higher the approval rate"**
I will follow the procedure described in TJO's blog, which is well known in the field of data analysis.
This time, I was interested in the following two points.
(1) The bias between wards with high and low approval rates
(2) The difference from the previous vote

Regarding (1), as shown in the figure below (quoted from the Mainichi Shimbun website), there is a clear geographical bias in the approval rate (it appears to be higher from the city center toward the north). I wanted to reveal the factors (features) behind this bias numerically.
So I will proceed with the analysis as follows.
**Purpose: Extract plausible candidate factors behind high and low approval rates**
**Objective variable: Approval rate**
**Explanatory variables: Features that are candidate factors**

Regarding (2), I would like to find what is behind the wards whose approval rate changed significantly from the previous vote. I will set
**Objective variable: This time's approval rate**
**Explanatory variable: Previous approval rate**
and group the outliers.
As mentioned above, we will extract the indicators that are likely to affect the approval rate, which is the objective variable, as features.
As shown in the NHK tabulation below, the exit poll shows a high approval rate among the working generations in their 30s to 50s, while the approval rate among the elderly aged 60 and over and young people in their 20s is relatively low. I found it somewhat peculiar that the tendency reverses only for those in their 20s, so to check whether this effect matters, I adopted the following two features.
**Feature 1 "Ratio of people aged 60 and over"** (does not take the reversal in the 20s into account)
**Feature 2 "Ratio of people in their 30s to 50s"** (takes the reversal in the 20s into account)

The difference is not as large as for age, but the gender ratio also seems to have an effect, so I added
**Feature 3 "Male ratio"**
as a feature.
As shown in the figure below, the new ward office (main government building) was planned to be concentrated in the center of the city.
I live in a ward other than the one where the main government building would be located, and I was a little worried that my area would be left behind by development. Trusting this feeling (domain knowledge), I added
**Feature 4 "Travel time from the station nearest the main government building to the central station of the old ward"**
as a feature.

There is no clear basis for the following, but I added them as features because they seem likely to affect voting behavior and the data are easy to obtain.
**Feature 5 "Average household size (people per household)"**
**Feature 6 "Average annual income per person"**

I omit the source of the approval rate, the objective variable, because many sources are available. The features were collected from the following sources and cleansed as described.
Use the following 6 features
Features 1 to 3 (age and gender ratios)
Data acquisition: Download the 2020 estimated population by age and gender from the Osaka City website below
https://www.city.osaka.lg.jp/toshikeikaku/page/0000015211.html
Cleansing: For each ward, calculate the ratio of the target age group to the total population and the male ratio, and store them as fields

Feature 4 (travel time)
Data acquisition / cleansing: Using the transit site below, look up the travel time from the central station of the old ward to the station nearest the new ward office at 8:00, 12:00, 15:00, and 18:00 on a Saturday, and take the average.
https://ekitan.com/

Feature 5 (average household size)
Data acquisition: Download the 2020 number of people per household from the Osaka City website below
https://www.city.osaka.lg.jp/toshikeikaku/page/0000068035.html
Cleansing: Store the value for each ward as a field

Feature 6 (average annual income per person)
Data acquisition: Download the household income bracket data from the Osaka City website below
https://www.city.osaka.lg.jp/shimin/cmsfiles/contents/0000180/180789/20.pdf
Cleansing: Calculate the average annual household income by multiplying the average annual income of each bracket by its share and summing, then divide by the average household size obtained for feature 5.
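The cleansing for feature 6 amounts to a weighted average followed by a division. The following is only a minimal sketch of that arithmetic: the bracket incomes, shares, and household size are hypothetical values, not figures from the Osaka City PDF, and `income_per_person` is a name made up for this illustration.

```python
# Hypothetical income brackets for one ward:
# (representative annual household income in units of 10,000 yen, share of households)
brackets = [(100, 0.25), (300, 0.40), (600, 0.25), (1000, 0.10)]
household_size = 2.0  # hypothetical average number of people per household


def income_per_person(brackets, household_size):
    """Weighted average household income divided by the average household size."""
    total_share = sum(share for _, share in brackets)
    household_income = sum(income * share for income, share in brackets) / total_share
    return household_income / household_size


per_person = income_per_person(brackets, household_size)
print(f"Average annual income per person: {per_person:.1f}")
```

Dividing by `total_share` keeps the result correct even if the published shares do not add up to exactly 1 because of rounding.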
As shown in the figure below, the data organized into fields was obtained. This completes the cleansing.
Japanese text causes various problems when handled in Python, so I translated the data into English.
This time, I would like to run a regression with "objective variable: approval rate" and "explanatory variables: the 6 features from the previous chapter", following the steps below.
4-1) Visualization of the entire data
4-2) Feature selection (exclusion of highly correlated features)
4-3) Creating a regression model
4-4) Performance evaluation
4-5) Model improvement
I use a tool I created previously to draw the scatter plots and calculate the correlation coefficients.
#%% Load data
import pandas as pd
from custom_pair_plot import CustomPairPlot  # The author's own pair-plot tool
# Fields to use
KEY_VALUE = 'ward_before'  # Key column
OBJECTIVE_VARIALBLE = 'approval_rate'  # Objective variable
EXPLANATORY_VALIABLES = ['1_over60', '2_between_30to60', '3_male_ratio', '4_required_time', '5_household_member', '6_income']  # Explanatory variables
# Visualize the correlations with pairanalyzer
df = pd.read_csv('./osaka_metropolis_english.csv')
use_cols = [OBJECTIVE_VARIALBLE] + EXPLANATORY_VALIABLES
gp = CustomPairPlot()
gp.pairanalyzer(df[use_cols])
The following findings can be read from the plot.
(1) The correlation coefficient with the approval rate is higher for "2. 30s-50s ratio", which takes the reversal in the 20s into account, than for "1. Ratio aged 60 and over", which does not (as expected).
(2) The male ratio has a low correlation coefficient with the approval rate (although it may be affected by an outlier, Nishinari Ward...).
(3) The correlation coefficient between the explanatory variables "2. 30s-50s ratio" and "6. Average annual income" is as high as 0.89.

Without going into the derivation, this time I use gradient boosting (XGBoost), which is a standard method for regression analysis these days. This method is said to be relatively robust against multicollinearity (the problem that arises when highly correlated explanatory variables are used together), but according to this site, using highly correlated variables as they are still seems to have an impact that cannot be ignored.
Therefore, this time I excluded explanatory variables that are highly correlated with other explanatory variables. VIF = 10 (corresponding to R = 0.95) or VIF = 5 (corresponding to R = 0.9) seems to be commonly used as the threshold for multicollinearity. This time I take the stricter one and exclude explanatory variables whose correlation coefficient is 0.9 or higher in absolute value. Specifically:
・**"1. Ratio aged 60 and over" is excluded**: its correlation coefficient with "2. 30s-50s ratio" is -0.96, and its correlation with the approval rate is lower than that of feature 2.
・**"6. Average annual income per person" is excluded**: its correlation coefficient with "2. 30s-50s ratio" is 0.89, slightly below the threshold, but it is excluded because a clear causal relationship can be assumed ("where there are many people in their prime working years, annual income is high").
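The correspondence between VIF and the correlation coefficient quoted above can be checked with the two-variable form of the definition, VIF = 1 / (1 - R^2). The sketch below assumes that simplification (in general, R^2 comes from regressing one explanatory variable on all the others); `vif_from_r` is a hypothetical helper name.

```python
def vif_from_r(r):
    """Variance inflation factor for the two-variable case: 1 / (1 - r^2)."""
    return 1.0 / (1.0 - r ** 2)


# Thresholds mentioned in the text plus the two correlations found in 4-1
for r in (0.89, 0.90, 0.95, 0.96):
    print(f"|r| = {r:.2f} -> VIF = {vif_from_r(r):.2f}")
```

Under this simplification, |r| = 0.96 is well above the VIF = 10 level, while |r| = 0.89 gives a VIF of about 4.8, just under the VIF = 5 level, which matches the "slightly below the threshold" remark.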
Therefore, the following 4 features are used in the rest of the analysis.
**Feature 2 "Ratio of people in their 30s to 50s"** (takes the reversal in the 20s into account)
**Feature 3 "Male ratio"**
**Feature 4 "Travel time from the station nearest the main government building to the central station of the old ward"**
**Feature 5 "Average household size"**

The scatter plot after excluding these features is shown in the figure below. The regression analysis proceeds with these explanatory variables. From here on, to emphasize that this is a regression analysis, the term "features" is unified to "explanatory variables".
As mentioned earlier, I use gradient boosting (XGBoost), which is the mainstream of recent regression analysis. There is a handy Python library that makes it easy to use, but it has several hyperparameters, so I optimize them by grid search.
Combine grid search and cross-validation to find the best hyperparameters.
To put it simply, cross-validation is "Dividing the data into N equal parts to find a parameter setting that balances performance and overfitting prevention." (Refer to the Wikipedia image below for the method of equally dividing the data).
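The "N equal parts" idea can be sketched without any library. This is only an illustration of how the folds are formed, similar in spirit to an unshuffled K-fold split; `k_fold_indices` is a hypothetical helper, and the sample count of 10 is arbitrary.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for contiguous, near-equal folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds receive one additional sample
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(n_samples=10, n_splits=5))
for train, test in folds:
    print(f"train={train} test={test}")
```

Each sample is used as test data exactly once, and the score of a parameter setting is the average over the folds.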
Grid search is a method of brute force searching for a combination of parameters that has been decided in advance. The disadvantage is that it takes time, and the advantage is that it is easy to implement. I searched the net and used a range of parameters that are actually commonly used in XGBoost as candidates.
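To get a feeling for why grid search takes time, the number of model fits can be counted in advance. The sketch below copies only the multi-valued entries of the `cv_params` grid from this article's code and assumes 5-fold cross-validation, as in that code.

```python
from itertools import product

# Parameters with more than one candidate value (taken from cv_params in this article)
grid = {
    'learning_rate': [0.1, 0.3, 0.5],
    'min_child_weight': [1, 5, 15],
    'max_depth': [3, 5, 7],
    'colsample_bytree': [0.5, 0.8, 1.0],
    'subsample': [0.5, 0.8, 1.0],
}

# Every combination of candidate values, as a list of dicts
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5  # cv=5 in the grid search
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} fits")
```

Each additional parameter with three candidates triples the cost, which is the main drawback mentioned above.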
import pandas as pd
import xgboost as xgb
from sklearn import metrics as met
import sklearn as skl
from sklearn.model_selection import LeaveOneOut
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
import numpy as np
import matplotlib.pyplot as plt
import os
from datetime import datetime
import seaborn as sns
# Fields to use
KEY_VALUE = 'ward_before'  # Key column
OBJECTIVE_VARIALBLE = 'approval_rate'  # Objective variable
USE_EXPLANATORY = ['2_between_30to60', '3_male_ratio', '4_required_time', '5_household_member']  # Explanatory variables to use
# Load data
df = pd.read_csv('./osaka_metropolis_english.csv')
# Get objective and explanatory variables (as ndarrays rather than pandas objects)
y = df[[OBJECTIVE_VARIALBLE]].values
X = df[USE_EXPLANATORY].values
# Parameters shared by grid search and performance evaluation
num_round = 10000  # Maximum number of boosting rounds
early_stopping_rounds = 50  # Stop training when the evaluation metric has not improved this many rounds in a row
seed = 42  # Random seed
# Parameters for grid search (for details see https://qiita.com/R1ck29/items/50ba7fa5afa49e334a8f)
cv_params = {'eval_metric': ['rmse'],  # Evaluation metric
             'objective': ['reg:squarederror'],  # Loss function to minimize
             'random_state': [seed],  # Random seed
             'booster': ['gbtree'],
             'learning_rate': [0.1, 0.3, 0.5],
             'min_child_weight': [1, 5, 15],
             'max_depth': [3, 5, 7],
             'colsample_bytree': [0.5, 0.8, 1.0],
             'subsample': [0.5, 0.8, 1.0]
             }
# Create the XGBoost instance
# (XGBoost 1.6 and later take early_stopping_rounds in the constructor; passing it to fit() was removed in 2.0)
cv_model = xgb.XGBRegressor(early_stopping_rounds=early_stopping_rounds)
# Create the grid search instance
# With n_jobs=-1, all CPU cores are used in parallel, which is much faster.
cv = GridSearchCV(cv_model, cv_params, cv=5, scoring='r2', n_jobs=-1)
# Run the grid search
evallist = [(X, y)]
cv.fit(X,
       y,
       eval_set=evallist
       )
# Show the best parameters
print('Optimal parameters' + str(cv.best_params_))
print('Variable importance' + str(cv.best_estimator_.feature_importances_))
It may seem strange that the data used for grid search and the test data are not separated here, but the purpose of this section is only to optimize the parameters, and performance is evaluated separately in the next section, so please bear with me. (Running the grid search for every Leave-One-Out split would take far too long...)
Performance is evaluated by separating the test data from the training data.
The following indicators are used.
**RMSE average:** the overall size of the prediction error (smaller is better)
**Correlation coefficient between predicted and actual values:** the strength of the correlation between predicted and actual values (larger is better)
**Maximum prediction error:** checks whether any ward has an extremely large prediction error (smaller is better)
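The three indicators can be sketched in plain Python as follows. The predicted and actual values are hypothetical, and `evaluate` is a name made up for this illustration. One assumption to note: when every test fold contains a single ward, the RMSE of a fold is simply the absolute error of that ward, so the "RMSE average" is computed here as the mean absolute error (the article's code reads the RMSE from the training log, so its value can differ slightly).

```python
import math


def evaluate(pred, real):
    """Return (mean per-ward error, Pearson correlation, maximum absolute error)."""
    abs_errors = [abs(p - r) for p, r in zip(pred, real)]
    rmse_mean = sum(abs_errors) / len(abs_errors)  # RMSE of a single sample is |error|
    max_error = max(abs_errors)
    n = len(pred)
    mean_p, mean_r = sum(pred) / n, sum(real) / n
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, real))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in real))
    corr = cov / (std_p * std_r)
    return rmse_mean, corr, max_error


# Hypothetical predicted and actual approval rates for four wards
pred = [0.48, 0.50, 0.52, 0.55]
real = [0.47, 0.52, 0.51, 0.58]
rmse_mean, corr, max_error = evaluate(pred, real)
print(rmse_mean, corr, max_error)
```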
There are several ways to split the data into training and test data. From the following points of view,
・Since the data set is small, I want to keep as much training data as possible.
・I want to calculate the performance for each ward and identify wards that do not fit the model.
I evaluate performance with **"Leave-One-Out"**, which takes out one sample at a time as test data and uses the rest as training data (see the figure below from Wikipedia; each ward is predicted by a model trained on the data from all other wards).
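The Leave-One-Out splitting itself is simple enough to sketch without a library. In this illustration the approval rates are hypothetical, and the "model" is just the mean of the training targets, standing in for the XGBoost model used in this article.

```python
def leave_one_out(n_samples):
    """Yield (train_indices, test_indices) with exactly one test sample per split."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]


# Hypothetical approval rates for five wards
y = [0.45, 0.48, 0.50, 0.53, 0.55]

predictions = []
for train_idx, test_idx in leave_one_out(len(y)):
    # Stand-in model: predict the mean of the training targets
    train_mean = sum(y[j] for j in train_idx) / len(train_idx)
    predictions.append(train_mean)
    print(f"ward {test_idx[0]}: predicted={train_mean:.4f} actual={y[test_idx[0]]:.4f}")
```

Because every ward gets its own prediction from a model that never saw it, wards with unusually large errors can be picked out individually.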
The following code is added to the above parameter optimization code to evaluate the performance.
#%% 3. Performance evaluation (Leave-One-Out)
# Use the best parameters found by grid search
params = cv.best_params_
# List for collecting per-ward results (DataFrame.append was removed in pandas 2.0)
results = []
# Evaluate performance by splitting the data with Leave-One-Out
loo = LeaveOneOut()
for train_index, test_index in loo.split(X):  # Loop over all splits
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    dtrain = xgb.DMatrix(X_train, label=y_train)  # Training data
    dtest = xgb.DMatrix(X_test, label=y_test)  # Test data
    # Data sets whose metrics are recorded (XGBoost uses the last entry for early stopping)
    evals = [(dtest, 'eval'), (dtrain, 'train')]
    evals_result = {}  # Holds the recorded metrics
    # Train
    model = xgb.train(params,
                      dtrain,  # Training data
                      num_boost_round=num_round,
                      early_stopping_rounds=early_stopping_rounds,
                      evals=evals,
                      evals_result=evals_result
                      )
    # Evaluate the model (ntree_limit was removed in XGBoost 2.0; iteration_range replaces it)
    test_pred = model.predict(dtest, iteration_range=(0, model.best_iteration + 1))
    num_train = len(evals_result['eval']['rmse'])
    results.append({'test_index': test_index[0],
                    'key_value': df[[KEY_VALUE]].iloc[test_index[0], 0],
                    'pred_value': test_pred[0],
                    'real_value': df[[OBJECTIVE_VARIALBLE]].iloc[test_index[0], 0],
                    'eval_rmse_min': evals_result['eval']['rmse'][num_train - 1],
                    'train_rmse_min': evals_result['train']['rmse'][num_train - 1],
                    'num_train': num_train})
# DataFrame holding the results of all splits
df_result = pd.DataFrame(results)
# Show the performance evaluation results
print('RMSE average' + str(df_result['eval_rmse_min'].mean()))
print('Correlation coefficient' + str(df_result[['pred_value', 'real_value']].corr().iloc[1, 0]))
print('Maximum prediction error' + str(max((df_result['pred_value'] - df_result['real_value']).abs())))
# Output the results
dt_now = datetime.now().strftime('%Y%m%d%H%M%S')
feat_use = 'feat' + '-'.join([ex.split('_')[0] for ex in USE_EXPLANATORY])
# Windows desktop folder (built with os.path.join to avoid invalid backslash escapes)
desktop = os.path.join(f"{os.getenv('HOMEDRIVE')}{os.getenv('HOMEPATH')}", 'Desktop')
# Evaluation results
df_result.to_csv(os.path.join(desktop, f'{feat_use}_{dt_now}_result.csv'))
path = os.path.join(desktop, f'{feat_use}_{dt_now}_result.txt')
with open(path, mode='w') as f:
    f.write('Feature value' + str(USE_EXPLANATORY))
    f.write('\n Optimal parameters' + str(cv.best_params_))
    f.write('\n Grid search target' + str(cv_params))
    f.write('\n variable importance' + str(cv.best_estimator_.feature_importances_))
    f.write('\nRMSE average' + str(df_result['eval_rmse_min'].mean()))
    f.write('\n Correlation coefficient' + str(df_result[['pred_value', 'real_value']].corr().iloc[1, 0]))
    f.write('\n Maximum prediction error' + str(max((df_result['pred_value'] - df_result['real_value']).abs())))
# Show the scatter plot
sns.regplot(x="pred_value", y="real_value", data=df_result, ci=0)
The performance evaluation index obtained this time is as follows.
RMSE average (smaller is better): 0.0226
Correlation coefficient between predicted and actual values (larger is better): 0.619
Maximum prediction error (smaller is better): 0.0709

Whether these values are good or bad is assessed in the next section, so **first, let's look at a scatter plot comparing the predicted values (horizontal axis) and the actual values (vertical axis)**.
Looking at the scatter plot above, I felt uneasy about the result for Fukushima Ward, where the difference between the predicted and actual values is large. For example, compared with Miyakojima Ward, which would belong to the same new Kita Ward after the merger, the values of all features are close, yet the predicted value for Fukushima Ward is unnaturally low (if anything, I would expect it to be higher than Miyakojima Ward's because of the larger ratio of people in their 30s to 50s). It seems that the model overfits the training data of the Leave-One-Out split. Countermeasures against overfitting are considered in the following section.
Since the phenomenon that seems to be overfitting occurred in the previous section, we will apply improvement measures to improve the performance of the model.
I couldn't think of a method other than grid search to prevent overfitting, so I narrowed down the explanatory variables to be used.
Explanatory variables | RMSE average (smaller is better) | Correlation coefficient between predicted and actual values (larger is better) | Maximum prediction error (smaller is better) |
---|---|---|---|
2,3,4,5 Use all | 0.0226 | 0.619 | 0.0709 |
2,3,4 | 0.0213 | 0.648 | 0.0531 |
**2,3,5** | **0.0191** | **0.744** | **0.0474** |
2,4,5 | 0.0222 | 0.614 | 0.0739 |
3,4,5 | 0.0221 | 0.642 | 0.0593 |
2,3 | 0.0197 | 0.707 | 0.0519 |
**2,4** | **0.0181** | **0.733** | **0.0604** |
2,5 | 0.0248 | 0.572 | 0.0590 |
3,4 | 0.0229 | 0.568 | 0.0610 |
3,5 | 0.0232 | 0.596 | 0.0503 |
4,5 | 0.0304 | 0.460 | 0.0750 |
2 | 0.0209 | 0.646 | 0.0559 |
3 | 0.0303 | -0.996 | 0.0726 |
4 | 0.0246 | 0.541 | 0.0532 |
5 | 0.0292 | 0.407 | 0.0686 |
The combinations with particularly high performance have been bolded. The scatter plot is also shown below as before.
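The 15 rows of the table above correspond to every non-empty subset of the four explanatory variables. A minimal sketch of how such a list could be enumerated is shown below; the column names are the ones used in this article's code, while the loop that would retrain and evaluate the model for each subset is left out.

```python
from itertools import combinations

features = ['2_between_30to60', '3_male_ratio', '4_required_time', '5_household_member']

# All non-empty subsets, from the full set of four down to single variables
subsets = [list(c)
           for size in range(len(features), 0, -1)
           for c in combinations(features, size)]

for subset in subsets:
    # Label in the same style as the table, e.g. "2,3,5"
    label = ','.join(name.split('_')[0] for name in subset)
    print(label)
```

Each subset would then be assigned to `USE_EXPLANATORY` before rerunning the grid search and the Leave-One-Out evaluation.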
Since the four explanatory variables are correlated or causally related among three or more of them (for example, "the closer to the main government building, the more working-age people in their 30s to 50s live there, and since they have smaller families, the average household size is smaller"), I hypothesize that using all of them may have an adverse effect.
Also, honestly, I cannot rule out that I failed to get the best performance with 4 variables because of my limited skill with gradient boosting. **If you know something like "this will prevent overfitting!"**, I would appreciate a comment.
Looking at the scatter plot of predicted versus actual values for the well-performing model with explanatory variables 2, 3 and 5, the wards whose actual values are lower than predicted are conspicuously on the south side (especially the wards shown in red, which would become the new Tennoji Ward). Conversely, among the northern wards shown in green, such as Kita Ward and Yodogawa Ward, many have actual values higher than predicted.
It is said that the regional character of Osaka City differs greatly between the north and the south, and I hypothesize that this regional difference influenced voting behavior.
Therefore, this time I **added "latitude" as an explanatory variable representing north and south**. (Since the degrees part of "north latitude ○ degrees △ minutes □ seconds" is 34 for every ward, only the minutes and seconds are used.)
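One possible way to turn a latitude into such a feature is to express it as decimal minutes north of 34 degrees. The exact encoding used for this article's data is not stated, so the sketch below is an assumption; `latitude_feature` is a hypothetical helper and the coordinates are made-up examples.

```python
def latitude_feature(degrees, minutes, seconds, base_degrees=34):
    """Return the latitude as decimal minutes north of base_degrees."""
    return (degrees - base_degrees) * 60 + minutes + seconds / 60


# Hypothetical coordinates of two ward offices
north_ward = latitude_feature(34, 43, 30)
south_ward = latitude_feature(34, 37, 15)
print(north_ward, south_ward)
```

A larger value means a ward farther north, so the sign of the relationship with the approval rate can be read directly.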
Explanatory variables | RMSE average (smaller is better) | Correlation coefficient between predicted and actual values (larger is better) | Maximum prediction error (smaller is better) |
---|---|---|---|
2,3,5 | 0.0191 | 0.744 | 0.0474 |
2,3,5+latitude | 0.0174 | 0.767 | 0.0444 |
2,4 | 0.0181 | 0.733 | 0.0604 |
2,4+latitude | 0.0182 | 0.755 | 0.0586 |
With both "2,3,5" and "2,4", adding latitude improves the correlation coefficient and the maximum prediction error. The RMSE average improves for "2,3,5" and is practically unchanged for "2,4" (0.0181 versus 0.0182). After all, the difference between north and south seems to be a factor in the approval rate.
The best-performing model (explanatory variables 2, 3, 5 and latitude) uses the following features, in descending order of importance.
Feature 2 **"Ratio of people in their 30s to 50s"**
Feature 5 **"Average household size"**
Additional feature **"Latitude"**
Feature 3 **"Male ratio"**
The features used here can be regarded as factors that affect the approval rate, which is the objective variable.
I will avoid deep consideration of the causes, but it seems that
**wards "with a large ratio of the prime working generation and of men, a small average household size, and located on the north side" tend to have more approval**
**wards "with a small ratio of the prime working generation and of men, large households, and located on the south side" tend to have more opposition**
(Judging from the feature importances in the previous section, the influence of age composition is the largest and that of the male ratio is the smallest.)
Also, as can be seen from the correlation analysis in 4-1, I think it can be said that **the relatively strong opposition among people in their 20s (the reversal relative to those in their 30s) also affects the difference in approval rates between wards**.
Through this analysis, I came away with the following impressions.
One reason I suspected overfitting in 4-4 was that "I cannot think of any reason why Fukushima Ward should be an outlier." Fukushima Ward is distinctive in being famous for its food, but as far as factors that influence voting are concerned, I had the image (domain knowledge) of it as an average ward within the city, which led me to the idea that "if this ward is an outlier, the model itself must be performing poorly."
In Chapter 2, I added the feature "4. Travel time to the main government building" based on the idea that the farther residents are from the main government building, the more inconvenient they feel and the lower the approval rate.
While clicking through the transit site to collect the data, I was excited, thinking **"This must be a feature that improves performance! The ultimate domain knowledge!"** But when I actually ran the analysis, its influence was only large enough to be buried among the other features.
And because I had grown attached to the data I had worked hard to collect, a feeling arose that lacks the objectivity an analyst needs: **"I don't want to throw this feature away! I want to force it in."**
I felt how difficult it is to set aside the preconceptions and feelings that arise from such domain knowledge and judge only from the facts obtained.
Working with timely, much-discussed data was motivating. I will continue to analyze current events as a way to put what I have studied into practice!
As a bonus analysis, I also look at the wards where the difference from the previous vote is large.
I compared the approval rate of the previous vote in 2015 with this time's approval rate (horizontal axis: previous, vertical axis: this time). The overall tendency is almost the same as last time (R2 = 0.88), but some wards have changed (the most notable wards are shown in red).
I think one reason for the changes mentioned above may be the change in how the wards are grouped after the merger, so I compared the groupings of the previous plan and this one.
The figure below is from the Sankei Shimbun website
The grouping of the central Chuo Ward and Kita Ward is unchanged, but the grouping of the surrounding wards (the East, South, and Bay Wards of the previous plan) has changed significantly (they have been incorporated into wards whose main government building is located near the city center).
I extracted the wards whose approval rate increased or decreased by 2% or more from the previous vote. Based on the grouping in the previous section (Chuo Ward and Kita Ward versus the others), whether the grouping after the merger changed is also shown.
Ward name | Change in grouping after the merger | Increase / decrease |
---|---|---|
Taisho Ward | Yes | +2.54% |
Nishinari Ward | No | +2.49% |
Kita Ward | No | -2.73% |
Chuo Ward | Yes | -3.12% |
Abeno Ward | Yes | -3.32% |
Nishi Ward | No | -3.89% |
Minato Ward | Yes | -4.82% |
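
The extraction step behind the table above can be sketched as a simple threshold on the difference between the two votes. The ward names and approval rates below are hypothetical placeholders, not the actual results, and the 0.02 threshold mirrors the "2% or more" criterion.

```python
# Hypothetical approval rates as fractions (not the actual results)
previous = {'Ward A': 0.500, 'Ward B': 0.480, 'Ward C': 0.520}
current = {'Ward A': 0.530, 'Ward B': 0.475, 'Ward C': 0.485}
threshold = 0.02  # change of 2 points or more

# Change from the previous vote for each ward
changes = {ward: current[ward] - previous[ward] for ward in previous}
# Keep only wards whose change reaches the threshold in either direction
large_changes = {ward: diff for ward, diff in changes.items() if abs(diff) >= threshold}

# List from the largest increase to the largest decrease
for ward, diff in sorted(large_changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{ward}: {diff:+.2%}")
```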
I analyze the wards whose grouping changed after the merger and those whose grouping did not separately.
The wards whose grouping changed are Minato Ward, Abeno Ward, and Taisho Ward. This is somewhat arbitrary, but I feel that Minato Ward and the other two are likely to have different factors.
Minato Ward shows a large decrease, and as various reports mentioned, there seem to have been voices pointing out the geographical distance before the vote. Looking at the map after the merger, Minato Ward is very far from Juso Station, where the main government building would be located. Not only is it far away, it is also separated by a large river, the Yodo River, so the regional connection is weak. Even by train, the transfer at Umeda Station on the way takes a long time, which is very inconvenient. (Compared to Tokyo, picture the positional relationship between Odaiba (Koto Ward) and Shin-Koiwa (Katsushika Ward).)
In the previous plan, the main government building was to be located in Minato Ward. Konohana Ward, which is in a similar position, also saw its approval rate drop by 1.5%, so I speculate that **the geographical distance to the main government building** has an effect.
(Please forgive some subjectivity.) As for Abeno Ward, in the previous plan the main government building of the new "Minami Ward" was to be placed in Abeno Ward, but in this plan it would be placed in the old Tennoji Ward, and the new ward would also be named "Tennoji Ward". Considering that [the name "Abeno Harukas" is said to reflect a wish to make the place name Abeno known nationwide](https://ja.wikipedia.org/wiki/%E3%81%82%E3%81%B9%E3%81%AE%E3%83%8F%E3%83%AB%E3%82%AB%E3%82%B9#%E6%A6%82%E8%AA%AC), I feel that moving the ward office and the ward name to the neighboring Tennoji Ward, with which there is a certain rivalry, looks like a disadvantage to the residents.
Conversely, for Taisho Ward the main government building was to be in the adjacent Minato Ward last time, but this time Taisho Ward would join the larger Chuo Ward. I speculate that **the approval rate rose because it escaped the situation of having the main government building go to an adjacent ward of similar size**.
The wards whose grouping did not change are Nishinari Ward (approval rate increased), Kita Ward (decreased), Chuo Ward (decreased), and Nishi Ward (decreased).
The cause is not very clear, but it seems that some political change has occurred over the last five years regardless of the change in grouping. (Kita Ward, Chuo Ward, and Nishi Ward, where the approval rate fell, are all among the top four wards by average annual income per person, while Nishinari Ward, where it rose, has the lowest. One could hypothesize that the change is related to income.)
The changes seem to fall into the following 3 categories.
**Geographical isolation from the ward office**: Minato Ward, (Konohana Ward)
**Main government building placed in a neighboring ward**: Abeno Ward, Taisho Ward, (Asahi Ward)
**Some change regardless of the change in grouping (possibly related to income?)**: Nishinari Ward, Chuo Ward, Nishi Ward, Kita Ward