The k-nearest neighbor method offers two weighting functions (weighting methods) for prediction.

- **uniform**: Uniform weights. All neighboring points are weighted equally.
- **distance**: Each point is weighted by the reciprocal of its distance, so nearer points have more influence than distant ones.

The default is uniform, so to use distance you need to pass `weights='distance'` as an argument when creating the instance.
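
As a minimal sketch of what the two weighting methods compute, the toy example below (the data values and the names `X_toy`, `y_toy`, and `query` are made up for illustration) compares the predictions of scikit-learn's `KNeighborsRegressor` with a hand-calculated plain mean and inverse-distance weighted mean.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy data: three training points on a line
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([10.0, 20.0, 40.0])
query = np.array([[2.0]])  # distances to the training points: 2, 1, 1

# weights='uniform' is the default; 'distance' must be passed explicitly
pred_uniform = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy).predict(query)[0]
pred_distance = KNeighborsRegressor(n_neighbors=3, weights='distance').fit(X_toy, y_toy).predict(query)[0]

# Manual calculation: plain mean vs. mean weighted by 1 / distance
d = np.abs(X_toy.ravel() - query[0, 0])
w = 1.0 / d
manual_uniform = y_toy.mean()
manual_distance = np.sum(w * y_toy) / np.sum(w)

print(pred_uniform, manual_uniform)
print(pred_distance, manual_distance)
```

With distance weighting, the two points at distance 1 count twice as much as the point at distance 2, so the prediction is pulled toward their values.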

**How do these differences affect the prediction results?** The classification model from the previous article is shown as an example.

The decision boundaries differ only slightly overall, but there are clear differences in places, especially in the areas circled in red.
In general, distance follows the data more faithfully.
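
To illustrate how the weighting can move a classification boundary, here is a small sketch with made-up data (not the data of the previous article). One nearby point of class 0 competes with two more distant points of class 1. With uniform weights the majority wins, while with distance weights the single near point outweighs the two far ones.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: one class-0 point near the query, two class-1 points farther away
X_toy = np.array([[0.0], [3.0], [4.0]])
labels = np.array([0, 1, 1])
query = np.array([[1.0]])  # distances: 1 (class 0), 2 and 3 (class 1)

vote_uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X_toy, labels).predict(query)[0]
vote_distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X_toy, labels).predict(query)[0]

# uniform : class 1 wins by 2 votes to 1
# distance: class 0 has weight 1/1 = 1.0, class 1 has 1/2 + 1/3 (about 0.83)
print(vote_uniform, vote_distance)
```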

**Next, let us compare the two in the case of a regression model.**

⑴ Import library

import numpy as np
import pandas as pd

# scikit-learn library
from sklearn.datasets import load_boston             #Boston house price dataset (deprecated in scikit-learn 1.0, removed in 1.2)
from sklearn.model_selection import train_test_split #Data split utility
from sklearn.neighbors import KNeighborsRegressor    #k-nearest neighbor regression model

#Visualization library
import matplotlib.pyplot as plt
import seaborn as sns

#Japanese display module of matplotlib
!pip install japanize-matplotlib
import japanize_matplotlib

The k-nearest neighbor method is mainly used as a classification model under the name k-NN (k-Nearest Neighbor), but the neighbors module of scikit-learn also provides a **regression model, KNeighborsRegressor**.

1. Prepare the data

Get the "Boston" dataset from the scikit-learn datasets.
It consists of 506 samples. The objective variable is the "house price" in Boston, a large city in Massachusetts in the northeastern United States, and the explanatory variables are 13 attributes such as crime rate, average number of rooms per house, and traffic access.

⑵ Data acquisition and organization

#Get dataset
boston = load_boston()

#Convert explanatory variables to DataFrame
df = pd.DataFrame(boston.data, columns=boston.feature_names)

#Concatenate objective variables
df = pd.concat([df, pd.DataFrame(boston.target, columns=['MEDV'])], axis=1)
print(df)

The rightmost column "MEDV" is an abbreviation for median value. It is the objective variable (house price), defined as "the median value of owner-occupied homes in units of $1000".
There are 13 explanatory variables in total, from the leftmost column "CRIM (crime rate)" to "LSTAT (lower status): proportion of the lower-status population", but for the sake of simplicity we will select only one of them.

⑶ Examination of analysis axis by correlation matrix

Create a correlation matrix between all variables and focus on the variables that are highly correlated with the house price.

#Create a correlation matrix
correlation_matrix = np.corrcoef(df.T)

#Row / column labels
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 
         'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

#Convert correlation matrix to DataFrame
correlation_df = pd.DataFrame(correlation_matrix, columns = names, index = names)

#Draw heatmap
plt.figure(figsize=(10,8))
sns.heatmap(correlation_df, annot=True, cmap='coolwarm')

NumPy's corrcoef() function is used to compute the correlation matrix. Because corrcoef() treats each row as one variable by default, the rows and columns of the data are transposed with .T so that the correlations between variables (columns) are calculated.
Seaborn is used for the heat map. The argument `annot=True` of heatmap() displays the value of each cell in the figure.
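
The role of the transpose can be checked on a small example. The sketch below uses random toy data (`df_toy` is a made-up name) and shows that `np.corrcoef(df.T)` gives the same matrix as `np.corrcoef` with `rowvar=False` and as pandas' own `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 samples (rows) of 3 variables (columns)
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])

# corrcoef() expects one variable per row, hence the transpose
corr_transposed = np.corrcoef(df_toy.T)

# Equivalent alternatives that avoid the transpose
corr_rowvar = np.corrcoef(df_toy.to_numpy(), rowvar=False)
corr_pandas = df_toy.corr()  # Pearson correlation, labels kept automatically

print(corr_transposed.shape)  # one row and one column per variable
print(corr_pandas)
```

Using `df.corr()` also removes the need to write out the row and column labels by hand, since the result is already a labeled DataFrame.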

Among the variables that correlate strongly with house price (MEDV), the "lower-class ratio (LSTAT)" shows a strong negative correlation at -0.74, and the "average number of rooms per dwelling (RM)" shows a strong positive correlation at 0.70.
This matches intuition: the more rooms a house has, the higher its price usually is, and in areas with many low-income households the market price tends to be lower. Here we will simply adopt the "average number of rooms per dwelling (RM)".

⑷ Data extraction and division

#Extract only 2 variables
df_extraction = df[['RM', 'MEDV']]

#Set variables X and y
X = np.array(df_extraction['RM'])
y = np.array(df_extraction['MEDV'])

X = X.reshape(len(X), 1) #Convert to 2D
y = y.reshape(len(y), 1)

#Data division for training / testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)

X_train = X_train.reshape(len(X_train), 1) #Convert to 2D
X_test = X_test.reshape(len(X_test), 1)
y_train = y_train.reshape(len(y_train), 1)
y_test = y_test.reshape(len(y_test), 1)

Extract only the explanatory variable RM and the objective variable MEDV, store them in the variables X and y, and then split each into training and test sets.
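
As a rough illustration of what the split produces, the sketch below applies `train_test_split` to a tiny made-up dataset (`X_toy`, `y_toy`). When no size is specified, scikit-learn holds out 25% of the samples for testing, and the arrays keep their 2D shape, so no further reshape is needed after the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 8 samples, 1 feature, already in 2D shape (n_samples, n_features)
X_toy = np.arange(8, dtype=float).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel()

# Default split: 75% training, 25% test; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_te.shape)  # 2D shapes are preserved
print(y_tr.shape, y_te.shape)
```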

2. Examination of k parameters

⑸ Execute k-NR while changing the k parameter

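Run k-NR while changing k from 1 to 20, and observe how the accuracy rate changes for the training data and the test data. Note that for a regression model, score() returns the coefficient of determination R² rather than a classification accuracy. This is the value referred to as the "accuracy rate" (correct answer rate) below.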

#Variable to store the correct answer rate
train_accuracy = []
test_accuracy = []

for k in range(1,21):
    kNR = KNeighborsRegressor(n_neighbors = k) #Instance generation
    kNR.fit(X_train, y_train) #Learning
    train_accuracy.append(kNR.score(X_train, y_train)) #Training accuracy rate
    test_accuracy.append(kNR.score(X_test, y_test)) #Test accuracy rate

#Convert correct answer rate to array
train_accuracy = np.array(train_accuracy)
test_accuracy = np.array(test_accuracy)
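
The value returned by score() can be checked by hand. The following sketch uses synthetic data (the linear trend and noise level are arbitrary assumptions, not the Boston data) and confirms that `KNeighborsRegressor.score` agrees with `sklearn.metrics.r2_score` and with the definition R² = 1 - SS_res / SS_tot.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: a noisy linear trend
rng = np.random.default_rng(0)
X_toy = rng.uniform(3, 9, size=(60, 1))
y_toy = 9.0 * X_toy.ravel() - 35.0 + rng.normal(scale=5.0, size=60)

model = KNeighborsRegressor(n_neighbors=5).fit(X_toy, y_toy)
pred = model.predict(X_toy)

score = model.score(X_toy, y_toy)  # what the loop above stores
via_metric = r2_score(y_toy, pred)  # same quantity via sklearn.metrics

# Definition: 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_toy - pred) ** 2)
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)
by_hand = 1.0 - ss_res / ss_tot

print(score, via_metric, by_hand)
```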

⑹ Select the optimum k parameter

Visualize changes in the accuracy rate between training and testing, and also show the difference in accuracy rate in a graph.

#Changes in the accuracy rate of training and tests
plt.figure(figsize=(6, 4))

plt.plot(range(1,21), train_accuracy, label='Training')
plt.plot(range(1,21), test_accuracy, label='test')

plt.xticks(np.arange(0, 21, 1)) #x-axis scale
plt.xlabel('k number')
plt.ylabel('Correct answer rate')
plt.title('Transition of correct answer rate')

plt.grid()
plt.legend()

#Transition of difference in correct answer rate
plt.figure(figsize=(6, 4))

difference = np.abs(train_accuracy - test_accuracy) #Calculate the difference
plt.plot(range(1,21), difference, label='Difference')

plt.xticks(np.arange(0, 21, 1)) #x-axis scale
plt.xlabel('k number')
plt.ylabel('Difference(train - test)')
plt.title('Transition of difference in correct answer rate')

plt.grid()
plt.legend()

plt.show()

As k increases, the training accuracy rate decreases while, conversely, the test accuracy rate increases, but both are almost flat from around k = 9.
Looking at the difference, it decreases gradually up to k = 14, so we adopt k = 14.
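
One way to see why the training score falls as k grows is to look at the extreme case k = 1. In the sketch below (synthetic data with distinct x values, an assumption that does not hold exactly for the Boston data), every training point is its own nearest neighbor, so the model reproduces the training targets perfectly. Larger k averages over more neighbors and no longer does.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: a noisy linear trend with distinct x values
rng = np.random.default_rng(0)
X_toy = rng.uniform(3, 9, size=(60, 1))
y_toy = 9.0 * X_toy.ravel() - 35.0 + rng.normal(scale=5.0, size=60)

# Training score (R^2) for a few values of k
train_scores = {}
for k in (1, 5, 20):
    model = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)
    train_scores[k] = model.score(X_toy, y_toy)

print(train_scores)
```

A perfect training score at small k is therefore a sign of memorization rather than of a good model, which is why the gap to the test score is used here to choose k.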

3. Model execution and evaluation

Create dummy data to be used for prediction.

⑺ Create dummy data

#Generate arithmetic progression
t = np.linspace(1, 10, 1000) #Starting value,End value,Element count

#Convert shape to 2D
T = t.reshape(1000, 1)
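
The same steps on a smaller scale may make the shapes easier to follow. This sketch (the names `t_small` and `T_small` are made up) generates four evenly spaced values and converts them to the 2D shape (n_samples, n_features) that predict() expects. Using `reshape(-1, 1)` lets NumPy infer the number of rows.

```python
import numpy as np

# Four evenly spaced values from 1 to 10 (both ends included)
t_small = np.linspace(1, 10, 4)

# Column vector: one row per sample, one column per feature
T_small = t_small.reshape(-1, 1)

print(t_small)
print(T_small.shape)
```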

⑻ Execution and visualization of regression model

n_neighbors = 14

plt.figure(figsize=(12,5))

for i, w in enumerate(['uniform', 'distance']):
    model = KNeighborsRegressor(n_neighbors, weights=w)
    model = model.fit(X, y)
    y_ = model.predict(T)

    plt.subplot(1, 2, i + 1)
    plt.scatter(X, y, color='limegreen', label='data')
    plt.plot(T, y_, color='navy', lw=1, label='Predicted value')
    plt.legend()
    plt.title("weights = '%s'" % (w))

plt.tight_layout()
plt.show()

tight_layout() automatically adjusts the subplot parameters (axis scale, axis label, title range) so that the subplots fit snugly within the figure area.

At a glance, you can see that with distance the predicted values fluctuate sharply, which indicates overfitting to the data.
With uniform, on the other hand, the information is summarized into a smoother curve, so it is no wonder that uniform is the default.
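
The overfitting of distance can also be checked numerically. In the sketch below (synthetic data with distinct x values, used here as a stand-in for the Boston data), a training point has distance zero to itself, so with `weights='distance'` the prediction at that point is exactly its own target value and the training score becomes 1.0. With uniform weights the prediction is an average of 14 neighbors and the training score stays below 1.0.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical synthetic data: a noisy linear trend with distinct x values
rng = np.random.default_rng(0)
X_toy = rng.uniform(3, 9, size=(60, 1))
y_toy = 9.0 * X_toy.ravel() - 35.0 + rng.normal(scale=5.0, size=60)

# Training score (R^2) for each weighting method, with k = 14 as in the article
train_r2 = {}
for w in ['uniform', 'distance']:
    model = KNeighborsRegressor(n_neighbors=14, weights=w).fit(X_toy, y_toy)
    train_r2[w] = model.score(X_toy, y_toy)

print(train_r2)
```

Because the curve is forced through every training point, noise in the data is reproduced in the prediction, which is consistent with the jagged line seen in the right-hand plot.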
