I sometimes need to read a CSV file that has no header row and assign column names to it, but I often forget how, so I am writing the steps down as a note to myself.
Apologies in advance: there is nothing particularly advanced here.
The data is the housing data (housing.data) published in the UCI Machine Learning Repository.
First, read the data. The values are separated by whitespace rather than commas, so pass a whitespace pattern to sep. Also, because housing.data has no header row, a default read would treat the first data row as the column names; specify header=None to prevent that.
import pandas as pd
df = pd.read_csv("housing.data", header=None, sep=r"\s+")  # raw string avoids an invalid-escape warning for \s
After reading, the columns of the data frame are automatically labeled with the integers 0 to 13. Replace these automatically generated names with the original column names. First, build a dictionary (labels_dict) that maps each old column name to its new one. Passing labels_dict to the rename method of the data frame replaces the column names according to that mapping.
labels = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]
labels_dict = {num: label for num, label in enumerate(labels)}
df = df.rename(columns=labels_dict)
# Save the data frame with the column names added as a CSV file.
df.to_csv("housing_data.csv", index=False)
If you inspect df after running this, you can see that the column names have been changed.
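As an aside, here is a minimal sketch of a shortcut: read_csv also accepts a names parameter, which assigns the column names at read time and so makes the rename step unnecessary. The three-row sample below is made up for illustration (it is not the real housing.data), and sample, sample_labels, and sample_df are names introduced only for this example.

```python
import io
import pandas as pd

# Made-up, whitespace-separated sample with no header row (not the real housing.data).
sample = io.StringIO(
    "0.1 18.0 24.0\n"
    "0.2  0.0 21.6\n"
    "0.3  0.0 34.7\n"
)
sample_labels = ["CRIM", "ZN", "MEDV"]

# header=None says the file has no header row; names= supplies the column names directly.
sample_df = pd.read_csv(sample, header=None, sep=r"\s+", names=sample_labels)
print(sample_df.columns.tolist())
```

For the real file, the same call with the full 14-element labels list should give the same result as the rename approach; the dictionary approach remains handy when only some of the columns need new names.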
While we are at it, let's use this data to roughly estimate house prices.
Running the following code shows that every column is numeric and that there are no missing values. It also displays summary statistics. Try it if you like.
from IPython.display import display
# Show the data type of each column
display(df.dtypes)
# Show the number of missing values in each column
display(df.isnull().sum())
# Show summary statistics
display(df.describe())
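If a single yes/no answer is more convenient than reading the tables, the same two checks can be condensed as in the sketch below. The small frame toy is a made-up stand-in for df, and all_numeric and n_missing are names introduced only for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Made-up stand-in for df (illustration only).
toy = pd.DataFrame({
    "RM": [6.5, 6.4, 7.1],
    "CHAS": [0, 0, 1],
    "MEDV": [24.0, 21.6, 34.7],
})

# True only if every column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in toy.dtypes)
# Total count of missing values across the whole frame.
n_missing = int(toy.isnull().sum().sum())
print(all_numeric, n_missing)
```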
Normally you would preprocess the data while examining its statistics and only then feed it to a machine learning algorithm, but I skip that here. This is only a rough estimate, after all.
I am leaving out a lot. As a minimum, the data is standardized (and reduced to 10 principal components) and the model is evaluated on held-out test data, but no hyperparameters are tuned. The evaluation simply uses the mean squared error (MSE). The code is below.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline
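# Pipeline settings: standardize, reduce to 10 principal components, then fit a linear regression
pipe = Pipeline([
    ("scl", StandardScaler()),
    ("pca", PCA(n_components=10)),
    # The normalize argument was removed in scikit-learn 1.2; the default behaves like normalize=False
    ("lr", LinearRegression()),
])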
# Split into features (everything except MEDV) and target, holding out 30% for testing
xtrain, xtest, ytrain, ytest = train_test_split(df.drop(columns="MEDV"), df["MEDV"], test_size=0.3, random_state=1)
# Train the model
pipe.fit(X=xtrain, y=ytrain)
# Predict prices for the test set
ypred = pipe.predict(xtest)
# Evaluate the model (mean squared error)
display(mean_squared_error(ytest, ypred))
# Plot the results
result = pd.DataFrame(columns=["index", "true", "pred"])
result["index"] = range(len(ytest))
result["true"] = ytest.tolist()
result["pred"] = ypred
plt.figure(figsize=(15,5))
plt.scatter(result["index"], result["true"], marker="x", label="true")
plt.scatter(result["index"], result["pred"], marker="v", label="predict")
plt.xlabel("ID")
plt.ylabel("Median price")
plt.grid()
plt.legend()
plt.show()
Running this gave a mean squared error of 21.19. Without examining the data properly I cannot say whether that is good or bad, but for now I was able to measure the gap between the predicted and true prices.
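A squared error is hard to read because its unit is the square of the target's unit. Taking the square root gives the root mean squared error (RMSE), which is in the same unit as MEDV. A small sketch using only the number reported here; the variable names mse and rmse are introduced for this example:

```python
import math

mse = 21.19            # mean squared error reported in the text
rmse = math.sqrt(mse)  # root mean squared error, same unit as MEDV
print(round(rmse, 2))
```

Assuming MEDV is recorded in thousands of dollars, as the dataset description states, an RMSE of roughly 4.6 would mean a typical error on the order of 4,600 dollars. In the notebook itself, the same value could be obtained by taking the square root of mean_squared_error(ytest, ypred).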
The predicted and true values are also plotted in the graph produced by the code above. At a glance, the higher the true price, the larger the deviation, with the predictions falling below the true values.
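One way to check that visual impression numerically is to compare the mean residual (true minus predicted) for the lower and upper halves of the true prices. The sketch below uses made-up numbers as stand-ins for ytest and ypred; they are not results from the model above.

```python
import numpy as np

# Made-up stand-ins for ytest and ypred (illustration only, not real results).
true = np.array([12.0, 18.0, 22.0, 25.0, 40.0, 50.0])
pred = np.array([13.0, 17.5, 22.5, 24.0, 33.0, 41.0])

residual = true - pred          # positive means the model underestimated
high = true >= np.median(true)  # split at the median true price

print("mean residual, lower half:", residual[~high].mean())
print("mean residual, upper half:", residual[high].mean())
```

With the real data, true would be ytest.to_numpy() and pred would be ypred. A clearly positive mean residual in the upper half would support the reading that expensive houses are underestimated.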