Theme

At work, where we operate a real estate system as a service, the topic came up that there is nothing to lose by doing some hands-on practice at the working level. That is why we decided to take on Kaggle's famous "House Prices" problem together. I decided to post my notes from reading the code line by line to Qiita, because they will probably be useful later if I write them down properly. It is more of a memo than a commentary, but I hope it helps someone somewhere.

Original theme: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Referenced article: https://yolo-kiyoshi.com/2018/12/17/post-1003/

Today's work

Library preparation

I will explain each library when it is first used in the work, so for now I simply copied this block as a kind of magic spell.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import (
    LinearRegression,
    Ridge,
    Lasso
)
%matplotlib inline

Data capture

The actual work starts here. First, read the CSV files to be used and put them into shape. For the time being, copy the block below as it is. I will explain it one piece at a time.

#Data reading
train = pd.read_csv('train.csv') #Training data
test = pd.read_csv('test.csv') #test data
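#Merge training data and test data
train['WhatIsData'] = 'Train' #Marker: this row came from train.csv
test['WhatIsData'] = 'Test' #Marker: this row came from test.csv
test['SalePrice'] = 9999999999 #Placeholder, because test.csv has no SalePrice column
alldata = pd.concat([train,test],axis=0).reset_index(drop=True)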
print('The size of train is : ' + str(train.shape))
print('The size of test is : ' + str(test.shape))

Read CSV file

Applicable source: train = pd.read_csv('train.csv') #Training data
Description: Using pandas, imported with "import pandas as pd", read the CSV file and store it in the variable "train" as a DataFrame. My personal interpretation is that pandas is the go-to library for easily processing tabular, spreadsheet-like data.
Reference: https://dividable.net/programming/python-pandas/
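
To check for myself what read_csv gives back, here is a minimal sketch. It does not use the real train.csv. Instead it reads a tiny CSV held in memory, so the text in csv_text, its illustrative values, and the variable name sample are my own assumptions for this memo.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: a tiny CSV kept in memory (illustrative values).
csv_text = "Id,LotArea,SalePrice\n1,8450,208500\n2,9600,181500\n"

# read_csv accepts a file path or a file-like object and returns a DataFrame.
sample = pd.read_csv(io.StringIO(csv_text))

print(type(sample).__name__)   # the class name of the returned object
print(list(sample.columns))    # the header row becomes the column names
```

With the real file, the only difference is that the path 'train.csv' is passed instead of the in-memory text.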

Store the same value in every row of a column

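Applicable source: train['WhatIsData'] = 'Train'
Description: Add the column "WhatIsData" (an item to distinguish whether a row is derived from test or train) to both train and test, and add "SalePrice" (which exists in train but not originally in test) to test with a placeholder value. Assigning a single value to a column stores that value in every row. After this, train and test have the same columns.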
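
A small sketch of what this assignment does, using two made-up miniature tables. The names train_demo and test_demo and their values are assumptions for illustration only. The column names and the placeholder value are the ones from the code above.

```python
import pandas as pd

# Hypothetical miniature train and test tables (values are made up).
train_demo = pd.DataFrame({"Id": [1, 2], "SalePrice": [208500, 181500]})
test_demo = pd.DataFrame({"Id": [3, 4]})  # no SalePrice, like test.csv

# Assigning a single value creates the column and fills every row with it.
train_demo["WhatIsData"] = "Train"
test_demo["WhatIsData"] = "Test"
test_demo["SalePrice"] = 9999999999  # placeholder, not a real price

# Both tables now have the same set of columns.
same_columns = set(train_demo.columns) == set(test_demo.columns)
print(same_columns)
print(test_demo)
```

Because the placeholder is not a real price, it seems wise to keep it out of any calculation on SalePrice later on.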

Combine the test data and train data

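Applicable source: alldata = pd.concat([train,test],axis=0).reset_index(drop=True)
Description: Prepare alldata, which will be used later.
concat: Concatenates data. With axis=0, which is also the default when axis is not specified, the data is stacked vertically, like appending rows.
concat Reference: http://sinhrks.hatenablog.com/entry/2015/01/28/073327
index: Basic knowledge for handling data in the first place. It is the label (by default a sequential number) attached to each row so that the row can be processed. After concat, the original row numbers of train and test are kept and therefore overlap, so reset_index(drop=True) renumbers the rows from 0 and discards the old index.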
index Reference: https://techacademy.jp/magazine/24150
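
The effect of reset_index is easier to see on a tiny example. This is only a sketch with made-up tables. The names train_demo, test_demo, stacked, alldata_demo, and train_again are hypothetical, and the last step (pulling the train rows back out with the WhatIsData column) is my guess at how alldata might be used later, not something taken from the referenced article.

```python
import pandas as pd

# Hypothetical miniature train and test tables (values are made up).
train_demo = pd.DataFrame({"Id": [1, 2], "WhatIsData": ["Train", "Train"]})
test_demo = pd.DataFrame({"Id": [3, 4], "WhatIsData": ["Test", "Test"]})

# axis=0 stacks the rows; each table keeps its own row numbers.
stacked = pd.concat([train_demo, test_demo], axis=0)
print(list(stacked.index))       # [0, 1, 0, 1] -> duplicated labels

# reset_index(drop=True) renumbers from 0 and throws the old index away.
alldata_demo = stacked.reset_index(drop=True)
print(list(alldata_demo.index))  # [0, 1, 2, 3]

# Assumption about later use: recover the train rows via the marker column.
train_again = alldata_demo[alldata_demo["WhatIsData"] == "Train"]
print(train_again["Id"].tolist())  # [1, 2]
```

Without drop=True, the old index would be kept as an extra column named "index", which is not wanted here.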

Display the summary of the captured data

Applicable source: print('The size of train is : ' + str(train.shape))
print: Displays the contents of () on the screen.
.shape: Returns the dimensions of the data as a tuple, here (number of rows, number of columns). str() converts that tuple to text so that it can be joined to the message with +.
.shape Reference: http://www.kamishima.net/mlmpyja/nbayes2/shape.html
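
As a quick check of what .shape returns, here is a minimal sketch on a made-up table. The name demo and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table with 3 rows and 2 columns (values are made up).
demo = pd.DataFrame({"Id": [1, 2, 3], "LotArea": [8450, 9600, 11250]})

# .shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = demo.shape

message = 'The size of demo is : ' + str(demo.shape)
print(message)  # The size of demo is : (3, 2)
```

Note that .shape is an attribute, not a method, so it is written without ().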

That's it.

That's all for today. I spend about one hour a week putting this together, so progress is as slow as a turtle, but thank you for following along.

[Hands-on for beginners] Read kaggle's "Forecasting Home Prices" line by line (Part 1: Reading data)