Kaggle ~ Housing Analysis ③ ~ Part1

1. Introduction

This is my third attempt at the housing price analysis. Up to my previous attempt, the score stayed around 0.17 and would not improve any further, even when I changed the model.

This time, I followed CRISP-DM, a standard process for data analysis.

Standard processes for data analysis include CRISP-DM and KDD. KDD focuses more on the data analysis part than CRISP-DM does (an explanation of KDD is omitted this time).

The CRISP-DM process proceeds in the following order: (1) business understanding → (2) data understanding → (3) data preparation → (4) modeling → (5) evaluation → (6) deployment.

Figure 1: CRISP-DM

I would like to introduce what I thought about at each of these steps. Since this is Part 1, the write-up will be spread over several posts.

2. Business understanding

The task in this competition is to predict the sale price of a house. So I first imagined which factors would affect the price of a house.

==================== Imagination below ====================
"Location" in general: close to urban areas and train stations, convenient transportation, luxury homes
"House size": site area, number of floors, building size
"Included": with a pool, tennis court, etc.
"New construction" or "used": I feel this is quite important (how much does age matter?)
"Quality": I think the quality of the materials is an important factor

It is hard to list every factor, but I think imagining them like this is very important for prediction.

3. Data understanding

Now, at last, let's look at the data provided on Kaggle.

import pandas as pd

# 1-1. Read data
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
# Show the first five rows of the training data
df_train.head()

Output result

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
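
The first rows above already show NaN in columns such as Alley, PoolQC, Fence and MiscFeature, so counting missing values per column is a natural next check in the data understanding step. The following is only a minimal sketch: the helper name missing_summary is hypothetical (it is not part of the original notebook), and the small demo frame just reuses three of the values printed above. In the notebook you would pass df_train instead.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and ratio of missing values per column, largest first.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "ratio": counts / len(df),
    })
    # Keep only the columns that actually contain missing values.
    summary = summary[summary["missing"] > 0]
    return summary.sort_values("missing", ascending=False)


if __name__ == "__main__":
    # Stand-in frame using values from the first three rows shown above.
    # In the Kaggle notebook, call missing_summary(df_train) instead.
    demo = pd.DataFrame({
        "LotFrontage": [65.0, 80.0, 68.0],
        "Alley": [None, None, None],
        "PoolQC": [None, None, None],
    })
    print(missing_summary(demo))
```

Columns with a very high missing ratio are candidates for special handling in the preprocessing step.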

# 1-2. Check data structure
print(df_train.shape)
print(df_test.shape)
df_train.columns

Output result

(1460, 81)
(1459, 80)
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')

The training data has 81 columns: Id, the target SalePrice, and 79 explanatory variables. The test data has the same columns except SalePrice.
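
The factors imagined in the business understanding section can be matched against these column names. The sketch below is one way to do that, under stated assumptions: the mapping FACTOR_COLUMNS and the function factor_correlations are hypothetical names, and which column belongs to which factor is my own guess from the names alone. It reports the Pearson correlation with SalePrice for the numeric candidates and skips the non-numeric ones.

```python
import pandas as pd

# Hypothetical mapping from the imagined factors to columns in df_train.
# The assignment of columns to factors is a guess based on the names alone.
FACTOR_COLUMNS = {
    "location": ["Neighborhood", "MSZoning"],
    "house size": ["LotArea", "GrLivArea", "TotalBsmtSF"],
    "included": ["PoolArea", "MiscFeature"],
    "new or used": ["YearBuilt", "YearRemodAdd"],
    "quality": ["OverallQual", "OverallCond"],
}


def factor_correlations(df: pd.DataFrame, target: str = "SalePrice") -> pd.DataFrame:
    """Correlation of each numeric candidate column with the target, per factor."""
    rows = []
    for factor, columns in FACTOR_COLUMNS.items():
        for col in columns:
            # Skip columns that are absent or non-numeric (e.g. Neighborhood).
            if col in df.columns and pd.api.types.is_numeric_dtype(df[col]):
                rows.append((factor, col, df[col].corr(df[target])))
    return pd.DataFrame(rows, columns=["factor", "column", "corr"])


# In the Kaggle notebook (df_train loaded as above):
# print(factor_correlations(df_train))
```

Categorical columns such as Neighborhood need a different check (for example, comparing SalePrice across groups), which fits better after preprocessing.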

That is all for this post, for reasons of space. Next time, we will finally get to data preprocessing.
