This page assumes a standalone setup on Windows, but I have also written an up-to-date version of the setup that uses Sakura's VPS. Take your time with it! :relaxed: LAMP + WP + Py setup memo 2018
An up-to-date 2018 version (using Visual Studio) for building a standalone environment on Windows is also available! Take your time with it! :relaxed: Notes on Python environment construction and SQL execution examples against a DB, 2018
Since this post gets a lot of views, I am reviewing the whole thing. As expected, the number of Japanese-language pages on these topics has grown considerably over the past three years :relaxed:
Installation location
C:\Python
https://www.python.org/downloads/windows/
Check "Add Python 3.x to PATH". Then the command prompt will recognize python. I chose Customize installation because I wanted to change the installation location.
When you press the start button and type "en" (= environment), "Edit system environment variables" will appear, so make sure that the installation location is recorded.
When you press the start button and type "cmd", the "command prompt" will appear. By setting the environment variable, the command python Nantoka can be passed.
Check the Python version
Microsoft Windows [Version 10.0.17134.523]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\yoshi> python -V
Python 3.7.2
List the libraries currently installed (nothing yet)
C:\Users\yoshi> pip freeze
C:\Users\yoshi>
pyodbc is used for the DB connection; pandas is a library that makes it easy to read and analyze CSV and XLS files.
For now, install pandas and pyodbc.
C:\Users\yoshi> python -m pip install --upgrade pip
C:\Users\yoshi> pip install pandas
C:\Users\yoshi> pip install pyodbc
List the libraries now installed
C:\Users\yoshi>pip freeze
numpy==1.16.1
pandas==0.24.0
pyodbc==4.0.25
python-dateutil==2.7.5
pytz==2018.9
six==1.12.0
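You can also check the versions from inside Python itself; a small sketch (just an illustration, the numbers will match whatever pip installed):
import pandas
import pyodbc

print(pandas.__version__)   #e.g. 0.24.0
print(pyodbc.version)       #e.g. 4.0.25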
Basic MS-DOS stuff: change directories before running a Python program.
That is why I use this directory as my working location.
C:\Users\yoshi> cd /d "D:\OneDrive\document\Project\Python"
D:\OneDrive\document\Project\Python>
Move to the directory containing the program with `cd C:\xxxx\xxxx..`, then run it by typing `python xxxx.py`.
main.py
print("Hello world!")
You did it!
D:\OneDrive\document\Project\Python> python main.py
Hello world!
A summary of Python imports. Suppose you created a handy library called "PyUtils.py" and stored it in a directory called "lib". You can then call that library (user-defined functions) from main.py with the keyword `import`, and there are two ways to do it. Pattern 2 is shorter to write, but which one should you use?
Current directory status
D:\OneDrive\document\Project\Python> tree /F
D:.
│ main.py
└─ lib
PyUtils.py
main.py
import lib.PyUtils
print("Hello world!")
instance = lib.PyUtils.HelloClass()
instance.HelloWorldByPyUtils()
lib/PyUtils.py
class HelloClass:
    def HelloWorldByPyUtils(self):
        print("Hello PyUtils")
test
D:\OneDrive\document\Project\Python> python main.py
Hello world!
Hello PyUtils
main.py
from lib.PyUtils import *
print("Hello world!")
instance = HelloClass()
instance.HelloWorldByPyUtils()
lib/PyUtils.py
class HelloClass:
    def HelloWorldByPyUtils(self):
        print("Hello PyUtils")
test
D:\OneDrive\document\Project\Python> python main.py
Hello world!
Hello PyUtils
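As a side note, a third pattern you will often see is to import only the names you need. It keeps the call sites as short as pattern 2 without pulling everything into the namespace the way `import *` does. A minimal sketch using the same files as above:
main.py
from lib.PyUtils import HelloClass

print("Hello world!")
instance = HelloClass()
instance.HelloWorldByPyUtils()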
Visual Studio Code really is nice :relaxed: The names below are nice too, because they were made with a name generator (the generator is handy for this kind of thing).
(Here is a copy-and-paste-ready version.)
Affiliation | Full name | Full name (kana) | Mail address | Street address | Birthday |
---|---|---|---|---|---|
recruit | Honda Mamoru | Honda protect | [email protected] | Hokkaido | 2019/2/1 |
recruit | Kengo Hayasaka | Hayasaka Kengo | [email protected] | Aomori Prefecture | 2019/2/2 |
recruit | Shin Tateishi | Proceed | [email protected] | Iwate Prefecture | 2019/2/3 |
recruit | Reina Washio | Takayuki Washio | [email protected] | Miyagi Prefecture | 2019/2/4 |
recruit | Takuji Guo | Lottery | [email protected] | Akita | 2019/2/5 |
recruit | Mikio Ogushi | Mikio Ogushi | [email protected] | Yamagata Prefecture | 2019/2/6 |
recruit | Norihiko Shimanuki | Shimanuki Norihiko | [email protected] | Fukushima Prefecture | 2019/2/7 |
recruit | Hidetaka Kojo | Hidetaka Kojo | [email protected] | Ibaraki Prefecture | 2019/2/8 |
recruit | Takeshi Kurei | Kujirai Kenji | [email protected] | Tochigi Prefecture | 2019/2/9 |
recruit | Genta Kurusu | Kurusu Genta | [email protected] | Gunma Prefecture | 2019/2/10 |
recruit | Takahiro Shioi | Shioi Takahiro | [email protected] | Saitama | 2019/2/11 |
recruit | Kiyoto Moroda | Kiyohito Morota | [email protected] | Chiba | 2019/2/12 |
recruit | Jinguji Yoshifumi | Jinguuji Yoshifumi | [email protected] | Tokyo | 2019/2/13 |
recruit | Ikemizu Ryo | Satori Ikemizu | [email protected] | Kanagawa Prefecture | 2019/2/14 |
recruit | Yusuke Kashii | Yusuke | [email protected] | Niigata Prefecture | 2019/2/15 |
recruit | Shin Shugo | Mashugo | [email protected] | Toyama Prefecture | 2019/2/16 |
recruit | Hirohiko Nishizono | Nishizono Hirohiko | [email protected] | Ishikawa Prefecture | 2019/2/17 |
recruit | Nobuhiro Kuroda | I did it Nobuhiro | [email protected] | Fukui prefecture | 2019/2/18 |
recruit | Masaharu Nezuka | Masaharu Nezuka | [email protected] | Yamanashi Prefecture | 2019/2/19 |
recruit | Nakamuro Ikki | Kazuki Nakamuro | [email protected] | Nagano Prefecture | 2019/2/20 |
recruit | Kenji Mimata | Mitsumata Kenji | [email protected] | Gifu Prefecture | 2019/2/21 |
recruit | Chihiro Mezaki | Chihiro Mesaki | [email protected] | Shizuoka Prefecture | 2019/2/22 |
recruit | Yugo Takita | Yugo Takida | [email protected] | Aichi prefecture | 2019/2/23 |
recruit | Eight truth | Makoto Yamoto | [email protected] | Mie Prefecture | 2019/2/24 |
recruit | Akihiro Ori | Orii Akihiro | [email protected] | Shiga Prefecture | 2019/2/25 |
recruit | Hirotaka Nomaru | No Umaru Hirotaka | [email protected] | Kyoto | 2019/2/26 |
recruit | Shun | Jun Yasushi | [email protected] | Osaka | 2019/2/27 |
recruit | Kojima Kiichiro | Kojima Kiichirou | [email protected] | Hyogo prefecture | 2019/2/28 |
recruit | Mitsuya Shimotsuura | Shimotsuura Mitsuya | [email protected] | Nara Prefecture | 2019/3/1 |
recruit | Takayuki Azuchi | And Takayuki | [email protected] | Wakayama Prefecture | 2019/3/2 |
The DB uses SQL Server. Note that the SQL Server engine and Management Studio (the client software for operating it) are separate things. SQL Server Engine Download
Username: python, Password: python. Postscript: when using Django, also check "db_ddladmin" here (so that CREATE TABLE and the like can be issued).
Even after you add a user named python, you cannot log in until you switch to SQL Server authentication mode. If you change this setting, don't forget to restart the server.
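Before wiring everything together, it may help to confirm that the python/python login actually works from pyodbc. A minimal sketch, assuming the same SQLExpress instance and the database named db used later in this article (the file name connection_test.py is just an example):
connection_test.py
import pyodbc

#If SQL Server authentication and the python login are set up correctly, this prints the server version string
con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
cur = con.cursor()
cur.execute("SELECT @@VERSION")
print(cur.fetchone()[0])
con.close()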
CREATETABLE.sql
CREATE TABLE Customer(
[Affiliation] NVARCHAR(20) NULL
, [Full name] NVARCHAR(20) NULL
, [Name (Kana)] NVARCHAR(20) NULL
, [mail address] NVARCHAR(20) NULL
, [Street address] TEXT NULL
, [birthday] NVARCHAR(20) NULL
);
main.py
import pyodbc
import pandas as pd
import lib.PyUtils
#Issue the data deletion query
SQL_TEMPLATE = "DELETE FROM [dbo].[client]" #SQL template
editSql = SQL_TEMPLATE #fill in the SQL template
lib.PyUtils.ExecuteSQLBySQLServer(editSql) #issue the DELETE statement
#The CSV holding the data to load into the DB is UTF-8
df = pd.read_csv(r"data\customer.csv")
#Issue the data insertion queries
SQL_TEMPLATE = "INSERT INTO [dbo].[client]([Affiliation],[Fullname],[Fullname(かな)],[mailaddress],[Streetaddress],[birthday])VALUES('{0}','{1}','{2}','{3}','{4}','{5}')"
for line in df.values:
    editSql = SQL_TEMPLATE #SQL template
    for i,col in enumerate(line): #fill in the SQL template
        editSql = editSql.replace('{' + str(i) + '}', col)
    lib.PyUtils.ExecuteSQLBySQLServer(editSql) #issue the INSERT statement
#Issue a select query
SQL_TEMPLATE = "SELECT * FROM [dbo].[client]" #SQL template
editSql = SQL_TEMPLATE #fill in the SQL template
df = lib.PyUtils.ReadQueryBySQLServer(editSql) #issue the SELECT statement
for line in df.values:
    print(','.join(line)) #format and print the SQL results
lib/PyUtils.py
import pyodbc
import pandas as pd
def ExecuteSQLBySQLServer(sql):
    con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
    cur = con.cursor()
    cur.execute(sql)
    con.commit()
    con.close()
def ReadQueryBySQLServer(sql):
    con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
    df = pd.io.sql.read_sql(sql, con)
    con.close()
    return(df)
test
D:\OneDrive\document\Project\Python> python main.py
recruit,Honda Mamoru,Honda protect,[email protected],Hokkaido,2019/2/1
recruit,Kengo Hayasaka,Hayasaka Kengo,[email protected],Aomori Prefecture,2019/2/2
recruit,Shin Tateishi,Proceed,[email protected],Iwate Prefecture,2019/2/3
recruit,Reina Washio,Takayuki Washio,[email protected],Miyagi Prefecture,2019/2/4
recruit,Takuji Guo,Lottery,[email protected],Akita,2019/2/5
recruit,Mikio Ogushi,Mikio Ogushi,[email protected],Yamagata Prefecture,2019/2/6
recruit,Norihiko Shimanuki,Shimanuki Norihiko,[email protected],Fukushima Prefecture,2019/2/7
recruit,Hidetaka Kojo,Hidetaka Kojo,[email protected],Ibaraki Prefecture,2019/2/8
recruit,Takeshi Kurei,Kujirai Kenji,[email protected],Tochigi Prefecture,2019/2/9
recruit,Genta Kurusu,Kurusu Genta,[email protected],Gunma Prefecture,2019/2/10
recruit,Takahiro Shioi,Shioi Takahiro,[email protected],Saitama,2019/2/11
recruit,Kiyoto Moroda,Kiyohito Morota,[email protected],Chiba,2019/2/12
recruit,Jinguji Yoshifumi,Jinguuji Yoshifumi,[email protected],Tokyo,2019/2/13
recruit,Ikemizu Ryo,Satori Ikemizu,[email protected],Kanagawa Prefecture,2019/2/14
recruit,Yusuke Kashii,Yusuke,[email protected],Niigata Prefecture,2019/2/15
recruit,Shin Shugo,Mashugo,[email protected],Toyama Prefecture,2019/2/16
recruit,Hirohiko Nishizono,Nishizono Hirohiko,[email protected],Ishikawa Prefecture,2019/2/17
recruit,Nobuhiro Kuroda,I did it Nobuhiro,[email protected],Fukui prefecture,2019/2/18
recruit,Masaharu Nezuka,Masaharu Nezuka,[email protected],Yamanashi Prefecture,2019/2/19
recruit,Nakamuro Ikki,Kazuki Nakamuro,[email protected],Nagano Prefecture,2019/2/20
recruit,Kenji Mimata,Mitsumata Kenji,[email protected],Gifu Prefecture,2019/2/21
recruit,Chihiro Mezaki,Chihiro Mesaki,[email protected],Shizuoka Prefecture,2019/2/22
recruit,Yugo Takita,Yugo Takida,[email protected],Aichi prefecture,2019/2/23
recruit,Eight truth,Makoto Yamoto,[email protected],Mie Prefecture,2019/2/24
recruit,Akihiro Ori,Orii Akihiro,[email protected],Shiga Prefecture,2019/2/25
recruit,Hirotaka Nomaru,No Umaru Hirotaka,[email protected],Kyoto,2019/2/26
recruit,Shun,Jun Yasushi,[email protected],Osaka,2019/2/27
recruit,Kojima Kiichiro,Kojima Kiichirou,[email protected],Hyogo prefecture,2019/2/28
recruit,Mitsuya Shimotsuura,Shimotsuura Mitsuya,[email protected],Nara Prefecture,2019/3/1
recruit,Takayuki Azuchi,And Takayuki,[email protected],Wakayama Prefecture,2019/3/2
D:\OneDrive\document\Project\Python>
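As an aside, instead of building the INSERT statement with str.replace, pyodbc can also take ? placeholders and a list of rows via executemany, which avoids quoting problems when a value contains an apostrophe. This is not the method used above, just an alternative sketch assuming the same table, CSV, and connection string:
import pyodbc
import pandas as pd

df = pd.read_csv(r"data\customer.csv")
sql = ("INSERT INTO [dbo].[client]"
       "([Affiliation],[Fullname],[Fullname(かな)],[mailaddress],[Streetaddress],[birthday])"
       " VALUES (?,?,?,?,?,?)")

con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
cur = con.cursor()
cur.executemany(sql, df.values.tolist())  #one placeholder per column, one row per CSV line
con.commit()
con.close()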
A data frame is, conceptually, array data that you can manipulate much like a table in SQL: you can drop columns or replace NaN with 0, and once you get used to it, it is easier to work with than a plain array. Data frames are also used a lot in the R language. Pandas Official Reference
main.py
import pyodbc
import pandas as pd
import lib.PyUtils
#Issue the data deletion query
SQL_TEMPLATE = "DELETE FROM [dbo].[client]" #SQL template
editSql = SQL_TEMPLATE #fill in the SQL template
lib.PyUtils.ExecuteSQLBySQLServer(editSql) #issue the DELETE statement
#The CSV holding the data to load into the DB is UTF-8
df = pd.read_csv(r"data\customer.csv")
##################################
#Now process the data with pandas#
##################################
From here on it is string processing, so I will do it in the Python console. In practice I think there are about three data frame operations you need for business work (the fourth one below comes up less often).
Python3.7Console
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
#Prepare sample data
>>> df = pd.DataFrame([np.nan, 2, np.nan, 0])
>>> df
0
0 NaN
1 2.0
2 NaN
3 0.0
# No.1 Replace NaN (missing values) with 0
>>> df = df.fillna(0)
>>> df
0
0 0.0
1 2.0
2 0.0
3 0.0
# No.2 Convert to string and zero-pad to 5 digits
# astype(str) ...type conversion to string
# .str. ...accessor for string methods (lets you use things like .split)
>>> df[0] = df[0].astype(str).str.zfill(5)
>>> df
0
0 000.0
1 002.0
2 000.0
3 000.0
# No.3 Convert to a null that the DB understands
#pandas has no NULL (undefined); NaN (Not Available, a missing value) is not what the DB wants
#In the standard library, None corresponds to NULL. To store NULL in sqlite3, pass None.
>>> df = pd.DataFrame([np.nan, 2, np.nan, 0])
>>> df = df.where(pd.notna(df), None)
>>> df
# No.4 Column deletion (when axis is set to 1, it apparently means "along the columns")
>>> df = df.drop([0], axis=1)
>>> df
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
When preprocessing statistical data, a common first step is to fill missing values (≈ NULL values) with something like the mean or the median.
For example, Kaggle's regression-analysis notebook for home price forecasts (https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard) has a part where missing values are filled with the median per group. That pattern seems very likely to come up in business too. I can picture how to do it in SQL, but ...
LotFrontage is the width of the frontage (the width of the front of the house). What value should we fill in when LotFrontage is missing? One idea is that nearby houses, i.e. houses in the same 'Neighborhood', will have roughly similar LotFrontage values. So this time we use the median LotFrontage of houses in the same Neighborhood as the fill value.
The table below shows the median per "Neighborhood". Whenever an NA is found, we want to fill it with the corresponding per-group median shown below.
Neighborhood | LotFrontage(median) |
---|---|
Blmngtn | 43 |
Blueste | 24 |
BrDale | 21 |
BrkSide | 52 |
ClearCr | 80 |
CollgCr | 70 |
Crawfor | 74 |
Edwards | 64.5 |
Gilbert | 65 |
IDOTRR | 60 |
MeadowV | 21 |
Mitchel | 73 |
NAmes | 73 |
NoRidge | 91 |
NPkVill | 24 |
NridgHt | 88.5 |
NWAmes | 80 |
OldTown | 60 |
Sawyer | 71 |
SawyerW | 66.5 |
Somerst | 73.5 |
StoneBr | 61.5 |
SWISU | 60 |
Timber | 85 |
Veenker | 68 |
Fill NA with the median (note that each group gets a different value)
#Take the median LotFrontage per neighborhood (Neighborhood) and fill the missing values with it
print(data['LotFrontage'].isnull().sum()) #missing values: 259
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].apply(lambda x: x.fillna(x.median()))
print(data['LotFrontage'].isnull().sum()) #missing values: 0
For Python's lambda, see "Python lambda is easy to understand". The mental image is: while doing a full vertical scan (the whole data set is processed), as soon as an NA is found it is filled with the median of that row's group.
About lambda
#lambda formal_argument: return_value
#The argument used in this lambda function is named x
#x is the whole set of rows for one of the groups produced by the group-by
#In the end, every NA is filled with its group's median
lambda x: x.fillna(x.median())
"Neighborhood" has 25 records starting with Blmngtn and ending with Veenker. When this is grouped, it is internally divided into 25 tables for each group (the total number of details is obtained by adding all the details of 25 tables). Then, the contents of the parentheses of the apply function are applied to each of the 25 tables. ** Well, what you're doing is the same as SQL ** It's written in O'Reilly's diagram that there are few steps ("apply" means "apply") once you understand the point of view. For more information, buy this book properly w It's pretty easy to understand.
This cheat sheet makes it look easy, but statistics still has a high barrier to entry. As with programming in general, copy-pasting only helps after you understand what you are doing. I couldn't pick up statistics on my own, so I took TechAcademy's data science course (139,000 yen for 4 weeks). It basically follows [a new textbook on statistics learned with Python](https://www.amazon.co.jp/dp/4798155063), but since you also learn how to fight in Kaggle, I can recommend it as online teaching material. I was able to ask lots of questions that aren't written in the book, including reverse questions like "which part of math do I have to go back to in order to solve this?". I had been doing my best on my own, but I had ended up doctor-shopping from book to book. If that is where you are, you are better off paying for it (once you take that step you find you can actually ask sharp questions :muscle:). It is not easy, but even while working I managed to get through all 456 pages (the whole book) in 4 weeks lol
The point of building a model is to end up with a mathematical function (you build something like y = x^2, and then the computer does the calculation by plugging values into the formula).
Cheat sheet 1 (variable names are inconsistent because this is a collection of snippets)
#◆ Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm            #used later for qqplot / GLM families
import statsmodels.formula.api as smf   #used later for ols / glm
%precision 3 #show 3 decimal places
%matplotlib inline #jupyter only
sns.set()
#◆ Read data
data = pd.read_csv(r'data/kaggle_housing_price_revised.csv')
#◆ Confirmation 1 Data visualization
# 【pairplot】
sns.pairplot(data, hue='Category') #Hue color-codes each specified group
#[Scatter plot with histogram]
sns.jointplot(x = 'temperature', y = 'beer', data = beer, color = 'black')
#[Box plot]
sns.boxplot(x = 'Line', y = 'Density', data = data, color = 'gray')
#【Scatter plot】
fig, ax = plt.subplots()
ax.scatter(x = data['GrLivArea'], y = data['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
#◆ Confirmation 2: graph the percentage of missing values for each feature
all_data_na = (data.isnull().sum() / len(data)) * 100
#Drop features with no missing values (== 0) and sort the rest in descending order
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.ylim([0,100]) #Fixed scale with maximum value set to 100
#◆ Fill in missing values (confirmation 2)
#Fill None:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
#0 fill:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
#Median fill:
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].apply(lambda x: x.fillna(x.median()))
#◆ Deletion of strongly correlated variables
threshold = 0.8
#Build the correlation matrix from data, taking absolute values so negative correlations count as positive (corr() only operates on numeric columns, so corr_matrix.columns shrinks accordingly)
corr_matrix = data.corr().abs()
#Keep only the upper triangle: the diagonal is always 1 and the lower triangle duplicates the upper one (np.triu builds a 0/1 mask, cast to bool, and where() keeps only the entries where the mask is True)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
#From the upper triangle, pick the columns that have at least one correlation above the threshold (SalePrice itself is removed from the drop list right after)
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
to_drop.remove('SalePrice')
data_dropped = data.drop(columns = to_drop)
#◆ Dummy variable
data_dummied = pd.get_dummies(data_dropped)
data_dummied.head()
#◆ Creating a model
#[Least squares method]
lm_model = smf.ols(formula = 'beer ~ temperature', data = beer).fit()
lm_model.summary()
#[Generalized linear model]
glm_model = smf.glm(formula = 'beer ~ temperature', data = data, family=sm.families.Binomial()).fit()
glm_model.summary()
#[AIC check] See whether the AIC is smaller than that of the null model
null_model = smf.ols('beer ~ 1', data = beer).fit()
print('null_model.aic: ', null_model.aic)
print('lm_model.aic: ', lm_model.aic)
print('If the explanatory variable of temperature is included, the AIC will be smaller, so the prediction accuracy will be higher.')
#◆ (Inspection 1) Regression straight line
sns.lmplot(x = 'temperature', y = 'beer', data = beer, scatter_kws = {'color': 'black'}, line_kws = {'color': 'black'})
#[Confirmation 1] The predicted value at 0 degrees should match the intercept
lm_model.predict(pd.DataFrame({'temperature':[0]}))
#[Confirmation 2] The predicted value of 20 degrees Celsius matches the position of the seaborn regression line.
lm_model.predict(pd.DataFrame({'temperature':[20]}))
#◆ (Check 2) Use the chosen model's Q-Q plot to check the normality of the residuals
sm.qqplot(lm_model.resid, line = 's')
#Predicted value and actual SalePrice can be overlaid on a scatter plot
x = range(len(data))
y = data['SalePrice']
plt.scatter(x, y)
plt.scatter(x, lm_model.predict(data))
Random forest is easy to do, but the interpretation of "what happened" is important.
Cheat sheet 2 (variable names are inconsistent because this is a collection of snippets)
#Run a random forest and compute the MSE on the validation data
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
model_random_forest = RandomForestRegressor(criterion='mse').fit(X_train, y_train)
mean_squared_error(y_test, model_random_forest.predict(X_test))
> 1322565474.5486445
#What is this number!? You can sanity-check it as "the mean squared difference between the validation data and the predictions"
#◆ Convince yourself by comparing mean_squared_error against a hand-rolled MSE
np.mean((y_test - model_random_forest.predict(X_test))**2)
> 1322565474.5486445
#After building the model, plot y against y-hat in a scatter plot
fig, ax = plt.subplots()
ax.scatter(x = y_test, y = model_random_forest.predict(X_test))
plt.ylabel('predict', fontsize=13)
plt.xlabel('SalePrice', fontsize=13)
plt.show()
◆ Random forest plot: SalePrice and the predicted values line up along a straight line (predictive performance is reasonable)
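Since the point above is that interpretation matters, one common follow-up is to look at which features the forest actually relied on. A minimal sketch, assuming the model_random_forest and X_train objects from the cheat sheet and that X_train is a DataFrame (so its columns carry the feature names):
import pandas as pd

#Rank the features by the importance the fitted forest assigned to them
importances = pd.Series(model_random_forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))  #the top 10 drivers of the prediction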
Clustering is also easy to do, but the interpretation of "what happened" is important.
Cheat sheet 3 (variable names are inconsistent because this is a collection of snippets)
#Module loading
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
#Data read
dataset = pd.read_csv('data/baseball_salary.csv')
#Keep only the 'batting average' and 'RBI' columns
dataset = dataset[['batting average', 'RBI']]
#Normalize with MinMaxScaler
dataset_scaled = MinMaxScaler().fit(dataset).transform(dataset)
#Draw a scatter plot of batting average and RBI after normalization
fig, ax = plt.subplots()
ax.scatter(x = dataset_scaled[:,0], y = dataset_scaled[:,1])
plt.ylabel('points', fontsize=13)
plt.xlabel('batting average', fontsize=13)
plt.show()
#Determine the number of clusters by looking at the elbow (looks like 6...)
# WCSS: within-cluster sum of squares (the sum of squared distances from each point to its nearest cluster center)
wcss = []
n_comps = np.arange(1, 10)
for k in n_comps:
    #Initialize KMeans. init defaults to "k-means++"
    km = KMeans(n_clusters=k, init='k-means++').fit(dataset_scaled)
    #The inertia_ attribute holds the WCSS value
    wcss.append(km.inertia_)
fig = plt.figure(figsize=(6, 4))
plt.plot(n_comps, wcss)
plt.xlabel("# of clusters")
plt.ylabel("WCSS")
plt.show()
#Fit and draw the scatter plot (the for statement colors each cluster)
km = KMeans(n_clusters=6, init='k-means++').fit(dataset_scaled)
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
colors = ['dodgerblue', 'orange', 'green', 'tomato', 'blueviolet', 'cornflowerblue']
cluster_ids = list(set(km.labels_))
for k in range(len(cluster_ids)):
    average = dataset_scaled[km.labels_ == k, 0]
    points = dataset_scaled[km.labels_ == k, 1]
    ax.scatter(average, points, c=colors[k], s=3, label=f'cluster = {cluster_ids[k]}')
    ax.scatter(average.mean(), points.mean(), marker='*', c=colors[k], s=150)
ax.set_title('Clustering')
ax.set_xlabel('batting average')
ax.set_ylabel('points')
ax.legend(loc='upper right')
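As with the random forest, it helps to tie the clustering result back to something you can interpret. A minimal sketch, assuming the km and dataset objects from the cheat sheet, that attaches each player's cluster id back to the original (un-normalized) rows and summarizes the clusters:
#Attach the cluster id to the original data and summarize each cluster
dataset_labeled = dataset.copy()
dataset_labeled['cluster'] = km.labels_

#Mean batting average / RBI per cluster, plus how many players ended up in each one
print(dataset_labeled.groupby('cluster').agg(['mean', 'count']))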