This page assumes a standalone setup on Windows, but I have also written an up-to-date version of the setup that uses Sakura's VPS. Take your time with it! :relaxed: LAMP + WP + Py setup memo 2018
An up-to-date 2018 version (using Visual Studio) for building a standalone environment on Windows is also available! Take your time with it! :relaxed: Notes on Python environment construction and SQL execution examples against a DB, 2018
Since this post gets a lot of views, I am reviewing the whole thing. As expected, the number of Japanese-language pages on these topics has grown considerably over the past three years :relaxed:
Installation location
C:\Python
https://www.python.org/downloads/windows/
Check "Add Python 3.x to PATH". Then the command prompt will recognize python. I chose Customize installation because I wanted to change the installation location.
When you press the start button and type "en" (= environment), "Edit system environment variables" will appear, so make sure that the installation location is recorded.
When you press the start button and type "cmd", the "command prompt" will appear. By setting the environment variable, the command python Nantoka can be passed.
Check the Python version
Microsoft Windows [Version 10.0.17134.523]
(c) 2018 Microsoft Corporation. All rights reserved.
C:\Users\yoshi> python -V
Python 3.7.2
List the libraries currently installed (nothing yet)
C:\Users\yoshi> pip freeze
C:\Users\yoshi>
pyodbc is used for the DB connection; pandas is a library that makes it easy to read and analyze CSV and XLS files.
For now, install pandas and pyodbc.
C:\Users\yoshi> python -m pip install --upgrade pip
C:\Users\yoshi> pip install pandas
C:\Users\yoshi> pip install pyodbc
List the libraries now installed
C:\Users\yoshi>pip freeze
numpy==1.16.1
pandas==0.24.0
pyodbc==4.0.25
python-dateutil==2.7.5
pytz==2018.9
six==1.12.0
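You can also check the versions from inside Python itself; a small sketch (just an illustration, the numbers will match whatever pip installed):
import pandas
import pyodbc

print(pandas.__version__)   #e.g. 0.24.0
print(pyodbc.version)       #e.g. 4.0.25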
Basic MS-DOS stuff: change directories before running a Python program.
That is why I use this directory as my working location.
C:\Users\yoshi> cd /d "D:\OneDrive\document\Project\Python"
D:\OneDrive\document\Project\Python>
Move to the directory containing the program with `cd C:\xxxx\xxxx..`, then run it by typing `python xxxx.py`.
main.py
print("Hello world!")
You did it!
D:\OneDrive\document\Project\Python> python main.py
Hello world!
A summary of Python imports. Suppose you created a handy library called "PyUtils.py" and stored it in a directory called "lib". You can then call that library (user-defined functions) from main.py with the keyword `import`, and there are two ways to do it. Pattern 2 is shorter to write, but which one should you use?
Current directory status
D:\OneDrive\document\Project\Python> tree /F
D:.
│ main.py
└─ lib
PyUtils.py
main.py
import lib.PyUtils
print("Hello world!")
instance = lib.PyUtils.HelloClass()
instance.HelloWorldByPyUtils()
lib/PyUtils.py
class HelloClass:
    def HelloWorldByPyUtils(self):
        print("Hello PyUtils")
test
D:\OneDrive\document\Project\Python> python main.py
Hello world!
Hello PyUtils
main.py
from lib.PyUtils import *
print("Hello world!")
instance = HelloClass()
instance.HelloWorldByPyUtils()
lib/PyUtils.py
class HelloClass:
    def HelloWorldByPyUtils(self):
        print("Hello PyUtils")
test
D:\OneDrive\document\Project\Python> python main.py
Hello world!
Hello PyUtils
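As a side note, a third pattern you will often see is to import only the names you need. It keeps the call sites as short as pattern 2 without pulling everything into the namespace the way `import *` does. A minimal sketch using the same files as above:
main.py
from lib.PyUtils import HelloClass

print("Hello world!")
instance = HelloClass()
instance.HelloWorldByPyUtils()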
Visual Studio Code really is nice :relaxed: The names below are nice too, because they were made with a name generator (the generator is handy for this kind of thing).
(Here is a copy-and-paste-ready version.)
Affiliation | Full name | Full name (kana) | Mail address | Street address | Birthday |
---|---|---|---|---|---|
recruit | Honda Mamoru | Honda protect | [email protected] | Hokkaido | 2019/2/1 |
recruit | Kengo Hayasaka | Hayasaka Kengo | [email protected] | Aomori Prefecture | 2019/2/2 |
recruit | Shin Tateishi | Proceed | [email protected] | Iwate Prefecture | 2019/2/3 |
recruit | Reina Washio | Takayuki Washio | [email protected] | Miyagi Prefecture | 2019/2/4 |
recruit | Takuji Guo | Lottery | [email protected] | Akita | 2019/2/5 |
recruit | Mikio Ogushi | Mikio Ogushi | [email protected] | Yamagata Prefecture | 2019/2/6 |
recruit | Norihiko Shimanuki | Shimanuki Norihiko | [email protected] | Fukushima Prefecture | 2019/2/7 |
recruit | Hidetaka Kojo | Hidetaka Kojo | [email protected] | Ibaraki Prefecture | 2019/2/8 |
recruit | Takeshi Kurei | Kujirai Kenji | [email protected] | Tochigi Prefecture | 2019/2/9 |
recruit | Genta Kurusu | Kurusu Genta | [email protected] | Gunma Prefecture | 2019/2/10 |
recruit | Takahiro Shioi | Shioi Takahiro | [email protected] | Saitama | 2019/2/11 |
recruit | Kiyoto Moroda | Kiyohito Morota | [email protected] | Chiba | 2019/2/12 |
recruit | Jinguji Yoshifumi | Jinguuji Yoshifumi | [email protected] | Tokyo | 2019/2/13 |
recruit | Ikemizu Ryo | Satori Ikemizu | [email protected] | Kanagawa Prefecture | 2019/2/14 |
recruit | Yusuke Kashii | Yusuke | [email protected] | Niigata Prefecture | 2019/2/15 |
recruit | Shin Shugo | Mashugo | [email protected] | Toyama Prefecture | 2019/2/16 |
recruit | Hirohiko Nishizono | Nishizono Hirohiko | [email protected] | Ishikawa Prefecture | 2019/2/17 |
recruit | Nobuhiro Kuroda | I did it Nobuhiro | [email protected] | Fukui prefecture | 2019/2/18 |
recruit | Masaharu Nezuka | Masaharu Nezuka | [email protected] | Yamanashi Prefecture | 2019/2/19 |
recruit | Nakamuro Ikki | Kazuki Nakamuro | [email protected] | Nagano Prefecture | 2019/2/20 |
recruit | Kenji Mimata | Mitsumata Kenji | [email protected] | Gifu Prefecture | 2019/2/21 |
recruit | Chihiro Mezaki | Chihiro Mesaki | [email protected] | Shizuoka Prefecture | 2019/2/22 |
recruit | Yugo Takita | Yugo Takida | [email protected] | Aichi prefecture | 2019/2/23 |
recruit | Eight truth | Makoto Yamoto | [email protected] | Mie Prefecture | 2019/2/24 |
recruit | Akihiro Ori | Orii Akihiro | [email protected] | Shiga Prefecture | 2019/2/25 |
recruit | Hirotaka Nomaru | No Umaru Hirotaka | [email protected] | Kyoto | 2019/2/26 |
recruit | Shun | Jun Yasushi | [email protected] | Osaka | 2019/2/27 |
recruit | Kojima Kiichiro | Kojima Kiichirou | [email protected] | Hyogo prefecture | 2019/2/28 |
recruit | Mitsuya Shimotsuura | Shimotsuura Mitsuya | [email protected] | Nara Prefecture | 2019/3/1 |
recruit | Takayuki Azuchi | And Takayuki | [email protected] | Wakayama Prefecture | 2019/3/2 |
The DB uses SQL Server. Note that the SQL Server engine and Management Studio (the client software for operating it) are separate things. SQL Server Engine Download
Username: python, Password: python. Postscript: when using Django, also check "db_ddladmin" here (so that CREATE TABLE and the like can be issued).
Even after you add a user named python, you cannot log in until you switch to SQL Server authentication mode. If you change this setting, don't forget to restart the server.
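Before wiring everything together, it may help to confirm that the python/python login actually works from pyodbc. A minimal sketch, assuming the same SQLExpress instance and the database named db used later in this article (the file name connection_test.py is just an example):
connection_test.py
import pyodbc

#If SQL Server authentication and the python login are set up correctly, this prints the server version string
con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
cur = con.cursor()
cur.execute("SELECT @@VERSION")
print(cur.fetchone()[0])
con.close()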
CREATETABLE.sql
CREATE TABLE Customer(
[Affiliation] NVARCHAR(20) NULL
, [Full name] NVARCHAR(20) NULL
, [Name (Kana)] NVARCHAR(20) NULL
, [mail address] NVARCHAR(20) NULL
, [Street address] TEXT NULL
, [birthday] NVARCHAR(20) NULL
);
main.py
import pyodbc
import pandas as pd
import lib.PyUtils
#Issue the data deletion query
SQL_TEMPLATE = "DELETE FROM [dbo].[client]" #SQL template
editSql = SQL_TEMPLATE #fill in the SQL template
lib.PyUtils.ExecuteSQLBySQLServer(editSql) #issue the DELETE statement
#The CSV holding the data to load into the DB is UTF-8
df = pd.read_csv(r"data\customer.csv")
#Issue the data insertion queries
SQL_TEMPLATE = "INSERT INTO [dbo].[client]([Affiliation],[Fullname],[Fullname(かな)],[mailaddress],[Streetaddress],[birthday])VALUES('{0}','{1}','{2}','{3}','{4}','{5}')"
for line in df.values:
    editSql = SQL_TEMPLATE #SQL template
    for i,col in enumerate(line): #fill in the SQL template
        editSql = editSql.replace('{' + str(i) + '}', col)
    lib.PyUtils.ExecuteSQLBySQLServer(editSql) #issue the INSERT statement
#Issue a select query
SQL_TEMPLATE = "SELECT * FROM [dbo].[client]" #SQL template
editSql = SQL_TEMPLATE #fill in the SQL template
df = lib.PyUtils.ReadQueryBySQLServer(editSql) #issue the SELECT statement
for line in df.values:
    print(','.join(line)) #format and print the SQL results
lib/PyUtils.py
import pyodbc
import pandas as pd
def ExecuteSQLBySQLServer(sql):
    con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
    cur = con.cursor()
    cur.execute(sql)
    con.commit()
    con.close()
def ReadQueryBySQLServer(sql):
    con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
    df = pd.io.sql.read_sql(sql, con)
    con.close()
    return(df)
test
D:\OneDrive\document\Project\Python> python main.py
recruit,Honda Mamoru,Honda protect,[email protected],Hokkaido,2019/2/1
recruit,Kengo Hayasaka,Hayasaka Kengo,[email protected],Aomori Prefecture,2019/2/2
recruit,Shin Tateishi,Proceed,[email protected],Iwate Prefecture,2019/2/3
recruit,Reina Washio,Takayuki Washio,[email protected],Miyagi Prefecture,2019/2/4
recruit,Takuji Guo,Lottery,[email protected],Akita,2019/2/5
recruit,Mikio Ogushi,Mikio Ogushi,[email protected],Yamagata Prefecture,2019/2/6
recruit,Norihiko Shimanuki,Shimanuki Norihiko,[email protected],Fukushima Prefecture,2019/2/7
recruit,Hidetaka Kojo,Hidetaka Kojo,[email protected],Ibaraki Prefecture,2019/2/8
recruit,Takeshi Kurei,Kujirai Kenji,[email protected],Tochigi Prefecture,2019/2/9
recruit,Genta Kurusu,Kurusu Genta,[email protected],Gunma Prefecture,2019/2/10
recruit,Takahiro Shioi,Shioi Takahiro,[email protected],Saitama,2019/2/11
recruit,Kiyoto Moroda,Kiyohito Morota,[email protected],Chiba,2019/2/12
recruit,Jinguji Yoshifumi,Jinguuji Yoshifumi,[email protected],Tokyo,2019/2/13
recruit,Ikemizu Ryo,Satori Ikemizu,[email protected],Kanagawa Prefecture,2019/2/14
recruit,Yusuke Kashii,Yusuke,[email protected],Niigata Prefecture,2019/2/15
recruit,Shin Shugo,Mashugo,[email protected],Toyama Prefecture,2019/2/16
recruit,Hirohiko Nishizono,Nishizono Hirohiko,[email protected],Ishikawa Prefecture,2019/2/17
recruit,Nobuhiro Kuroda,I did it Nobuhiro,[email protected],Fukui prefecture,2019/2/18
recruit,Masaharu Nezuka,Masaharu Nezuka,[email protected],Yamanashi Prefecture,2019/2/19
recruit,Nakamuro Ikki,Kazuki Nakamuro,[email protected],Nagano Prefecture,2019/2/20
recruit,Kenji Mimata,Mitsumata Kenji,[email protected],Gifu Prefecture,2019/2/21
recruit,Chihiro Mezaki,Chihiro Mesaki,[email protected],Shizuoka Prefecture,2019/2/22
recruit,Yugo Takita,Yugo Takida,[email protected],Aichi prefecture,2019/2/23
recruit,Eight truth,Makoto Yamoto,[email protected],Mie Prefecture,2019/2/24
recruit,Akihiro Ori,Orii Akihiro,[email protected],Shiga Prefecture,2019/2/25
recruit,Hirotaka Nomaru,No Umaru Hirotaka,[email protected],Kyoto,2019/2/26
recruit,Shun,Jun Yasushi,[email protected],Osaka,2019/2/27
recruit,Kojima Kiichiro,Kojima Kiichirou,[email protected],Hyogo prefecture,2019/2/28
recruit,Mitsuya Shimotsuura,Shimotsuura Mitsuya,[email protected],Nara Prefecture,2019/3/1
recruit,Takayuki Azuchi,And Takayuki,[email protected],Wakayama Prefecture,2019/3/2
D:\OneDrive\document\Project\Python>
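As an aside, instead of building the INSERT statement with str.replace, pyodbc can also take ? placeholders and a list of rows via executemany, which avoids quoting problems when a value contains an apostrophe. This is not the method used above, just an alternative sketch assuming the same table, CSV, and connection string:
import pyodbc
import pandas as pd

df = pd.read_csv(r"data\customer.csv")
sql = ("INSERT INTO [dbo].[client]"
       "([Affiliation],[Fullname],[Fullname(かな)],[mailaddress],[Streetaddress],[birthday])"
       " VALUES (?,?,?,?,?,?)")

con = pyodbc.connect(r'DRIVER={SQL Server};SERVER=localhost\SQLExpress;DATABASE=db;UID=python;PWD=python;')
cur = con.cursor()
cur.executemany(sql, df.values.tolist())  #one placeholder per column, one row per CSV line
con.commit()
con.close()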
A data frame is, conceptually, array data that you can manipulate much like a table in SQL: you can drop columns or replace NaN with 0, and once you get used to it, it is easier to work with than a plain array. Data frames are also used a lot in the R language. Pandas Official Reference
main.py
import pyodbc
import pandas as pd
import lib.PyUtils
#Issue the data deletion query
SQL_TEMPLATE = "DELETE FROM [dbo].[client]" #SQL template
editSql = SQL_TEMPLATE #fill in the SQL template
lib.PyUtils.ExecuteSQLBySQLServer(editSql) #issue the DELETE statement
#The CSV holding the data to load into the DB is UTF-8
df = pd.read_csv(r"data\customer.csv")
##################################
#Now process the data with pandas#
##################################
From here on it is string processing, so I will do it in the Python console. In practice I think there are about three data frame operations you need for business work (the fourth one below comes up less often).
Python3.7Console
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy as np
>>> import pandas as pd
#Prepare sample data
>>> df = pd.DataFrame([np.nan, 2, np.nan, 0])
>>> df
0
0 NaN
1 2.0
2 NaN
3 0.0
# No.1 Replace NaN (missing values) with 0
>>> df = df.fillna(0)
>>> df
0
0 0.0
1 2.0
2 0.0
3 0.0
# No.2 Convert to string and zero-pad to 5 digits
# astype(str) ...type conversion to string
# .str. ...accessor for string methods (lets you use things like .split)
>>> df[0] = df[0].astype(str).str.zfill(5)
>>> df
0
0 000.0
1 002.0
2 000.0
3 000.0
# No.3 Convert to a null that the DB understands
#pandas has no NULL (undefined); NaN (Not Available, a missing value) is not what the DB wants
#In the standard library, None corresponds to NULL. To store NULL in sqlite3, pass None.
>>> df = pd.DataFrame([np.nan, 2, np.nan, 0])
>>> df = df.where(pd.notna(df), None)
>>> df
# No.4 Column deletion (when axis is set to 1, it apparently means "along the columns")
>>> df = df.drop([0], axis=1)
>>> df
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3]
When preprocessing statistical data, a common first step is to fill missing values (≈ NULL values) with something like the mean or the median.
For example, Kaggle's regression-analysis notebook for home price forecasts (https://www.kaggle.com/serigne/stacked-regressions-top-4-on-leaderboard) has a part where missing values are filled with the median per group. That pattern seems very likely to come up in business too. I can picture how to do it in SQL, but ...
LotFrontage is the width of the frontage (the width of the front of the house). What value should we fill in when LotFrontage is missing? One idea is that nearby houses, i.e. houses in the same 'Neighborhood', will have roughly similar LotFrontage values. So this time we use the median LotFrontage of houses in the same Neighborhood as the fill value.
The table below shows the median per "Neighborhood". Whenever an NA is found, we want to fill it with the corresponding per-group median shown below.
Neighborhood | LotFrontage(median) |
---|---|
Blmngtn | 43 |
Blueste | 24 |
BrDale | 21 |
BrkSide | 52 |
ClearCr | 80 |
CollgCr | 70 |
Crawfor | 74 |
Edwards | 64.5 |
Gilbert | 65 |
IDOTRR | 60 |
MeadowV | 21 |
Mitchel | 73 |
NAmes | 73 |
NoRidge | 91 |
NPkVill | 24 |
NridgHt | 88.5 |
NWAmes | 80 |
OldTown | 60 |
Sawyer | 71 |
SawyerW | 66.5 |
Somerst | 73.5 |
StoneBr | 61.5 |
SWISU | 60 |
Timber | 85 |
Veenker | 68 |
Fill NA with the median (note that each group gets a different value)
#Take the median LotFrontage per neighborhood (Neighborhood) and fill the missing values with it
print(data['LotFrontage'].isnull().sum()) #missing values: 259
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].apply(lambda x: x.fillna(x.median()))
print(data['LotFrontage'].isnull().sum()) #missing values: 0
For Python's lambda, see "Python lambda is easy to understand". The mental image is: while doing a full vertical scan (the whole data set is processed), as soon as an NA is found it is filled with the median of that row's group.
About lambda
#lambda formal_argument: return_value
#The argument used in this lambda function is named x
#x is the whole set of rows for one of the groups produced by the group-by
#In the end, every NA is filled with its group's median
lambda x: x.fillna(x.median())
"Neighborhood" has 25 records starting with Blmngtn and ending with Veenker. When this is grouped, it is internally divided into 25 tables for each group (the total number of details is obtained by adding all the details of 25 tables). Then, the contents of the parentheses of the apply function are applied to each of the 25 tables. ** Well, what you're doing is the same as SQL ** It's written in O'Reilly's diagram that there are few steps ("apply" means "apply") once you understand the point of view. For more information, buy this book properly w It's pretty easy to understand.
This cheat sheet makes it look easy, but statistics still has a high barrier to entry. As with programming in general, copy-pasting only helps after you understand what you are doing. I couldn't pick up statistics on my own, so I took TechAcademy's data science course (139,000 yen for 4 weeks). It basically follows [a new textbook on statistics learned with Python](https://www.amazon.co.jp/dp/4798155063), but since you also learn how to fight in Kaggle, I can recommend it as online teaching material. I was able to ask lots of questions that aren't written in the book, including reverse questions like "which part of math do I have to go back to in order to solve this?". I had been doing my best on my own, but I had ended up doctor-shopping from book to book. If that is where you are, you are better off paying for it (once you take that step you find you can actually ask sharp questions :muscle:). It is not easy, but even while working I managed to get through all 456 pages (the whole book) in 4 weeks lol
The point of building a model is to end up with a mathematical function (you build something like y = x^2, and then the computer does the calculation by plugging values into the formula).
Cheat sheet 1 (variable names are inconsistent because this is a collection of snippets)
#◆ Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm            #used later for qqplot / GLM families
import statsmodels.formula.api as smf   #used later for ols / glm
%precision 3 #show 3 decimal places
%matplotlib inline #jupyter only
sns.set()
#◆ Read data
data = pd.read_csv(r'data/kaggle_housing_price_revised.csv')
#◆ Confirmation 1 Data visualization
# 【pairplot】
sns.pairplot(data, hue='Category') #Hue color-codes each specified group
#[Scatter plot with histogram]
sns.jointplot(x = 'temperature', y = 'beer', data = beer, color = 'black')
#[Box plot]
sns.boxplot(x = 'Line', y = 'Density', data = data, color = 'gray')
#【Scatter plot】
fig, ax = plt.subplots()
ax.scatter(x = data['GrLivArea'], y = data['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
#◆ Confirmation 2: graph the percentage of missing values for each feature
all_data_na = (data.isnull().sum() / len(data)) * 100
#Drop features with no missing values (== 0) and sort the rest in descending order
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation='90')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.ylim([0,100]) #Fixed scale with maximum value set to 100
#◆ Fill in missing values (confirmation 2)
#Fill None:
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
#0 fill:
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
#Median fill:
data["LotFrontage"] = data.groupby("Neighborhood")["LotFrontage"].apply(lambda x: x.fillna(x.median()))
#◆ Deletion of strongly correlated variables
threshold = 0.8
#Build the correlation matrix from data, taking absolute values so negative correlations count as positive (corr() only operates on numeric columns, so corr_matrix.columns shrinks accordingly)
corr_matrix = data.corr().abs()
#Keep only the upper triangle: the diagonal is always 1 and the lower triangle duplicates the upper one (np.triu builds a 0/1 mask, cast to bool, and where() keeps only the entries where the mask is True)
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
#From the upper triangle, pick the columns that have at least one correlation above the threshold (SalePrice itself is removed from the drop list right after)
to_drop = [column for column in upper.columns if any(upper[column] > threshold)]
print('There are %d columns to remove.' % (len(to_drop)))
to_drop.remove('SalePrice')
data_dropped = data.drop(columns = to_drop)
#◆ Dummy variable
data_dummied = pd.get_dummies(data_dropped)
data_dummied.head()
#◆ Creating a model
#[Least squares method]
lm_model = smf.ols(formula = 'beer ~ temperature', data = beer).fit()
lm_model.summary()
#[Generalized linear model]
glm_model = smf.glm(formula = 'beer ~ temperature', data = data, family=sm.families.Binomial()).fit()
glm_model.summary()
#[AIC check] See whether the AIC is smaller than that of the null model
null_model = smf.ols('beer ~ 1', data = beer).fit()
print('null_model.aic: ', null_model.aic)
print('lm_model.aic: ', lm_model.aic)
print('If the explanatory variable of temperature is included, the AIC will be smaller, so the prediction accuracy will be higher.')
#◆ (Inspection 1) Regression straight line
sns.lmplot(x = 'temperature', y = 'beer', data = beer, scatter_kws = {'color': 'black'}, line_kws = {'color': 'black'})
#[Confirmation 1] The predicted value at 0 degrees should match the intercept
lm_model.predict(pd.DataFrame({'temperature':[0]}))
#[Confirmation 2] The predicted value of 20 degrees Celsius matches the position of the seaborn regression line.
lm_model.predict(pd.DataFrame({'temperature':[20]}))
#◆ (Check 2) Use the chosen model's Q-Q plot to check the normality of the residuals
sm.qqplot(lm_model.resid, line = 's')
#Predicted value and actual SalePrice can be overlaid on a scatter plot
x = range(len(data))
y = data['SalePrice']
plt.scatter(x, y)
plt.scatter(x, lm_model.predict(data))
Random forest is easy to do, but the interpretation of "what happened" is important.
Cheat sheet 2 (variable names are inconsistent because this is a collection of snippets)
#Run a random forest and compute the MSE on the validation data
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
model_random_forest = RandomForestRegressor(criterion='mse').fit(X_train, y_train)
mean_squared_error(y_test, model_random_forest.predict(X_test))
> 1322565474.5486445
#What is this number!? You can sanity-check it as "the mean squared difference between the validation data and the predictions"
#◆ Convince yourself by comparing mean_squared_error against a hand-rolled MSE
np.mean((y_test - model_random_forest.predict(X_test))**2)
> 1322565474.5486445
#After building the model, plot y against y-hat in a scatter plot
fig, ax = plt.subplots()
ax.scatter(x = y_test, y = model_random_forest.predict(X_test))
plt.ylabel('predict', fontsize=13)
plt.xlabel('SalePrice', fontsize=13)
plt.show()
◆ Random forest plot: SalePrice and the predicted values line up along a straight line (predictive performance is reasonable)
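Since the point above is that interpretation matters, one common follow-up is to look at which features the forest actually relied on. A minimal sketch, assuming the model_random_forest and X_train objects from the cheat sheet and that X_train is a DataFrame (so its columns carry the feature names):
import pandas as pd

#Rank the features by the importance the fitted forest assigned to them
importances = pd.Series(model_random_forest.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))  #the top 10 drivers of the prediction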
Clustering is also easy to do, but the interpretation of "what happened" is important.
Cheat sheet 3 (variable names are inconsistent because this is a collection of snippets)
#Module loading
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
#Data read
dataset = pd.read_csv('data/baseball_salary.csv')
#Keep only the 'batting average' and 'RBI' columns
dataset = dataset[['batting average', 'RBI']]
#Normalize with MinMaxScaler
dataset_scaled = MinMaxScaler().fit(dataset).transform(dataset)
#Draw a scatter plot of batting average and RBI after normalization
fig, ax = plt.subplots()
ax.scatter(x = dataset_scaled[:,0], y = dataset_scaled[:,1])
plt.ylabel('points', fontsize=13)
plt.xlabel('batting average', fontsize=13)
plt.show()
#Determine the number of clusters by looking at the elbow (looks like 6...)
# WCSS: within-cluster sum of squares (the sum of squared distances from each point to its nearest cluster center)
wcss = []
n_comps = np.arange(1, 10)
for k in n_comps:
    #Initialize KMeans. init defaults to "k-means++"
    km = KMeans(n_clusters=k, init='k-means++').fit(dataset_scaled)
    #The inertia_ attribute holds the WCSS value
    wcss.append(km.inertia_)
fig = plt.figure(figsize=(6, 4))
plt.plot(n_comps, wcss)
plt.xlabel("# of clusters")
plt.ylabel("WCSS")
plt.show()
#Fit and draw the scatter plot (the for statement colors each cluster)
km = KMeans(n_clusters=6, init='k-means++').fit(dataset_scaled)
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
colors = ['dodgerblue', 'orange', 'green', 'tomato', 'blueviolet', 'cornflowerblue']
cluster_ids = list(set(km.labels_))
for k in range(len(cluster_ids)):
    average = dataset_scaled[km.labels_ == k, 0]
    points = dataset_scaled[km.labels_ == k, 1]
    ax.scatter(average, points, c=colors[k], s=3, label=f'cluster = {cluster_ids[k]}')
    ax.scatter(average.mean(), points.mean(), marker='*', c=colors[k], s=150)
ax.set_title('Clustering')
ax.set_xlabel('batting average')
ax.set_ylabel('points')
ax.legend(loc='upper right')
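As with the random forest, it helps to tie the clustering result back to something you can interpret. A minimal sketch, assuming the km and dataset objects from the cheat sheet, that attaches each player's cluster id back to the original (un-normalized) rows and summarizes the clusters:
#Attach the cluster id to the original data and summarize each cluster
dataset_labeled = dataset.copy()
dataset_labeled['cluster'] = km.labels_

#Mean batting average / RBI per cluster, plus how many players ended up in each one
print(dataset_labeled.groupby('cluster').agg(['mean', 'count']))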