While working with Python, I often hear that it has a lot of machine learning libraries. I knew they existed, but I had never actually tried them myself. The articles below made it look easy, though, so I decided to try machine learning, specifically SVM (Support Vector Machine), and write up the result.
Here, we will do everything with weather data: **data acquisition, very simple data processing, learning, and visualization**.
I mainly referred to the following two articles:
- [Python] Easy introduction to machine learning with python (SVM)
- [Python for beginners in machine learning] Easy implementation of SVM with scikit-learn
I'm running Anaconda on Windows 10.
name | version |
---|---|
Python | 3.7.3 |
scikit-learn | 0.23.1 |
pandas | 1.0.5 |
NumPy | 1.18.5 |
Matplotlib | 3.2.2 |
mlxtend | 0.17.3 |
Each can be installed with pip as shown below.
$ pip install scikit-learn
$ pip install pandas
$ pip install numpy
$ pip install matplotlib
$ pip install mlxtend
The assumed level is the same as the two articles listed above:
- You can handle Python, NumPy, and pandas
- You know that machine learning exists
- You want to see the flow of an SVM implementation
I won't go into details here. Please refer to the reference articles.
What is machine learning classification?
In a classification task, a finite number of classes are defined in advance, and each class is given a name called a class label (or simply a label), such as "cat" or "dog". The goal of the classification task is to guess which class a given input x belongs to. [Machine Learning: Classification - Wikipedia](https://ja.wikipedia.org/wiki/%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92#%E5%88%86%E9%A1%9E)
Here, we use data such as temperature, precipitation, and cloud cover to guess the weather (the label).
SVM itself is described as follows.
A support vector machine (SVM) is a pattern recognition model that uses supervised learning. It is applicable to both classification and regression. SVM - Wikipedia
The explanation below follows this flow: data acquisition, data processing, learning, and visualization.
Since we decided to handle weather data this time, let's download it from this page of the Japan Meteorological Agency.
Select the location, the items (temperature, precipitation, and so on), and the period, then download. Feel free to pick whatever items you like; it isn't a problem, because you can choose which columns to use when training later. You should end up with a CSV file named data.csv.
When I experimented a little, the data classified better when I trained on the same months across several years, e.g. October-November 2001, October-November 2002, October-November 2003, and so on (12 months in total), rather than on a single continuous year such as 2019. (Intuitively, it makes sense that training on data from the same season works better.)
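Incidentally, if you would rather download a longer span and filter it afterwards, a minimal pandas sketch of that kind of month filtering might look like the following. Note that the date column name "Date" here is an assumption for illustration; the actual column name in the JMA CSV will differ.

```python
import pandas as pd

# Hypothetical sketch: keep only the October/November rows across all years.
# "Date" is an assumed column name; match it to the actual JMA CSV header.
df = pd.read_csv("data.csv", header=2, encoding="SHIFT-JIS")
df["Date"] = pd.to_datetime(df["Date"])
df = df[df["Date"].dt.month.isin([10, 11])]
```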
From here, we will process the data using pandas etc.
The first line of the downloaded file is metadata such as
Download time: 2020/11/16 18:18:28
so we skip it by reading with header=2, and because the file contains Japanese we set encoding="SHIFT-JIS".
import numpy as np
import pandas as pd
# Read the CSV file (adjust the path and file name to match your own data.csv)
df = pd.read_csv("data.csv", header=2, encoding="SHIFT-JIS")
I think df at this point looks like the following. Because the original file has duplicated column names, pandas disambiguates them with suffixes such as ".1". You can see that this happens because row 0 holds extra columns such as quality information and homogeneity numbers. We don't need them this time, so let's delete them.
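If you want to confirm this yourself, printing the column names makes the suffixed duplicates visible (a quick sanity check, nothing more):

```python
# Duplicated JMA columns show up with ".1", ".2", ... suffixes
print(df.columns.tolist())
```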
It's a bit brute-force, but I did the following: drop row 0, drop the suffixed columns, and then drop any rows whose values are all missing.
# Drop row 0
df = df.drop(df.index[[0]])
# ".1", ".2", ".3"Drops the column at the end of the column name
df = df.drop(df.loc[:, df.columns.str.endswith(".1")], axis = 1)
df = df.drop(df.loc[:, df.columns.str.endswith(".2")], axis = 1)
df = df.drop(df.loc[:, df.columns.str.endswith(".3")], axis = 1)
# Drop rows where every value is missing
df = df.dropna(how='all')
I think it's clean now!
Now let's look at the number of unique labels (the weather overview). In the data I downloaded there were a whopping 64 of them. That's far too many.
print(len(df["Weather overview(Noon: 06:00 to 18:00)"].unique().tolist()))
Here is part of the list:
['Partly cloudy', 'Cloudy, temporary rain', 'Cloudy', 'Sunny then cloudy', 'Cloudy then rain', 'Sunny then lightly cloudy', 'Rain, temporarily cloudy',
'Sunny', 'Cloudy, sometimes rain', 'Cloudy, temporarily sunny', 'Rain', 'Cloudy then sunny', 'Rain, sometimes cloudy', 'Cloudy, temporary rain', 'Clear',
'Sunny after rain', 'Sunny, temporarily cloudy', 'Lightly cloudy', 'Rain, temporarily sunny', 'Cloudy after rain', 'Cloudy, temporarily sunny',
'Sunny and cloudy', 'Rain with thunder', 'Sunny, then temporary rain with thunder', 'Heavy rain', 'Cloudy then sunny after temporary rain',
'Cloudy, temporary fog', 'Lightly cloudy, temporarily sunny', 'Cloudy, sometimes sunny', 'Rain then cloudy', 'Sunny, temporarily cloudy', 'Cloudy, temporary rain, with thunder']
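If you also want to see how often each label occurs, not just how many unique labels there are, pandas' value_counts() is handy:

```python
# Count the rows for each weather label, most frequent first
print(df["Weather overview(Noon: 06:00 to 18:00)"].value_counts())
```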
**This time,** to simplify the classification, I took the following two steps: (1) take only the first character of the label, and (2) replace it with a number.
df["Weather overview(Noon: 06:00 to 18:00)"] = df["Weather overview(Noon: 06:00 to 18:00)"].str[:1]
df["Weather No"] = df["Weather overview(Noon: 06:00 to 18:00)"].str.replace("Cloudy","0").replace("Fine", "1").replace("Big", "3").replace("rain", "3").replace("Thin", "0").replace("Pleasant", "1").replace("fog","0")
Below is the list of weather categories and the numbers they were replaced with. It's very simplified this time; if you don't like it, feel free to change it yourself.
Weather (1st character) | Number | Original label |
---|---|---|
Cloudy | 0 | Cloudy (possibly with "then", "temporarily", etc.) |
Thin | 0 | Lightly cloudy (likewise) |
fog | 0 | Fog (likewise) |
Pleasant | 1 | Clear (likewise) |
Fine | 1 | Sunny (likewise) |
rain | 2 | Rain (likewise) |
Big | 2 | Heavy rain (likewise) |
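As an aside, instead of chaining replace() calls, the same mapping can be written with a dictionary and Series.map(), which makes the table above explicit in the code. A minimal sketch using the same first characters as above:

```python
# Map the first character of each label to its numeric class in one step.
# Labels missing from the dictionary become NaN, so surprises are easy to spot.
weather_map = {"Cloudy": 0, "Thin": 0, "fog": 0, "Pleasant": 1, "Fine": 1, "rain": 2, "Big": 2}
df["Weather No"] = df["Weather overview(Noon: 06:00 to 18:00)"].map(weather_map)
```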
It's finally time to train. This time we use a function called train_test_split to divide the data into training data and test data, so that we can check whether the model can predict not only the training data but also unseen data.
The training itself is very easy:
model.fit(x_train, y_train)
Just this one line.
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Explanatory variables (features)
x = df.loc[1:, ["Total precipitation(mm)","Average cloud cover(10 minutes ratio)"]]
# Objective variable (label)
y = df.loc[1:,"Weather No"].astype("int64")
# Split into training and test data
# test_size=0.3: 30% test data, 70% training data
# random_state=None: a different split is generated on each run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=None)
# Select SVM as the model
model = svm.SVC()
# Train the model
model.fit(x_train, y_train)
# Accuracy on the training data
pred_train = model.predict(x_train)
accuracy_train = accuracy_score(y_train, pred_train)
print('Accuracy on training data: %.2f' % accuracy_train)
# Accuracy on the test data
pred_test = model.predict(x_test)
accuracy_test = accuracy_score(y_test, pred_test)
print('Accuracy on test data: %.2f' % accuracy_test)
If you get results like the following, it worked!
Accuracy on training data: 0.81
Accuracy on test data: 0.82
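Accuracy alone can hide class imbalance (cloudy and sunny days vastly outnumber rainy ones), so as an optional extra check you can look at per-class results with scikit-learn's classification_report:

```python
from sklearn.metrics import classification_report

# Per-class precision/recall/F1 on the test data; shows whether
# the rare classes (e.g. rain) are actually being predicted at all
print(classification_report(y_test, pred_test))
```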
Since the classification is done by model.predict(), it will return a classification result even for, as an extreme example, an arbitrary input like:
model.predict([[1,1]])
Finally, visualization. We visualize the decision boundaries with plot_decision_regions.
plot_decision_regions is also easy to use: just pass it the data and the model, and it creates the graph for you.
Note, however, that the x passed here must be two-dimensional; the two features correspond to the x and y axes of the resulting graph.
# Visualize the decision boundaries
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
x_combined = x_test.values
y_combined = y_test.values
fig = plt.figure(figsize=(13,8))
plot_decision_regions(x_combined, y_combined, clf=model, res=0.02)
plt.show()
In my case, the figure below came out. Hmm...? (Depending on your mlxtend version, the res argument of plot_decision_regions may be deprecated; if you see a warning, it should be safe to omit it.)
import numpy as np
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
# Read the CSV file (adjust the path and file name to match your own data.csv)
df = pd.read_csv("data.csv", header=2, encoding="SHIFT-JIS")
# Drop row 0
df = df.drop(df.index[[0]])
# ".1", ".2", ".3"Drops the column at the end of the column name
df = df.drop(df.loc[:, df.columns.str.endswith(".1")], axis = 1)
df = df.drop(df.loc[:, df.columns.str.endswith(".2")], axis = 1)
df = df.drop(df.loc[:, df.columns.str.endswith(".3")], axis = 1)
# Drop rows where every value is missing
df = df.dropna(how='all')
# Label processing: keep the first character of the label, then map it to a number
df["Weather overview(Noon: 06:00 to 18:00)"] = df["Weather overview(Noon: 06:00 to 18:00)"].str[:1]
df["Weather No"] = df["Weather overview(Noon: 06:00 to 18:00)"].str.replace("Cloudy","0").replace("Fine", "1").replace("Big", "2").replace("rain", "2").replace("Thin", "0").replace("Pleasant", "1").replace("fog","0")
# Explanatory variables (features)
x = df.loc[1:, ["Total precipitation(mm)","Average cloud cover(10 minutes ratio)"]]
# Objective variable (label)
y = df.loc[1:,"Weather No"].astype("int64")
# Split into training and test data
# test_size=0.3: 30% test data, 70% training data
# random_state=None: a different split is generated on each run
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=None)
# Select SVM as the model
model = svm.SVC()
# Train the model
model.fit(x_train, y_train)
# Accuracy on the training data
pred_train = model.predict(x_train)
accuracy_train = accuracy_score(y_train, pred_train)
print('Accuracy on training data: %.2f' % accuracy_train)
# Accuracy on the test data
pred_test = model.predict(x_test)
accuracy_test = accuracy_score(y_test, pred_test)
print('Accuracy on test data: %.2f' % accuracy_test)
# Visualize the decision boundaries
x_combined = x_test.values
y_combined = y_test.values
fig = plt.figure(figsize=(13,8))
plot_decision_regions(x_combined, y_combined, clf=model, res=0.02)
plt.show()
This was my first encounter with machine learning, but it was surprisingly easy! (Even if there were various rough parts along the way.) I was inspired by the articles above, so please give it a try yourself.