Previous posts in this series:
You will become an engineer in 100 days - Day 76 - Programming - About machine learning
You will become an engineer in 100 days - Day 70 - Programming - About scraping
You will become an engineer in 100 days - Day 66 - Programming - About natural language processing
You will become an engineer in 100 days - Day 63 - Programming - Probability 1
You will become an engineer in 100 days - Day 59 - Programming - Algorithms
You will become an engineer in 100 days - Day 53 - Git - About Git
You will become an engineer in 100 days - Day 42 - Cloud - About cloud services
You will become an engineer in 100 days - Day 36 - Database - About the database
You will become an engineer in 100 days - Day 24 - Python - Basics of Python language 1
You will become an engineer in 100 days - Day 18 - JavaScript - JavaScript basics 1
You will become an engineer in 100 days - Day 14 - CSS - CSS basics 1
You will become an engineer in 100 days - Day 6 - HTML - HTML basics 1
This post continues the story about machine learning.
Last time I gave a rough overview of what machine learning can do. Today I would like to cover what you specifically have to do.
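First, let's talk about the flow of machine learning. I will explain how machine learning work proceeds in a business setting.
The flow of work when incorporating machine learning runs roughly as follows: decide the purpose, acquire the data, understand the data, preprocess the data, create a model, verify its accuracy, and incorporate the model into a system.
Let's look at each step in detail.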
Deciding the purpose is, I think, the most important step. Decide what you are doing machine learning for and what you want to achieve.
The only thing you can do with machine learning is prediction.
And there are only three kinds of prediction:
Regression: predict numerical values
Classification: predict categories, such as male or female
Clustering: split the data into reasonable groups
First of all, we have to decide what to predict as the purpose of machine learning.
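To make the three kinds of prediction concrete, here is a minimal sketch using only the Python standard library. The toy data (`ages`, `spend`, `bought`) and all function names are my own inventions for illustration, and each method is a deliberately tiny stand-in (a least-squares line, a nearest-centroid rule, a two-group k-means), not what a real project would necessarily use.

```python
from statistics import mean

# Toy data, invented for illustration only.
ages = [20, 30, 40, 50]
spend = [200, 300, 400, 500]   # numeric target -> regression
bought = [0, 0, 1, 1]          # category target -> classification


def fit_line(xs, ys):
    """Regression: least-squares line y = a * x + b."""
    mx, my = mean(xs), mean(ys)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx


def fit_centroids(xs, labels):
    """Classification: remember the mean x of each known label."""
    return {lab: mean(x for x, l in zip(xs, labels) if l == lab)
            for lab in set(labels)}


def classify(centroids, x):
    """Predict the label whose centroid is nearest to x."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - x))


def two_means(xs, steps=10):
    """Clustering: 1-D k-means with k=2. No labels are given.

    Assumes xs contains at least two distinct values.
    """
    c0, c1 = min(xs), max(xs)
    for _ in range(steps):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]


a, b = fit_line(ages, spend)             # regression: a number comes out
centroids = fit_centroids(ages, bought)  # classification: a category comes out
groups = two_means([1, 2, 3, 10, 11, 12])  # clustering: group ids come out
```

The point is only the shape of the output: regression returns a number, classification returns one of the known categories, and clustering returns group numbers that nobody defined in advance.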
Good examples are:
I want to reduce the user withdrawal rate, so I want to predict the user withdrawal rate.
I want to increase sales, so I want to predict whether the user will buy.
In other words: predict XX, which leads to 〇〇.
I think that is the right way to introduce machine learning.
Basically, I think the question is whether the prediction connects directly to sales and profit.
If you cannot tell whether it leads there, or it is hard to judge, it is not a good idea to hand the problem to machine learning.
In the first place, machine learning requires a huge amount of man-hours for the work that follows, and that costs a lot of money.
If the development cost is estimated at 30 million yen but almost no profit will be generated, it is wise to decide not to do it, because doing it would be a waste.
"I want to verify how accurate machine learning would be." This is OK.
If the purpose is a loose one, such as a PoC that only goes as far as checking whether the experiment works, the verification result itself can serve as the deliverable, so I think verification can be a valid purpose.
Most of the time, though, you just end up throwing your money away.
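The cost judgment above is simple arithmetic, so it can be sketched in a few lines. The function names and the profit figures below are hypothetical; only the 30 million yen development cost comes from the example in the text.

```python
def expected_net_benefit_yen(dev_cost_yen, annual_profit_gain_yen, years):
    """Profit gained over the period minus the development cost."""
    return annual_profit_gain_yen * years - dev_cost_yen


def worth_doing(dev_cost_yen, annual_profit_gain_yen, years):
    """True only if the project is expected to pay for itself."""
    return expected_net_benefit_yen(
        dev_cost_yen, annual_profit_gain_yen, years) > 0


# 30 million yen to build, with an assumed 1 million yen of extra
# profit per year over an assumed 3 years.
net = expected_net_benefit_yen(30_000_000, 1_000_000, 3)
decision = worth_doing(30_000_000, 1_000_000, 3)
```

With these assumed numbers the net benefit is negative, which is the "wise to decide not to do it" case. Real estimates would also include operating costs, which this sketch leaves out.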
Once the purpose is decided, you need to prepare data to match it.
If the data has already been acquired and you just want to use it, all that remains is handing the data over, so there is little to say.
However, if there is no data yet, starting to acquire it now is hard work. You have to design the data carefully, deciding what kind of data will be useful, and start by building a mechanism that acquires it without excess or deficiency.
In short, check whether the client or your own company has the right data, and if so, decide how to send and receive it. If there is no data, you start from designing the data acquisition.
As for sending and receiving data, you can receive it on an HDD or SSD, or via cloud storage. I think it is mostly via cloud storage these days.
Next comes the step that precedes what is called data preprocessing: understanding the data.
What kind of data is there, what are its composition ratios, and how much of it is there? We run basic aggregations like these, then analyze and visualize the data.
Then we select the data that can be used.
With large data, just grasping the data can take many days. If you don't properly understand the characteristics of the data here, the work will go back and forth later.
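As a rough idea of what "basic aggregation" looks like, here is a small sketch using only the standard library. The records and column names (`plan`, `age`, `churned`) are made up for illustration; in practice this kind of work is usually done with a data analysis library on far larger data.

```python
from collections import Counter

# Hypothetical records; None marks a missing value.
rows = [
    {"plan": "free", "age": 25, "churned": 0},
    {"plan": "paid", "age": None, "churned": 1},
    {"plan": "free", "age": 41, "churned": 0},
    {"plan": "free", "age": 33, "churned": 1},
]


def composition_ratio(rows, column):
    """Share of each distinct value in one column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def missing_counts(rows):
    """Number of missing (None) values per column."""
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None:
                missing[column] += 1
    return dict(missing)


row_count = len(rows)                          # how much data is there
plan_ratio = composition_ratio(rows, "plan")   # composition ratio of a column
gaps = missing_counts(rows)                    # which columns have holes
```

Checks like these are how you find out which columns are usable before spending time on preprocessing.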
From here, we create the data for machine learning. After narrowing down the candidate data to some extent, we turn it into a form machine learning can use.
Machine learning can only use tabular data, so everything has to be gathered into a table.
Supervised machine learning requires explanatory variables to learn from and a correct label that represents the right answer.
The thing you want to predict has to be processed into a correct-label column.
Also, all the data used as explanatory variables must be converted to numerical values.
This work is the data preprocessing step.
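Here is a minimal sketch of that processing: separating the correct label from the explanatory variables and converting a text column into numbers. The function name `build_table`, the column names, and the choice of one-hot encoding are my assumptions for illustration; the sketch also assumes there are no missing values.

```python
def build_table(rows, label_column, categorical_columns):
    """Split rows into numeric explanatory variables and correct labels.

    Categorical columns are one-hot encoded (one 0/1 value per category);
    every other column is assumed to already hold a number.
    """
    # Distinct values of each categorical column, in a fixed sorted order.
    categories = {col: sorted({row[col] for row in rows})
                  for col in categorical_columns}
    features, labels = [], []
    for row in rows:
        vector = []
        for col, value in row.items():
            if col == label_column:
                continue  # the label is kept separately
            if col in categories:
                vector.extend(1.0 if value == cat else 0.0
                              for cat in categories[col])
            else:
                vector.append(float(value))
        features.append(vector)
        labels.append(row[label_column])
    return features, labels


# Hypothetical records: "churned" is what we want to predict.
rows = [
    {"plan": "free", "age": 25, "churned": 0},
    {"plan": "paid", "age": 41, "churned": 1},
]
X, y = build_table(rows, label_column="churned", categorical_columns=["plan"])
```

After this step every explanatory variable is a number, and the thing to predict sits in its own list, which is the shape supervised learning expects.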
Few machine learning projects start with clean data. Data that needs almost no preprocessing is rare, and unless data collection was designed to gather tidy data from the acquisition stage, you will spend time processing the data.
In machine learning, we combine the data into one table. If there are many types of data, you need to devise a way to put them together.
Usually, you will end up with thousands to tens of thousands of columns of data.
The number of rows varies considerably depending on the type of business and the data acquisition mechanism. You don't need to worry about it too much, but a small number of rows may affect the accuracy.
If preprocessing leaves 2 million usable rows in one case and 20 rows in another, the accuracy will differ.
Creating a model basically does not take that many man-hours. Even if you make many models, not all of them will be used.
All you need is one model that is accurate and usable.
You do have to try many techniques to create a model, but once you have done it to a certain extent you learn which methods tend to work, and if you select methods mechanically, you only have to run about 10 methods at once and wait for the results, so it does not take much effort.
With a service like DataRobot, if you have the data at hand, it will automatically create various models from that data.
Creating a model is now very easy and not time-consuming. It is a relatively small part of the total machine learning man-hours.
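"Trying methods mechanically" can be pictured as a loop over candidate methods. The three candidates below (always predict the mean, always predict the median, a least-squares line) are toy stand-ins that I chose so the sketch runs with the standard library alone; a real project would plug in actual learning algorithms, and this is not how DataRobot or any other service works internally.

```python
from statistics import mean, median


def fit_mean(xs, ys):
    """Baseline: always predict the mean of the training targets."""
    m = mean(ys)
    return lambda x: m


def fit_median(xs, ys):
    """Baseline: always predict the median of the training targets."""
    m = median(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Least-squares line through the training points."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: slope * (x - mx) + my


# Hypothetical registry of methods to try.
CANDIDATES = {"mean": fit_mean, "median": fit_median, "line": fit_line}


def fit_all(xs, ys, candidates=CANDIDATES):
    """Fit every candidate on the same data; returns name -> predictor."""
    return {name: fit(xs, ys) for name, fit in candidates.items()}


models = fit_all([1, 2, 3, 4], [2, 4, 6, 8])
predictions = {name: model(5) for name, model in models.items()}
```

Adding an eleventh method is one more entry in the dictionary, which is why this part costs little effort compared with preparing the data.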
Accuracy verification is done as a set with model creation: after making a model, we check how accurate it is as we go.
There are several methods of accuracy verification, but in general we look at how much error occurred.
A model with less error is considered a better model, so the models are ranked in order of least error, and finally one of the more accurate models is adopted.
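As a sketch of "looking at the error and ranking the models", the code below computes two common error measures, mean absolute error (MAE) and root mean squared error (RMSE), and sorts models from least error upward. The model names and prediction values are invented; which error measure is appropriate depends on the problem.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more heavily."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                / len(y_true))


def rank_by_error(y_true, predictions, metric=mae):
    """Return (model name, error) pairs sorted from least error upward."""
    scores = {name: metric(y_true, preds)
              for name, preds in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical held-out answers and what each model predicted for them.
y_true = [10, 20, 30]
predictions = {
    "model_a": [12, 18, 33],
    "model_b": [10, 21, 30],
}
ranking = rank_by_error(y_true, predictions)
best_name = ranking[0][0]
```

The errors should be measured on data the models did not see during training; otherwise the ranking only shows which model memorized the training data best.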
The steps so far are repeated until a good machine learning flow is in place, and if the accuracy is still not satisfactory, you may have to start over.
A good model can only come from good data. Garbage data produces nothing but garbage.
It is quite rare to find treasure mixed in with garbage data.
Once the data has been prepared and the accuracy is found to be reasonable, we finally incorporate the model into the system.
Generally, for a web service, the model is incorporated as one of the functions on the back end.
The system is built alongside the system requirements, while considering how it will be operated and how much it will cost.
Some machine learning services, such as Amazon SageMaker on AWS, let you make a model available as an endpoint right away. Using such a service reduces the amount of implementation.
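To show what "provided as a function on the back end" can look like, here is a minimal sketch of a request handler that takes a JSON body and returns a JSON prediction. Everything in it is hypothetical: the `WEIGHTS`, the `predict` stand-in for a trained model, and the request format. It is not the API of SageMaker or of any web framework, only the general shape.

```python
import json

# Stand-in for a trained model: a fixed weighted sum with made-up weights.
WEIGHTS = [0.5, 0.25]


def predict(features):
    """Hypothetical model: weighted sum of the input features."""
    return sum(w * x for w, x in zip(WEIGHTS, features))


def handle_request(body):
    """Turn a JSON request body into a JSON response body.

    Expects a body like {"features": [1, 2]}; anything else gets an
    error message instead of a prediction.
    """
    try:
        payload = json.loads(body)
        features = [float(v) for v in payload["features"]]
    except (ValueError, KeyError, TypeError):
        return json.dumps({"error": "expected a JSON object with a "
                                    "'features' list of numbers"})
    return json.dumps({"prediction": predict(features)})


response = handle_request('{"features": [1, 2]}')
```

A web framework or a managed endpoint would call something like `handle_request` for each incoming request; the input validation matters because the back end cannot assume callers send well-formed data.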
After learning what you can do with machine learning, it's a good idea to learn what you have to do to actually carry it out.
The general flow is the same everywhere, so I think it's best to refer to how various companies do it.
23 days until you become an engineer
Otsu py's HP: http://www.otupy.net/
Youtube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw
Twitter: https://twitter.com/otupython