You will be an engineer in 100 days ――Day 77 ――Programming ――About machine learning 2

This post continues the discussion of machine learning.

About the flow of machine learning

Last time I gave a rough overview of what machine learning can and cannot do. Today I would like to cover what you actually have to do, specifically.

First, let's talk about the flow of machine learning. I will explain how machine learning is done in business.

The flow of work when incorporating machine learning is as follows.

0. Determine the purpose
1. Data acquisition
2. Data understanding / selection / processing
3. Data mart (data set) creation
4. Model creation
5. Accuracy verification
6. System implementation

Let's look at the specific content.

0. Determine the purpose

I think this is the most important step. Decide the purpose: what are you doing machine learning for, and what do you want to achieve?

The only thing you can do with machine learning is prediction.

That prediction comes in three kinds:

Regression: predict numerical values
Classification: predict categories, such as male or female
Clustering: split the data into reasonable groups

Those three things are all you can do.
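To make the three kinds of prediction concrete, here is a minimal sketch using only the Python standard library. The data, the function names, and the very simple methods (a least-squares line, nearest neighbour, and a two-group k-means) are my own assumptions for illustration; a real project would normally use a machine learning library instead.

```python
from statistics import mean

# --- Regression: predict a numerical value ---------------------------------
# Hypothetical data: months since sign-up -> monthly spend (in 1,000 yen).
months = [1, 2, 3, 4, 5]
spend = [2.0, 4.0, 6.0, 8.0, 10.0]

def fit_line(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(months, spend)

def predict_spend(month):
    """Predict spend for a given month with the fitted line."""
    return slope * month + intercept

# --- Classification: predict a category -------------------------------------
# Hypothetical labelled data: (monthly logins, label).
labelled = [(1, "churn"), (2, "churn"), (20, "stay"), (25, "stay")]

def classify(logins):
    """1-nearest-neighbour: copy the label of the closest known example."""
    nearest = min(labelled, key=lambda row: abs(row[0] - logins))
    return nearest[1]

# --- Clustering: split unlabelled data into groups ---------------------------
def two_means(values, rounds=10):
    """Tiny 1-D k-means with k=2; returns the two groups as sorted lists."""
    centers = [min(values), max(values)]
    groups = [[], []]
    for _ in range(rounds):
        groups = [[], []]
        for v in values:
            # Assign each value to the closer of the two centers.
            index = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            groups[index].append(v)
        # Move each center to the average of its group.
        centers = [mean(g) if g else c for g, c in zip(groups, centers)]
    return [sorted(g) for g in groups]

print(predict_spend(6))
print(classify(3))
print(two_means([1, 2, 3, 20, 22, 25]))
```

Note that regression and classification need correct answers in the training data, while clustering only needs the values themselves.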

First of all, we have to decide what to predict for the purpose of machine learning.

Good examples are:

I want to reduce the user withdrawal rate, so I want to predict the user withdrawal rate.
I want to increase sales, so I want to predict whether a user will buy.

"To achieve 〇〇, predict the XX that leads to it." I think that is the right way to introduce machine learning.

Basically, I think the question is whether it connects directly to sales and profit.

If you cannot tell whether it leads there, or it is hard to judge, it is not a good idea to hand the problem to machine learning.

In the first place, machine learning requires a huge amount of man-hours for the subsequent work. It costs a lot of money.

If the development cost is estimated at 30 million yen but the project would generate almost no profit, doing it is pointless, and the wise decision is not to do it.
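The go / no-go judgment above is simple arithmetic, which could be sketched like this. Only the 30 million yen cost comes from the text; the function name, the profit figures, and the three-year horizon are assumptions I made up for illustration.

```python
def worth_building(development_cost_yen, expected_annual_profit_yen, years=3):
    """Return True if the expected profit over `years` exceeds the development cost."""
    expected_total_profit = expected_annual_profit_yen * years
    return expected_total_profit > development_cost_yen

# 30 million yen is the development cost from the text; the profit figures
# and the 3-year horizon are hypothetical numbers.
print(worth_building(30_000_000, 1_000_000))   # almost no profit
print(worth_building(30_000_000, 20_000_000))  # profit well above the cost
```

A real estimate would also include operating costs, but even this rough check is enough to rule out projects that can never pay for themselves.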

"I want to verify how accurate machine learning would be." This is OK as a purpose.

If the goal is looser, such as a PoC (proof of concept) that only goes as far as verifying whether the experiment works, the verification result itself can serve as the outcome. I think verification can be a valid purpose.

Most of the time, though, you just end up throwing your money away.

1. Data acquisition

Once the purpose is decided, you need to prepare the data for it.

If you have already acquired the data and just want to use it, all that is needed is to hand the data over, and there is little more to say.

However, if there is no data yet and you have to start acquiring it from now on, it will be difficult. You need to design the data carefully, considering what kind of data will work, and start by creating a mechanism that acquires exactly the data you need, no more and no less.

What you need to do is check whether the client or your own company has the right data and, if so, decide how to transfer it. If there is no data, you start from designing the data acquisition.

Regarding sending and receiving data, you can receive it on HDD or SSD, or you can receive it via cloud storage. I think it's mostly via cloud storage these days.

2. Data understanding / selection / processing

This is the step that comes before what is called data preprocessing.

We perform basic aggregation to find out what kind of data there is, what its composition ratios are, and how much of it there is, and then analyze and visualize the data.
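As a rough sketch of that basic aggregation, the following counts the rows, computes a composition ratio, and counts missing values. The records and column names are hypothetical, and in practice you would probably use a data analysis library on much larger data.

```python
from collections import Counter

# Hypothetical raw records; the column names are assumptions.
records = [
    {"user_id": 1, "plan": "free", "age": 23},
    {"user_id": 2, "plan": "paid", "age": None},
    {"user_id": 3, "plan": "free", "age": 41},
    {"user_id": 4, "plan": "free", "age": 35},
]

def composition_ratio(rows, column):
    """Share of each distinct value in `column`."""
    counts = Counter(row[column] for row in rows)
    return {value: count / len(rows) for value, count in counts.items()}

def missing_count(rows, column):
    """Number of rows where `column` has no value."""
    return sum(1 for row in rows if row[column] is None)

print(len(records))                        # how much data there is
print(composition_ratio(records, "plan"))  # composition ratio per plan
print(missing_count(records, "age"))       # rows with a missing age
```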

Then, we will select data that can be used.

When it comes to large data, it takes many days just to grasp the data. If you don't understand the characteristics of the data properly here, the work will go back and forth.

3. Data mart (data set) creation

Now, let's create data for machine learning from here. After narrowing down the data candidates that can be used to some extent, we will make the data usable for machine learning.

Machine learning can only use tabular data, so the data must be gathered into a single table.

Supervised machine learning requires explanatory variables to learn from and correct labels that represent the right answers.

The thing you want to predict has to be processed into a correct-label column.

Also, all the data used as explanatory variables must be converted to numerical values.

This work is what is called data preprocessing.
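Here is a minimal sketch of those two conversions: turning what you want to predict into a correct-label column, and turning the explanatory variables into numbers. The rows, the column names, and the choice of one-hot encoding for the category are my assumptions for illustration.

```python
# Hypothetical raw rows; "withdrew" is what we want to predict.
raw_rows = [
    {"plan": "free", "logins": 2, "withdrew": "yes"},
    {"plan": "paid", "logins": 30, "withdrew": "no"},
    {"plan": "free", "logins": 12, "withdrew": "no"},
]

def build_dataset(rows, categorical_column, numeric_column, label_column):
    """Return (X, y, feature_names) with every value converted to a number."""
    categories = sorted({row[categorical_column] for row in rows})
    feature_names = [f"{categorical_column}={c}" for c in categories] + [numeric_column]
    X, y = [], []
    for row in rows:
        # One-hot encoding: 1 for the row's own category, 0 for the others.
        one_hot = [1 if row[categorical_column] == c else 0 for c in categories]
        X.append(one_hot + [row[numeric_column]])
        # Correct-label column: 1 if the user withdrew, otherwise 0.
        y.append(1 if row[label_column] == "yes" else 0)
    return X, y, feature_names

X, y, feature_names = build_dataset(raw_rows, "plan", "logins", "withdrew")
print(feature_names)
print(X)
print(y)
```

Each row of X lines up with the entry of y at the same position, which is the shape most supervised learning tools expect.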

Few machine learning projects start with clean data, and data that needs almost no preprocessing is rare. Unless data collection was designed to be tidy at the acquisition stage, you will spend time processing the data.

In machine learning, we will combine the data into one tabular format. If there are many types of data, it is necessary to devise ways to put them together.
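One common way to put several kinds of data together is to aggregate each source down to one row per key and then join on that key. The sketch below assumes two hypothetical tables keyed by user_id; the table and column names are made up.

```python
# Hypothetical source tables keyed by user_id.
profiles = [
    {"user_id": 1, "age": 23},
    {"user_id": 2, "age": 41},
]
purchases = [
    {"user_id": 1, "amount": 1200},
    {"user_id": 1, "amount": 800},
    {"user_id": 2, "amount": 500},
]

def build_table(profile_rows, purchase_rows):
    """One row per user: profile columns plus aggregated purchase columns."""
    totals, counts = {}, {}
    for purchase in purchase_rows:
        uid = purchase["user_id"]
        totals[uid] = totals.get(uid, 0) + purchase["amount"]
        counts[uid] = counts.get(uid, 0) + 1
    table = []
    for profile in profile_rows:
        uid = profile["user_id"]
        table.append({
            **profile,
            "purchase_count": counts.get(uid, 0),   # 0 if the user never bought
            "purchase_total": totals.get(uid, 0),
        })
    return table

for row in build_table(profiles, purchases):
    print(row)
```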

Usually, you will end up with thousands to tens of thousands of columns of data.

The number of rows varies considerably depending on the type of business and the data acquisition mechanism. You do not need to worry about it too much, but a small number of rows may hurt accuracy.

If preprocessing leaves you with 2 million usable rows in one case and 20 rows in another, the accuracy will differ.

4. Model creation

Creating a model basically does not require that many man-hours. Even if you make many models, not all of them will be used.

All you have to do is make one model that is accurate and usable.

You do have to try many techniques to create a model, but once you have done it to a certain extent you settle on the methods that tend to work. And if you select methods mechanically, you only have to run about 10 methods at once and wait for the results, so it does not take much effort.
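Trying methods mechanically can be as simple as keeping the candidates in a dictionary and fitting them all in one pass. The three methods below (two baselines and a straight-line fit) and the training data are assumptions chosen to keep the sketch short; real candidates would be proper machine learning algorithms.

```python
from statistics import mean, median

# Hypothetical training data: one input value x and a numeric target y.
train_x = [1, 2, 3, 4, 5]
train_y = [2.0, 4.0, 6.0, 8.0, 10.0]

def fit_mean(xs, ys):
    """Baseline: always predict the average target."""
    avg = mean(ys)
    return lambda x: avg

def fit_median(xs, ys):
    """Baseline: always predict the median target."""
    mid = median(ys)
    return lambda x: mid

def fit_line(xs, ys):
    """Least-squares straight line through the training data."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

# Every method shares the same interface, so adding one is a single line.
methods = {"mean": fit_mean, "median": fit_median, "line": fit_line}

# Fit every method in one pass and keep the resulting models by name.
models = {name: fit(train_x, train_y) for name, fit in methods.items()}

for name, model in models.items():
    print(name, model(6))
```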

With a service like DataRobot, if you have the data at hand, it will automatically create various models from that data.

Creating a model is now very easy and not time consuming. It is a relatively small part of the man-hours in machine learning.

5. Accuracy verification

This is done as a set with model creation: after making a model, we verify how accurate it is, and keep building while checking.

There are several methods of accuracy verification, but in general, we look at how much error occurred.

A model with less error is considered a better model, so the models are ranked from lowest error to highest, and in the end one of the more accurate models is adopted.
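As a small sketch of ranking models by error, the following uses mean absolute error, which is one common error measure among several. The actual values, the model names, and their predictions are made-up numbers for illustration.

```python
def mean_absolute_error(actual, predicted):
    """Average size of the prediction error, ignoring its sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical hold-out answers and the predictions of three models.
actual = [10.0, 20.0, 30.0]
predictions = {
    "model_a": [12.0, 18.0, 33.0],
    "model_b": [10.0, 21.0, 29.0],
    "model_c": [20.0, 20.0, 20.0],
}

# Compute the error of every model, then sort from lowest error to highest.
errors = {name: mean_absolute_error(actual, p) for name, p in predictions.items()}
ranking = sorted(errors.items(), key=lambda item: item[1])
best_name, best_error = ranking[0]

for name, error in ranking:
    print(name, round(error, 3))
print("adopted:", best_name)
```

The error should be measured on data that was not used for training, otherwise a model that merely memorized the training data would look best.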

Until a good model is obtained, the following steps are repeated:

2. Data understanding / selection / processing
3. Data mart (data set) creation
4. Model creation
5. Accuracy verification

If the accuracy is still not satisfactory after that, you may have to start over from:

1. Data acquisition
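The loop described above can be sketched as two nested loops: an inner one that repeats steps 2 to 5 on the same data, and an outer one that goes back to data acquisition. Everything here is hypothetical, including the function names, the stub steps, and the limits on the number of tries.

```python
def run_project(build_and_evaluate, acquire_data, target_error,
                tries_per_dataset=3, max_acquisitions=2):
    """Repeat steps 2-5; if accuracy stays poor, go back to data acquisition."""
    for acquisition in range(1, max_acquisitions + 1):
        data = acquire_data(acquisition)              # step 1
        for attempt in range(1, tries_per_dataset + 1):
            error = build_and_evaluate(data, attempt)  # steps 2-5
            if error <= target_error:
                return {"acquisition": acquisition, "attempt": attempt, "error": error}
    return None  # the target accuracy was never reached

# Stub steps for illustration: the second acquisition yields better data,
# and each further attempt on the same data lowers the error.
def acquire_data(acquisition):
    return {1: 0.9, 2: 0.3}[acquisition]

def build_and_evaluate(data, attempt):
    return data / attempt

result = run_project(build_and_evaluate, acquire_data, target_error=0.2)
print(result)
```

With these stubs, no attempt on the first data set reaches the target, which mirrors the point that more modeling effort cannot make up for poor data.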

A good model can only come from good data. Garbage data produces nothing but garbage.

It is quite rare to find treasure mixed in with garbage data.

6. System implementation

When the data has been prepared and it is found that the accuracy is reasonable, we will finally incorporate the model into the system.

Generally, for a web service, the model is incorporated on the back end and provided as one of its functions.
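To give a feel for what incorporating a model on the back end can look like, here is a minimal sketch of a handler that takes a JSON request and returns a JSON response. The function names, the field name logins_last_30_days, and the stand-in rule used instead of a trained model are all assumptions; a real service would load a trained model and sit behind a web framework.

```python
import json

def predict_withdrawal(features):
    """Stand-in for a trained model; the rule below is made up for illustration."""
    return 0.8 if features["logins_last_30_days"] < 3 else 0.1

def handle_predict_request(request_body):
    """Back-end handler: JSON request in, JSON response out."""
    try:
        features = json.loads(request_body)
        score = predict_withdrawal(features)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, a missing field, or a wrongly typed value.
        return json.dumps({"error": "invalid request"})
    return json.dumps({"withdrawal_probability": score})

print(handle_predict_request('{"logins_last_30_days": 1}'))
print(handle_predict_request("not json"))
```

Keeping the model behind one small function like this makes it easier to swap in a retrained model later without touching the rest of the system.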

You build the system while considering, together with the system requirements, how it will be operated and how much it will cost.

Some machine learning services, such as AWS SageMaker, let you make a model available as an endpoint right away. Using such a service reduces the amount of implementation work.

Summary

After learning what you can do with machine learning, it's a good idea to learn what it takes to actually do machine learning.

The general flow is the same, so I think it's best to refer to the methods of various companies.

23 days until you become an engineer

Author information

Otsu py's HP: http://www.otupy.net/

Youtube： https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter： https://twitter.com/otupython