I tried to build a service that sells machine-learned data at explosive speed with Docker

About three years ago I wrote the article [Results of developing a machine learning stock price forecasting program by myself as a WEB shop](http://masamunet.com/2016/11/09/web%E5%B1%8B%E3%81%AE%E8%87%AA%E5%88%86%E3%81%8C%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E6%A0%AA%E4%BE%A1%E4%BA%88%E6%83%B3%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E3%82%92%E9%96%8B%E7%99%BA/). Since then I hadn't done much with either machine learning or stock prices, but over this year-end and New Year holiday I took the plunge and tried all sorts of things. The results turned out well enough that I now operate it as a service, so this article is about the technical background that got it there. The service itself is paid, and since some readers object on, so to speak, religious grounds to links to paid services, I won't link it here. Personally I think it's important to build structures that let engineers earn money appropriately; put another way, I'd be happy if we could point people toward services even when the deliverable is paid. But this isn't the place for that debate, so I'll refrain from linking the service directly. Still, some will feel they can't judge without seeing the actual service, so at the end of the article I'll put a link that leads to the service if you follow it a little. So much for the introduction; here is the final architecture.

resemble_stock_全体.png

In a nutshell, what this article wants to say is this: the machine-learning results came out well, and when I tried to turn them into a web service so they could be widely used, Docker turned out to be a perfect fit. That is what I'd like to introduce.

A wide range of technologies is used throughout, but only the basics of each, so I don't think detailed explanations are necessary. Still, I'm wary of proceeding as if everyone already knows everything, so let me say a little about each technology first.

Docker

Docker hardly needs an introduction at this point, but the story won't move forward if I skip it entirely, so let me cover just the surface, at the risk of oversimplifying. Docker is characterized by two things: virtualization with low overhead, and containerization of environments that can be stacked freely. Remember these two: virtualization and containerization. Container-based virtualization existed before Docker, and in fact Docker itself used an external container-virtualization library (LXC) in versions before 0.9, so containers are by no means Docker's exclusive patent; still, the association of Docker with container virtualization as its major feature is well established.

First, virtualization. Most traditional virtualization other than containers is built on the idea of a virtual machine: it tries to virtualize everything from the CPU up, behaving as if a whole machine existed for the execution environment (also called the guest environment). Docker instead virtualizes at the level of file input and output: by controlling the read/write layer it acts as if a dedicated machine were there, and that is how execution environments are separated. This is what keeps the overhead low.

Next, containerization. Think of it as freely setting rewind points that you return to whenever the power goes off. Imagine a PC where you set a rewind point on a clean Windows install, so a freshly installed Windows comes up every time you power off. That alone would be inefficient, so next you create another rewind point with Solitaire installed. The clean Windows, plus Windows-with-Solitaire stacked on top of it, correspond to a second layer of containers in Docker. Now suppose you want to install Netscape Navigator 4: you can install it on the clean Windows, or on top of the environment that already has Solitaire; you get to choose which environment to build on. That choice is the stacking of Docker layers. The convenience of containerization is that you can stack exactly as much as you need and have it restored whenever the power goes off.

Virtualization with low overhead, and containerization that lets you compose an environment freely and return to it at any time: that is what is great about Docker, and this time we take full advantage of it.

Machine learning

I proceeded with this configuration.

resebble_stock_python.png

In machine learning, you collect data, predict, compute, verify against the results, then predict again; development inevitably repeats this cycle of data collection and training. Designing from the very beginning so that it can be published as a service is difficult. In my case, I had only designed things so that each operation was separated into its own folder, and that alone made it easy afterwards to turn just the machine-learning part into an API with Flask. As a knack: when, while repeating the training runs, you gradually start to think it could be published as a service, write your functions with the API entry points in mind, and turning them into Flask routes later becomes easy. Also, since the backend described later is built separately, Flask can later stay closed off inside the Docker network, and it is convenient that the API side here needs only minimal sanitization and validation. Of course you must watch for vulnerabilities that allow the host to be taken over from the Docker guest side, but since the data is exchanged as already authenticated, being able to pass it through while doing almost nothing is a great advantage. Again, "doing almost nothing" is relative to minimal sanitization, validation, and authentication: use placeholders for SQL, execute nothing with root privileges, and do at least the normal things. The definition of what is normal matters, but that story spreads too wide, so let me omit it.
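As an illustration of "functions written with the API entrance in mind", here is a minimal sketch of mine, not the article's actual module: the experiment script calls a plain function directly, and the same function later becomes the body of a Flask route unchanged. The sqlite query, table name, and /data file layout are assumptions; only the function name matches the service code shown below.

# sketch of a search_stock-style entry point (illustrative, not the real module)
import sqlite3

def get_last_data_count():
  # a plain function: callable from the training loop during development,
  # and later returned as-is by the Flask /count route
  with sqlite3.connect('/data/1234.db') as conn:
    return conn.execute('SELECT COUNT(*) FROM prices').fetchone()[0]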

For reference, I will publish the code that is actually running. The code is sloppy, but if I keep being humble about the sloppy code the story won't go on, so let's move on. What I want to say is that if you can keep it properly closed off, you can run code that simply passes data through like this all the way to production.

service.py


# Imports and app setup reconstructed for completeness; search_stock and
# get_stock are the project's own modules that the routes below call into.
import json

from flask import Flask, jsonify, request

import get_stock
import search_stock

app = Flask(__name__)

# %%
@app.route('/')
def index():
  return jsonify({
    'status': 200,
    'data': {}
  })

# %%
@app.route('/count')
def count():
  return jsonify({
    'status':200,
    'data': search_stock.get_last_data_count()
  })

# %%
@app.route('/last-date')
def last_date():
  return jsonify({
    'status': 200,
    'data': search_stock.get_last_date()
  })

# %%
@app.route('/getstock/<code>')
def getstock(code):
  low = int(request.args.get('low'))
  high = int(request.args.get('high'))
  isStockExist = search_stock.get_stock_name(code)
  # note: the original compared strings with "is"; use equality instead
  if isStockExist is None or isStockExist == '':
    message = '{0} is not found'.format(code)
    return jsonify({'message': message}), 500
  data = search_stock.get_stock_list(low, high, code)
  return jsonify({
    'status': 200,
    'data': data
  })

# %%
@app.route('/getresemble/<code>', methods=['POST'])
def get_resemble(code):
  data = request.get_data()
  json_data = json.loads(data)
  stock_list = json_data['list']
  isStockExist = search_stock.get_stock_name(code)
  if isStockExist is None or isStockExist == '':  # equality, not "is"
    message = '{0} is not found'.format(code)
    return jsonify({'message': message}), 500
  result = get_stock.start(code, stock_list)
  return jsonify({
    'status': 200,
    'data': result,
  })

# %%
if __name__ == "__main__":
  app.run(host='0.0.0.0', port=5000, debug=True)
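For reference, here is how a client might call this API; a minimal sketch assuming the service is reachable at localhost:5000 (inside the Docker network you would use the container's service name instead). The stock code 1234 and the contents of the list are placeholders.

# call_service.py -- illustrative client for the API above (not from the article)
import requests

BASE = 'http://localhost:5000'

print(requests.get(BASE + '/count').json())
print(requests.get(BASE + '/last-date').json())
# /getstock/<code> expects low/high query parameters:
print(requests.get(BASE + '/getstock/1234', params={'low': 0, 'high': 100}).json())
# /getresemble/<code> expects a JSON body with a "list" key:
print(requests.post(BASE + '/getresemble/1234', json={'list': []}).json())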

Data part

I don't think it's good to turn everything into a Docker container, but I made the data a Docker container too. Initially the plan was to mount the data into Docker as an external volume, but as many of you know, on Linux a host volume mounted into Docker ends up owned by root on the host side. The only fixes are workarounds on either the container side or the host side, and in my case GitLab CI/CD runs as a gitlab-runner user, and in some environments gitlab-runner is executed via sudo; in short, which user runs things varies by environment, so I obediently locked the data inside Docker. The good thing about containerizing here is that the data's portability also went up, which let me build a powerful scraping setup together with the scraping part described next. For reference, here is the Dockerfile of the data part.

# data container: bake the sqlite files into the image and expose /data
FROM busybox
WORKDIR /data
COPY . ./
VOLUME [ "/data" ]

The nice thing about sqlite is that the database is just files, so as long as you pack the files in as they are, the whole DB environment travels with you. As for the table design of the stock-price data, there is the usual dilemma: one table per stock, or one table keyed by stock code and queried with things like GROUP BY code. Both have good and bad points, and it is hard to commit to either. This time I went with one DB per stock: to access stock code 1234, you open the DB for stock code 1234. Querying across all stocks then carries some overhead, but again, this has its good and bad sides, and it is simply what I chose this time. I would like to write in more detail about the good and bad of stock-price DB design if I get the chance.
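Concretely, "one DB per stock" means something like the following. A minimal sketch: the table name and columns are my assumptions; only the one-file-per-code layout, and the function name from service.py, are from the article.

# per-stock database: stock code 1234 lives in its own file, /data/1234.db
import sqlite3

def get_stock_list(low, high, code):
  with sqlite3.connect('/data/{0}.db'.format(code)) as conn:
    cur = conn.execute(
      'SELECT date, close FROM prices WHERE close BETWEEN ? AND ?',
      (low, high))  # SQL placeholders, as recommended earlier
    return cur.fetchall()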

The disadvantage of shipping the whole sqlite database inside a Docker image is that you have to rebuild the image every time the data is updated, so unless you manage the CI/CD environment well, images pile up endlessly. Portability is also not as high as it sounds, because it depends on whether the user's policy allows Docker-in-Docker (DinD): without DinD, the data has to be updated from a shell on the host, which means the execution environment ends up being chosen for you after all. If you don't like that, the only option is to dig deeper into DinD, but then there is a lot to weigh: whether a carefully tuned DinD setup should really be carried into production, whether it is more stable to keep things under a plain shell environment, or even whether rebuilding on Kubernetes with an RDBMS would be faster. Speaking very personally, DinD has a history of being used in rather unintended ways even while officially supported, and done badly it can become a security hole into the host OS; it is hard to say the ecosystem around DinD is mature, so I touch the Dockerfile from the shell as much as possible. It is without doubt a convenient and powerful technology, so my expectations are high. After a quick look at all of this, I concluded that baking the whole sqlite database into an image was fine when the priority is getting things moving first.

Scraping part

Since the data handled this time is stock prices, and I always want to scrape the latest data, improving the portability of the scraping environment was the top priority, so at first I chose a Docker container without hesitation. It is very convenient: without thinking about anything, the container scrapes when it starts and the whole container is discarded when it finishes, and everything up to that point can be automated. You can use that as-is from crontab, or, when the time comes, easily create an EC2 instance, start the container automatically when the instance boots, and terminate the instance when scraping completes. That was all good, but in the end I prepared the core of the scraping part in two forms: one that runs in Docker and one that runs directly in the shell. This is because Docker was not convenient in every environment. The problem of the Docker-mounted host volume becoming root-owned could not be avoided here either, and in such environments it was better to run in the host-side shell. As a result, the scraper supports three kinds of setups: running directly in the shell, covering only the Python runtime with Docker, or doing the whole scraping run inside a Docker container. A sketch of the run-once container pattern follows.
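Here is a minimal sketch of the "scrape on container start, exit when done" pattern; the parsing logic, schema, and code list are placeholders of mine, not the article's actual scraper.

# scrape_once.py -- runs when the container starts and exits when done,
# so the container itself is disposable (crontab- and EC2-friendly)
import sqlite3

def scrape(code):
  # fetch and parse the latest prices for one stock code;
  # the real scraping logic is omitted in this sketch
  return []  # list of (date, close) rows

def save(code, rows):
  # write into the per-stock sqlite file, matching the data-part layout
  with sqlite3.connect('/data/{0}.db'.format(code)) as conn:
    conn.execute('CREATE TABLE IF NOT EXISTS prices (date TEXT, close REAL)')
    conn.executemany('INSERT INTO prices VALUES (?, ?)', rows)

if __name__ == '__main__':
  for code in ['1234']:  # the codes to refresh
    save(code, scrape(code))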

Front end

As the machine learning progressed, it seemed a waste to keep this product to myself; I definitely wanted to publish it on the web and have it used widely! So the next step was front-end development. My personal guideline when choosing a framework is to pick a backend framework such as Laravel when the backend work is heavy, and otherwise a frontend framework whenever possible. Of course, depending on the project there may be constraints on which servers can be operated, so this does not always hold, but at least when I can choose freely, I try to build from the front end as much as possible. Here, the most complicated part, the machine learning, was already finished and no further heavy backend processing was needed, so I chose the Express.js + Nuxt.js configuration without hesitation.

resemble_stock_front.png

Again, I don't want to say it is safe merely because it is closed. But this is the only part visible from the outside, and unless someone takes some other route, such as attacking from the kernel, there is no way to know what is going on inside. I think the more experienced an engineer is, the easier this kind of separation makes the design.

I don't think it's good to turn everything into a Docker container, but in the end this part, too, became one. There is a great deal of trial and error during front-end development, and I did not want a workflow of containerize → start → check the logs inside the container; in particular, with the execution environment pinned to the registered domain, I could not simply run it with Node.js on the host, and that was the part I struggled with until the end. But ultimately I locked all of this into a Docker container as well. The reason is that exchanges with the backend are easily affected by the network environment, and I found it convenient to keep everything inside the Docker network. In the end, Docker publishes only ports 443 and 80 for https and http, and everything else is exchanged inside the Docker network.

AmazonPay

This was the hardest part. Amazon provides polite Japanese documentation and SDKs for various languages, so the shortfall was my own skill, but I spent an enormous amount of time on this development. First, the SDK comes in various languages, but there is none for Node.js, so trying to talk to the API directly from Express.js took time. I wanted to use axios to communicate with the API directly, but, perhaps because my searching was poor, I could only find simple samples of direct API communication, and it took a tremendous amount of trial and error to find the correct answer. On top of that, Amazon Pay only accepts communication over https from a pre-registered domain, so verifying anything at that point meant: merge into the develop branch just to add one console.log → wait for gitlab-runner to build the Docker container → after it starts, check the behavior in the browser → view the running container's logs with docker logs on the server over SSH. I realized that was not very realistic. I may find a better way in the future, but with the current flow the only workable approach was to build an internal API that hits Amazon's API through a provided SDK, so this time I chose the Python SDK; a sketch follows below.

The front end also gave me a little trouble. The Amazon Pay button is designed on the assumption that loading and drawing happen at the same time, so you cannot load just the module in advance and generate the UI at drawing time. Moreover, the module's discard-then-regenerate life cycle is not taken into account, so redrawing it each time is extremely difficult; I adjusted things to some extent using the Vue.js life cycle. It was a slightly uneasy implementation, because it is not the behavior the button's creators expected. For a payment button, where the viewer naturally moves back and forth to and from the cart screen, this is fatal, and the only way to redraw is to reload. Judging from such creator-side conveniences, the button (at least, the Scratchpad is still maintained from back then) seems to date from about four years ago, and I have to say the design is old. I have done nothing but complain here, but I think it is a payment service with great potential, and I would really like to see it improved.
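For what it's worth, the internal-API workaround looks roughly like this. A minimal sketch under stated assumptions: the endpoint path and helper are hypothetical, and the actual Amazon Pay Python SDK call is left as a placeholder; consult the SDK documentation for the real client class and methods.

# payments.py -- sketch of the internal payments API: Express.js (which had
# no official Node.js SDK) calls this Flask service, which would delegate to
# the Amazon Pay Python SDK. Names below are illustrative, not real code.
from flask import Flask, jsonify, request

app = Flask(__name__)

def confirm_with_sdk(order_reference_id):
  # placeholder for the actual Amazon Pay Python SDK call
  raise NotImplementedError

@app.route('/pay/confirm', methods=['POST'])
def confirm():
  body = request.get_json()
  result = confirm_with_sdk(body['order_reference_id'])
  return jsonify({'status': 200, 'data': result})

if __name__ == '__main__':
  # internal only: this port is never published outside the Docker network
  app.run(host='0.0.0.0', port=5001)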

What I wanted to say is that Docker made it easy to turn this into a service

This time, somewhat unexpectedly, everything ended up running in Docker containers, but I felt that this approach of building first without overthinking and designing the containers afterwards is a very good fit for a machine-learning service. And because it is containers, introducing Kubernetes when the service needs to scale later should be easy; I think things will progress well. I have introduced a broad range of technologies here, each only at the basic level, so depending on your viewpoint it may not be enough. But someone absorbed in machine learning may not know how easy deployment can be with Docker, and someone absorbed in Nuxt.js may come up with good ways to use machine learning. I hope this record of technologies used across the board gives you some hints. I did not touch on the machine-learning method itself this time; how to handle stock prices with machine learning is written up separately in the article here, so please refer to that as well if you like.
