I tried to build a service that sells machine-learned data at explosive speed with Docker

About three years ago I wrote the article [Results of developing a machine learning stock price forecasting program by myself as a WEB shop](http://masamunet.com/2016/11/09/web%E5%B1%8B%E3%81%AE%E8%87%AA%E5%88%86%E3%81%8C%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92%E6%A0%AA%E4%BE%A1%E4%BA%88%E6%83%B3%E3%83%97%E3%83%AD%E3%82%B0%E3%83%A9%E3%83%A0%E3%82%92%E9%96%8B%E7%99%BA/). Since then I hadn't done much with either machine learning or stock prices, but over this year-end and New Year holiday I took the plunge and tried all sorts of things. The results turned out well enough that I now operate it as a service, so this article is about the technical background that got it there. The service itself is paid, and since some readers object on, so to speak, religious grounds to links to paid services, I won't link it here. Personally I think it's important to build structures that let engineers earn money appropriately; put another way, I'd be happy if we could point people toward services even when the deliverable is paid. But this isn't the place for that debate, so I'll refrain from linking the service directly. Still, some will feel they can't judge without seeing the actual service, so at the end of the article I'll put a link that leads to the service if you follow it a little. So much for the introduction; here is the final architecture.

resemble_stock_全体.png

In a nutshell, what this article wants to say is this: the machine-learning results came out well, and when I tried to turn them into a web service so they could be widely used, Docker turned out to be a perfect fit. That is what I'd like to introduce.

A wide range of technologies is used throughout, but only the basics of each, so I don't think detailed explanations are necessary. Still, I'm wary of proceeding as if everyone already knows everything, so let me say a little about each technology first.

Docker

Docker hardly needs an introduction at this point, but the story won't move forward if I skip it entirely, so let me cover just the surface, at the risk of oversimplifying. Docker is characterized by two things: virtualization with low overhead, and containerization of environments that can be stacked freely. Remember these two: virtualization and containerization. Container-based virtualization existed before Docker, and in fact Docker itself used an external container-virtualization library (LXC) in versions before 0.9, so containers are by no means Docker's exclusive patent; still, the association of Docker with container virtualization as its major feature is well established.

First, virtualization. Most traditional virtualization other than containers is built on the idea of a virtual machine: it tries to virtualize everything from the CPU up, behaving as if a whole machine existed for the execution environment (also called the guest environment). Docker instead virtualizes at the level of file input and output: by controlling the read/write layer it acts as if a dedicated machine were there, and that is how execution environments are separated. This is what keeps the overhead low.

Next, containerization. Think of it as freely setting rewind points that you return to whenever the power goes off. Imagine a PC where you set a rewind point on a clean Windows install, so a freshly installed Windows comes up every time you power off. That alone would be inefficient, so next you create another rewind point with Solitaire installed. The clean Windows, plus Windows-with-Solitaire stacked on top of it, correspond to a second layer of containers in Docker. Now suppose you want to install Netscape Navigator 4: you can install it on the clean Windows, or on top of the environment that already has Solitaire; you get to choose which environment to build on. That choice is the stacking of Docker layers. The convenience of containerization is that you can stack exactly as much as you need and have it restored whenever the power goes off.

Virtualization with low overhead, and containerization that lets you compose an environment freely and return to it at any time: that is what is great about Docker, and this time we take full advantage of it.

Machine learning

I proceeded with this configuration.

resebble_stock_python.png

In machine learning, you collect data, predict, compute, verify against the results, then predict again; development inevitably repeats this cycle of data collection and training. Designing from the very beginning so that it can be published as a service is difficult. In my case, I had only designed things so that each operation was separated into its own folder, and that alone made it easy afterwards to turn just the machine-learning part into an API with Flask. As a knack: when, while repeating the training runs, you gradually start to think it could be published as a service, write your functions with the API entry points in mind, and turning them into Flask routes later becomes easy. Also, since the backend described later is built separately, Flask can later stay closed off inside the Docker network, and it is convenient that the API side here needs only minimal sanitization and validation. Of course you must watch for vulnerabilities that allow the host to be taken over from the Docker guest side, but since the data is exchanged as already authenticated, being able to pass it through while doing almost nothing is a great advantage. Again, "doing almost nothing" is relative to minimal sanitization, validation, and authentication: use placeholders for SQL, execute nothing with root privileges, and do at least the normal things. The definition of what is normal matters, but that story spreads too wide, so let me omit it.
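As an illustration of "functions written with the API entrance in mind", here is a minimal sketch of mine, not the article's actual module: the experiment script calls a plain function directly, and the same function later becomes the body of a Flask route unchanged. The sqlite query, table name, and /data file layout are assumptions; only the function name matches the service code shown below.

# sketch of a search_stock-style entry point (illustrative, not the real module)
import sqlite3

def get_last_data_count():
  # a plain function: callable from the training loop during development,
  # and later returned as-is by the Flask /count route
  with sqlite3.connect('/data/1234.db') as conn:
    return conn.execute('SELECT COUNT(*) FROM prices').fetchone()[0]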

For reference, I will publish the code that is actually running. The code is sloppy, but if I keep being humble about the sloppy code the story won't go on, so let's move on. What I want to say is that if you can keep it properly closed off, you can run code that simply passes data through like this all the way to production.

service.py


# Imports and app setup reconstructed for completeness; search_stock and
# get_stock are the project's own modules that the routes below call into.
import json

from flask import Flask, jsonify, request

import get_stock
import search_stock

app = Flask(__name__)

# %%
@app.route('/')
def index():
  return jsonify({
    'status': 200,
    'data': {}
  })

# %%
@app.route('/count')
def count():
  return jsonify({
    'status':200,
    'data': search_stock.get_last_data_count()
  })

# %%
@app.route('/last-date')
def last_date():
  return jsonify({
    'status': 200,
    'data': search_stock.get_last_date()
  })

# %%
@app.route('/getstock/<code>')
def getstock(code):
  low = int(request.args.get('low'))
  high = int(request.args.get('high'))
  isStockExist = search_stock.get_stock_name(code)
  # note: the original compared strings with "is"; use equality instead
  if isStockExist is None or isStockExist == '':
    message = '{0} is not found'.format(code)
    return jsonify({'message': message}), 500
  data = search_stock.get_stock_list(low, high, code)
  return jsonify({
    'status': 200,
    'data': data
  })

# %%
@app.route('/getresemble/<code>', methods=['POST'])
def get_resemble(code):
  data = request.get_data()
  json_data = json.loads(data)
  stock_list = json_data['list']
  isStockExist = search_stock.get_stock_name(code)
  if isStockExist is None or isStockExist == '':  # equality, not "is"
    message = '{0} is not found'.format(code)
    return jsonify({'message': message}), 500
  result = get_stock.start(code, stock_list)
  return jsonify({
    'status': 200,
    'data': result,
  })

# %%
if __name__ == "__main__":
  app.run(host='0.0.0.0', port=5000, debug=True)
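For reference, here is how a client might call this API; a minimal sketch assuming the service is reachable at localhost:5000 (inside the Docker network you would use the container's service name instead). The stock code 1234 and the contents of the list are placeholders.

# call_service.py -- illustrative client for the API above (not from the article)
import requests

BASE = 'http://localhost:5000'

print(requests.get(BASE + '/count').json())
print(requests.get(BASE + '/last-date').json())
# /getstock/<code> expects low/high query parameters:
print(requests.get(BASE + '/getstock/1234', params={'low': 0, 'high': 100}).json())
# /getresemble/<code> expects a JSON body with a "list" key:
print(requests.post(BASE + '/getresemble/1234', json={'list': []}).json())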

Data part

I don't think it's good to turn everything into a Docker container, but I made the data a Docker container too. Initially the plan was to mount the data into Docker as an external volume, but as many of you know, on Linux a host volume mounted into Docker ends up owned by root on the host side. The only fixes are workarounds on either the container side or the host side, and in my case GitLab CI/CD runs as a gitlab-runner user, and in some environments gitlab-runner is executed via sudo; in short, which user runs things varies by environment, so I obediently locked the data inside Docker. The good thing about containerizing here is that the data's portability also went up, which let me build a powerful scraping setup together with the scraping part described next. For reference, here is the Dockerfile of the data part.

# data container: bake the sqlite files into the image and expose /data
FROM busybox
WORKDIR /data
COPY . ./
VOLUME [ "/data" ]

The nice thing about sqlite is that the database is just files, so as long as you pack the files in as they are, the whole DB environment travels with you. As for the table design of the stock-price data, there is the usual dilemma: one table per stock, or one table keyed by stock code and queried with things like GROUP BY code. Both have good and bad points, and it is hard to commit to either. This time I went with one DB per stock: to access stock code 1234, you open the DB for stock code 1234. Querying across all stocks then carries some overhead, but again, this has its good and bad sides, and it is simply what I chose this time. I would like to write in more detail about the good and bad of stock-price DB design if I get the chance.
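Concretely, "one DB per stock" means something like the following. A minimal sketch: the table name and columns are my assumptions; only the one-file-per-code layout, and the function name from service.py, are from the article.

# per-stock database: stock code 1234 lives in its own file, /data/1234.db
import sqlite3

def get_stock_list(low, high, code):
  with sqlite3.connect('/data/{0}.db'.format(code)) as conn:
    cur = conn.execute(
      'SELECT date, close FROM prices WHERE close BETWEEN ? AND ?',
      (low, high))  # SQL placeholders, as recommended earlier
    return cur.fetchall()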

The disadvantage of shipping the whole sqlite database inside a Docker image is that you have to rebuild the image every time the data is updated, so unless you manage the CI/CD environment well, images pile up endlessly. Portability is also not as high as it sounds, because it depends on whether the user's policy allows Docker-in-Docker (DinD): without DinD, the data has to be updated from a shell on the host, which means the execution environment ends up being chosen for you after all. If you don't like that, the only option is to dig deeper into DinD, but then there is a lot to weigh: whether a carefully tuned DinD setup should really be carried into production, whether it is more stable to keep things under a plain shell environment, or even whether rebuilding on Kubernetes with an RDBMS would be faster. Speaking very personally, DinD has a history of being used in rather unintended ways even while officially supported, and done badly it can become a security hole into the host OS; it is hard to say the ecosystem around DinD is mature, so I touch the Dockerfile from the shell as much as possible. It is without doubt a convenient and powerful technology, so my expectations are high. After a quick look at all of this, I concluded that baking the whole sqlite database into an image was fine when the priority is getting things moving first.

Scraping part

Since the data handled this time is stock prices, and I always want to scrape the latest data, improving the portability of the scraping environment was the top priority, so at first I chose a Docker container without hesitation. It is very convenient: without thinking about anything, the container scrapes when it starts and the whole container is discarded when it finishes, and everything up to that point can be automated. You can use that as-is from crontab, or, when the time comes, easily create an EC2 instance, start the container automatically when the instance boots, and terminate the instance when scraping completes. That was all good, but in the end I prepared the core of the scraping part in two forms: one that runs in Docker and one that runs directly in the shell. This is because Docker was not convenient in every environment. The problem of the Docker-mounted host volume becoming root-owned could not be avoided here either, and in such environments it was better to run in the host-side shell. As a result, the scraper supports three kinds of setups: running directly in the shell, covering only the Python runtime with Docker, or doing the whole scraping run inside a Docker container. A sketch of the run-once container pattern follows.
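Here is a minimal sketch of the "scrape on container start, exit when done" pattern; the parsing logic, schema, and code list are placeholders of mine, not the article's actual scraper.

# scrape_once.py -- runs when the container starts and exits when done,
# so the container itself is disposable (crontab- and EC2-friendly)
import sqlite3

def scrape(code):
  # fetch and parse the latest prices for one stock code;
  # the real scraping logic is omitted in this sketch
  return []  # list of (date, close) rows

def save(code, rows):
  # write into the per-stock sqlite file, matching the data-part layout
  with sqlite3.connect('/data/{0}.db'.format(code)) as conn:
    conn.execute('CREATE TABLE IF NOT EXISTS prices (date TEXT, close REAL)')
    conn.executemany('INSERT INTO prices VALUES (?, ?)', rows)

if __name__ == '__main__':
  for code in ['1234']:  # the codes to refresh
    save(code, scrape(code))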

Front end

As the machine learning progressed, it seemed a waste to keep this product to myself; I definitely wanted to publish it on the web and have it used widely! So the next step was front-end development. My personal guideline when choosing a framework is to pick a backend framework such as Laravel when the backend work is heavy, and otherwise a frontend framework whenever possible. Of course, depending on the project there may be constraints on which servers can be operated, so this does not always hold, but at least when I can choose freely, I try to build from the front end as much as possible. Here, the most complicated part, the machine learning, was already finished and no further heavy backend processing was needed, so I chose the Express.js + Nuxt.js configuration without hesitation.

resemble_stock_front.png

Again, I don't want to say it is safe merely because it is closed. But this is the only part visible from the outside, and unless someone takes some other route, such as attacking from the kernel, there is no way to know what is going on inside. I think the more experienced an engineer is, the easier this kind of separation makes the design.

I don't think it's good to turn everything into a Docker container, but in the end this part, too, became one. There is a great deal of trial and error during front-end development, and I did not want a workflow of containerize → start → check the logs inside the container; in particular, with the execution environment pinned to the registered domain, I could not simply run it with Node.js on the host, and that was the part I struggled with until the end. But ultimately I locked all of this into a Docker container as well. The reason is that exchanges with the backend are easily affected by the network environment, and I found it convenient to keep everything inside the Docker network. In the end, Docker publishes only ports 443 and 80 for https and http, and everything else is exchanged inside the Docker network.

AmazonPay

This was the hardest part. Amazon provides polite Japanese documentation and SDKs for various languages, so the shortfall was my own skill, but I spent an enormous amount of time on this development. First, the SDK comes in various languages, but there is none for Node.js, so trying to talk to the API directly from Express.js took time. I wanted to use axios to communicate with the API directly, but, perhaps because my searching was poor, I could only find simple samples of direct API communication, and it took a tremendous amount of trial and error to find the correct answer. On top of that, Amazon Pay only accepts communication over https from a pre-registered domain, so verifying anything at that point meant: merge into the develop branch just to add one console.log → wait for gitlab-runner to build the Docker container → after it starts, check the behavior in the browser → view the running container's logs with docker logs on the server over SSH. I realized that was not very realistic. I may find a better way in the future, but with the current flow the only workable approach was to build an internal API that hits Amazon's API through a provided SDK, so this time I chose the Python SDK; a sketch follows below.

The front end also gave me a little trouble. The Amazon Pay button is designed on the assumption that loading and drawing happen at the same time, so you cannot load just the module in advance and generate the UI at drawing time. Moreover, the module's discard-then-regenerate life cycle is not taken into account, so redrawing it each time is extremely difficult; I adjusted things to some extent using the Vue.js life cycle. It was a slightly uneasy implementation, because it is not the behavior the button's creators expected. For a payment button, where the viewer naturally moves back and forth to and from the cart screen, this is fatal, and the only way to redraw is to reload. Judging from such creator-side conveniences, the button (at least, the Scratchpad is still maintained from back then) seems to date from about four years ago, and I have to say the design is old. I have done nothing but complain here, but I think it is a payment service with great potential, and I would really like to see it improved.
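For what it's worth, the internal-API workaround looks roughly like this. A minimal sketch under stated assumptions: the endpoint path and helper are hypothetical, and the actual Amazon Pay Python SDK call is left as a placeholder; consult the SDK documentation for the real client class and methods.

# payments.py -- sketch of the internal payments API: Express.js (which had
# no official Node.js SDK) calls this Flask service, which would delegate to
# the Amazon Pay Python SDK. Names below are illustrative, not real code.
from flask import Flask, jsonify, request

app = Flask(__name__)

def confirm_with_sdk(order_reference_id):
  # placeholder for the actual Amazon Pay Python SDK call
  raise NotImplementedError

@app.route('/pay/confirm', methods=['POST'])
def confirm():
  body = request.get_json()
  result = confirm_with_sdk(body['order_reference_id'])
  return jsonify({'status': 200, 'data': result})

if __name__ == '__main__':
  # internal only: this port is never published outside the Docker network
  app.run(host='0.0.0.0', port=5001)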

What I wanted to say is that Docker made it easy to turn this into a service

This time, somewhat unexpectedly, everything ended up running in Docker containers, but I felt that this approach of building first without overthinking and designing the containers afterwards is a very good fit for a machine-learning service. And because it is containers, introducing Kubernetes when the service needs to scale later should be easy; I think things will progress well. I have introduced a broad range of technologies here, each only at the basic level, so depending on your viewpoint it may not be enough. But someone absorbed in machine learning may not know how easy deployment can be with Docker, and someone absorbed in Nuxt.js may come up with good ways to use machine learning. I hope this record of technologies used across the board gives you some hints. I did not touch on the machine-learning method itself this time; how to handle stock prices with machine learning is written up separately in the article here, so please refer to that as well if you like.
