(Added on 2017/08/16)
This article reflects my knowledge at the time it was written. It contains some obvious mistakes, so please take it with a grain of salt.
It's still only mid-February, but the temperature has been in the double digits every day and it feels like spring. I imagine the news of "sakura saku" (exam success) has started to reach some of the people reading this. ~~Meanwhile, I am on a five-loss streak in job hunting.~~
By the way, about three weeks ago came the shocking news that "Google's Go AI 'AlphaGo' has beaten a professional Go player for the first time in history." Go had been considered an unsolved problem, and just the other day I had been talking about how it "could be used for evaluation," so I was very surprised to learn that the matter had already been settled back in October of last year. And at the same time, I thought:
"Aren't there a lot of problems that have already been solved, and we just don't know it?"
It may sound surprising, but some games have in fact already been mastered by machines that learned to play better than humans (there is a very detailed account here). The Deep Q-Network used in that video is a 2013 method, and the paper at the time used a very simple model. Meanwhile, in the field of image processing, research has advanced considerably over the last few years, and models far more complex than the one used in Deep Q-Network have proven effective. The AlphaGo result is also a case of feeding such image-processing progress back into games. So I thought: "If we bring in image-processing techniques on the level of AlphaGo, or more advanced ones, it might be possible to clear complicated action games that have not yet been attempted, without using any internal game information."
What follows is a record of the two-plus weeks in which one student thought this and acted on it.
Deep Q-Network and similar studies mainly use games for the Atari 2600, released in 1977, as benchmarks. This is partly because libraries for it are well established, but the main reason is that games more complex than the Atari 2600 are harder to clear. 3D action games in particular are considered difficult because of the difficulty of processing three-dimensional information.
So I focused on 2D shooters as something harder than the Atari 2600, yet easier and more tractable than 3D. Being 2D, they do not require as much information processing as 3D games, but they demand more complexity and speed than Atari 2600 games. Among them, I chose as my target Touhou Konjuden ~ Legacy of Lunatic Kingdom, whose trial version is released for free and easy to obtain.
~~Actually, the bigger reason is simply that I like the Touhou Project.~~
Roughly speaking, the mechanism exchanges screenshots of the game screen between a Windows machine and an Ubuntu machine, and sends operations back to the game.
A simple video is available on Twitter.
The main programs, except for the model, are published on GitHub... However, it is still a prototype with very finicky settings, so I don't think it is suitable as a basis for building anything.
The client takes a screenshot with PIL, converts it to a numpy array, and sends the image over a socket.
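As a rough illustration, here is a minimal sketch of that client step. The capture region, server address, and the length-prefixed protocol are assumptions for this example, not the published code.

```python
import socket
import struct
import numpy as np
from PIL import ImageGrab

# Hypothetical values -- adjust to your game window and server.
GAME_REGION = (0, 0, 640, 480)        # (left, upper, right, lower) of the game window
SERVER_ADDR = ('192.168.0.2', 50007)  # the Ubuntu server

def send_frame(sock):
    """Grab the game screen, turn it into a numpy array, and send it length-prefixed."""
    img = ImageGrab.grab(bbox=GAME_REGION)      # PIL screenshot (Windows only)
    frame = np.asarray(img, dtype=np.uint8)     # H x W x 3 RGB array
    data = frame.tobytes()
    sock.sendall(struct.pack('>I', len(data)))  # 4-byte length prefix
    sock.sendall(data)

if __name__ == '__main__':
    sock = socket.create_connection(SERVER_ADDR)
    send_frame(sock)
    sock.close()
```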
When the server receives it over the socket, it formats the image with numpy and OpenCV, and Chainer performs the death judgment and decides the action. If no death is detected, the chosen action is returned over the socket. If a death is detected, a meta message is returned instead and the episode ends. Depending on the situation, training with Chainer is also started at this point.
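A minimal sketch of the server's receive-and-preprocess step, matching the client sketch above, might look like this. The frame shape, the 84x84 downscaling, and the meta message are assumptions for illustration.

```python
import struct
import numpy as np
import cv2

FRAME_SHAPE = (480, 640, 3)   # must match the client's capture region (assumption)
DEATH_MESSAGE = b'DEAD'       # hypothetical meta message that ends the episode

def recv_exact(conn, n):
    """Read exactly n bytes from the socket."""
    buf = b''
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('client closed the connection')
        buf += chunk
    return buf

def recv_frame(conn):
    """Receive one length-prefixed frame and preprocess it for the models."""
    (length,) = struct.unpack('>I', recv_exact(conn, 4))
    frame = np.frombuffer(recv_exact(conn, length), dtype=np.uint8).reshape(FRAME_SHAPE)
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)  # OpenCV: RGB -> grayscale
    small = cv2.resize(gray, (84, 84))              # downscale before feeding the network
    return small.astype(np.float32) / 255.0
```

The preprocessed frame is then passed to the death detector and, if the player is still alive, to the action evaluator; otherwise `DEATH_MESSAGE` is sent back and training can be triggered.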
In response, the client sends the corresponding key input with SendInput using DirectInput-style scancodes (the input system DirectX games read). When the death-judgment meta message comes back, it pauses the operation and waits: it keeps waiting while training is taking place, and resumes play otherwise.
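For reference, here is a hedged sketch of sending a key press with SendInput via ctypes. The Win32 structures and flags are standard, but the key set and how it is driven are illustrative, not the prototype's actual code.

```python
import ctypes
from ctypes import wintypes

user32 = ctypes.WinDLL('user32', use_last_error=True)

INPUT_KEYBOARD = 1
KEYEVENTF_SCANCODE = 0x0008
KEYEVENTF_KEYUP = 0x0002

# Hardware scancodes for the keys used here (assumed values; check your layout).
# Arrow keys are extended keys: scancodes 0x48/0x50/0x4B/0x4D plus KEYEVENTF_EXTENDEDKEY.
SCANCODE = {'z': 0x2C, 'shift': 0x2A}

ULONG_PTR = ctypes.POINTER(ctypes.c_ulong)

class KEYBDINPUT(ctypes.Structure):
    _fields_ = (('wVk', wintypes.WORD), ('wScan', wintypes.WORD),
                ('dwFlags', wintypes.DWORD), ('time', wintypes.DWORD),
                ('dwExtraInfo', ULONG_PTR))

class MOUSEINPUT(ctypes.Structure):
    _fields_ = (('dx', wintypes.LONG), ('dy', wintypes.LONG),
                ('mouseData', wintypes.DWORD), ('dwFlags', wintypes.DWORD),
                ('time', wintypes.DWORD), ('dwExtraInfo', ULONG_PTR))

class _INPUTUNION(ctypes.Union):
    _fields_ = (('ki', KEYBDINPUT), ('mi', MOUSEINPUT))

class INPUT(ctypes.Structure):
    _fields_ = (('type', wintypes.DWORD), ('union', _INPUTUNION))

def send_key(name, keyup=False):
    """Press or release one key as a scancode event, which DirectInput games can read."""
    flags = KEYEVENTF_SCANCODE | (KEYEVENTF_KEYUP if keyup else 0)
    ki = KEYBDINPUT(wVk=0, wScan=SCANCODE[name], dwFlags=flags,
                    time=0, dwExtraInfo=None)
    inp = INPUT(type=INPUT_KEYBOARD, union=_INPUTUNION(ki=ki))
    user32.SendInput(1, ctypes.byref(inp), ctypes.sizeof(INPUT))
```

A key is held by calling `send_key('z')` and released with `send_key('z', keyup=True)`; holding a combination just means sending several press events before the releases.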
In Touhou Konjuden, the words "Capture failure" are displayed when you die, so I use this to detect death. Specifically, the server crops the region where this text appears, and a simple three-layer model judges whether the crop is a normal gameplay image or not. This could be judged with better than 99% accuracy (though it does occasionally misjudge).
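The model itself is not included in the published code, so here is a minimal sketch of what such a three-layer classifier could look like in Chainer; the layer sizes are assumptions.

```python
import chainer
import chainer.functions as F
import chainer.links as L

class DeathDetector(chainer.Chain):
    """Sketch of a 3-layer MLP classifying the cropped region:
    class 0 = normal gameplay, class 1 = the "Capture failure" text is shown."""

    def __init__(self, n_in, n_hidden=256):
        super(DeathDetector, self).__init__(
            l1=L.Linear(n_in, n_hidden),
            l2=L.Linear(n_hidden, n_hidden),
            l3=L.Linear(n_hidden, 2),
        )

    def __call__(self, x):
        h = F.relu(self.l1(x))
        h = F.relu(self.l2(h))
        return self.l3(h)  # logits; wrap with L.Classifier for softmax cross entropy
```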
For a single frame of the image, the action with the highest evaluation is selected, as shown above. There are 18 action patterns in total, listed below (an enumeration sketch follows the list).

- z (shot button)
- z (shot button) + one of 8 directions
- z (shot button) + SHIFT (slow movement)
- z (shot button) + SHIFT (slow movement) + one of 8 directions
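Counting them out: 1 + 8 + 1 + 8 = 18. A small sketch of how these combinations could be enumerated (the key names are my own labels):

```python
# Enumerate the 18 actions: the shot button is always held,
# optionally combined with SHIFT (slow movement) and one of 8 directions.
DIRECTIONS = ['up', 'down', 'left', 'right',
              'up_left', 'up_right', 'down_left', 'down_right']

ACTIONS = []
for slow in (False, True):                     # without / with SHIFT
    for direction in [None] + DIRECTIONS:      # stay still or move in one of 8 directions
        keys = ['z']                           # shot button is always pressed
        if slow:
            keys.append('shift')
        if direction is not None:
            keys.extend(direction.split('_'))  # e.g. 'up_left' -> ['up', 'left']
        ACTIONS.append(tuple(keys))

assert len(ACTIONS) == 18
```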
As for the evaluation itself, it took the form of learning from combinations of actions and survival time, and then estimating values from them.
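The article does not spell out the exact formulation, but one way to read "combining actions and survival time" is to regress, for each frame and the action taken there, how long the player survived afterwards, and then pick the action with the highest predicted value at play time. A sketch under that assumption:

```python
import numpy as np

def make_targets(frames, actions, episode_length):
    """For each (frame, action) pair of one episode (both lists have
    episode_length entries), the target is the remaining survival time
    in frames -- an assumed reading of the article, not its exact method."""
    remaining = np.arange(episode_length, 0, -1, dtype=np.float32)
    return list(zip(frames, actions, remaining))

def choose_action(model, frame):
    """At play time, pick the action whose predicted value is highest."""
    values = model(frame[np.newaxis])   # shape (1, 18): one value per action
    return int(np.argmax(values.data))  # Chainer Variable -> underlying array
```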
I tried various combinations on Lunatic (the highest difficulty), but could not get past the first stage.
As you can see from the simple video above, unless RANDOM is chosen, the action is decided deterministically. As a result, enemy attacks inevitably hit the player character and it dies. This is probably because the learning is insufficient.
In the video above, I tried interleaving training of the evaluation each time a certain amount of play had finished. The result shown comes from training on about 2,000 images, going over them 5 times each. Even by the standards of ordinary image processing, this cannot be called sufficient training.
However, training with the 100-plus-layer models currently used in image processing requires a considerable amount of computation. In my environment, it took nearly an hour to go over all the images even once, even with only a few thousand of them. A single pass is quite enough to solve a simple classification problem, but as far as this attempt goes, things were apparently not that easy.
To push the learning further, I think a library for parallel GPU computation, such as the soon-to-be-released "Distributed TensorFlow", will be necessary. The article above introduces an example of parallel processing achieving a 300x speedup with 500 machines. If that 300x speedup were available, the problem above could be solved. (In fact, with it, AlphaGo, which would require nearly 700 days of computation, could be trained in only about 2 days.) [^1]
[^1]: Having had a bit of time, I went back and checked various documents, and I now suspect that this was a misunderstanding on my part and that distributed training had already existed for some two years. I will leave the original text as-is for a while until I can confirm. (Added on 2016/3/13)
However, buying 500 Titan X cards, the most powerful GPUs on the market today, would by itself cost over 70 million yen. That is not an amount a poor student can pay, but could this amount of computation be managed somehow?
~~I want to get a job before that, but can I really...~~
- History of DQN + Deep Q-Network written in Chainer
- Take a screenshot with Pillow for Python
- Transfer images from Raspberry Pi with OpenCV
- Simulate Python keypresses for controlling a game
> AlphaGo uses a technique called Deep Q-Network.
In the first draft, the sentence above appeared at the beginning of this article. However, I received a comment pointing out that "reinforcement learning is used in the process of optimizing the model, but Deep Q-Network is not, so this is a clear error," and I have corrected it.
I would like to take this opportunity to apologize for posting incorrect information.