We will automate the data analysis of PES! part3
Hi, this is Yajun!
This article is a series, Part 3. If you like, I hope you can read it together.
- Part 1 article is a creation of character recognition software that extracts character data from match result images.
- article of part2 is to create software to graph the extracted data.
The software is written in Python. ■ Reference URL
■ Promotion
- Bot for event notification of PES I created it, so I hope you can use it.
- Thanks to the content dissemination, I was able to register on the e-sports human resources introduction site. Link
- In the future, we plan to work on Youtube in conjunction with analysis software. Link
■ Introduction
- Through our activities so far, we have created software from data extraction to graph analysis, aiming to add new functions!
"Collect more data and do a new analysis!"
"Let's find a strategy for PES!"
However, a problem has greatly reduced development efficiency ...
** The accuracy of the character recognition software I created was 95%! ** **
▼ The reason why 95% accuracy is not good
- Posted the flow of data analysis
If the accuracy of "② Character data extraction" is 95%, the error of the remaining 5% must be corrected.
It feels like fixing several places with the data for one game.
Let's say you want to continue the analysis service for 10 customers for 6 months.
Targeting professionals and distributors, suppose you get data for 2000 games a year.
10 people x 2000 games (1 year) x 0.5 (half a year) = 10000 games
Even if you fix a few games in one game, it will go bankrupt if it is 10,000 games. .. ..
Therefore, ** 100% data extraction accuracy is a mandatory requirement. ** **
■ Goal
- The goal was to "make the data extraction accuracy from the PES match result image 100%".
■ Method
▼ Method: OCR
- At first, based on OCR, we made efforts to achieve 100% accuracy.
OCR is a technology that extracts character data from characters contained in images.
In conclusion, performance could not be achieved.
** However, I will leave my experience for engineers who want to use OCR. ** **
The accuracy of OCR is composed of the following factors.
① Input data noise
(2) Tool library performance
③ Operation mode
④ Noise removal from the output result
① Input data
- It is difficult to improve the image quality etc. because the image is acquired by the screenshot function attached from PS4.
The quality of the input data is important because OCR extracts features from an object.
(2) Tool library performance
- The one adopted from Part 1 is tesseract-ocr.
It is a free library, often used in python, and there are many sites that you can refer to.
- Next, I considered [Google Vision API](https://cloud.google.com/vision/?hl=ja&utm_source=google&utm_medium=cpc&utm_campaign=japac-JP-all-ja-dr-bkws-all-all -trial-b-dr-1008074 & utm_content = text-ad-none-none-DEV_c-CRE_285865409470-ADGP_Hybrid +% 7C + AW + SEM +% 7C + BKWS + ~ + T1 +% 7C + BMM +% 7C + ML +% 7C + M: 1 + It is% 7C + JP +% 7C + ja +% 7C + Vision +% 7C + API +% 7C + en-KWID_43700016119073988-kwd-154107679017 & userloc_1009288 & utm_term = KW_% 2Bgoogle% 20% 2Bvision% 20% 2Bapi & gclid = Cj0KCQiAkePyBRCEARIsAMy5SctboKWtsTtMGinWFY2UB6glu5SBsspKPxXs3o6MKp25rJ0cb18Ih0caAtheEALw_wcB).
What a god specification that you can run a demo on the tool introduction page! As expected, Google. .. .. I wanted to join the company. .. ..
Immediately, I tried to insert the match result image of PES.
-
It's a cool UI! Information is provided for each tab.
Labal
Logos
Web
Text
Properties
Safe Search
-
The focus is on the "Text" tab, which recognizes the character data from the image.
** From the conclusion, it seems that the expected performance is not achieved. ** **
The place recognized as a character is framed in green, but the numerical data such as "number of crosses" is not recognized.
(I was personally impressed to identify the soccer team from the logo!)
- I thought about other libraries, but it is likely to be useless considering the performance of Google teacher.
I honestly decided to proceed with development at Tesseract.
③ Operation mode
- Operation mode can be specified in the OCR library.
The analysis target is "is it one character?" "Is it a character string?" "Is it vertical writing?" "Is it horizontal writing?" "Is it a number?" "Is it an alphabet?"
If the output result is not a number, change the operation mode and re-execute.
④ Noise removal from the output result
- There may be noise such as "." Or [|] in the final character string.
Therefore, write software to delete characters other than numeric data from the character string.
** The result of doing this is a recognition accuracy of 95%. .. .. ** **
For those who want to see the actual output result, [■ Result](https://qiita.com/junya0001000/items/8f519cdb3846fcec397a#%E7%B5%90%E6%9E%9C--%E5%87%BA%E5 % 8A% 9B% E7% B5% 90% E6% 9E% 9C)
▼ Method: Template Matching
- ** Come on! ** **
Template Matching is one of the image processing to search for a specific image from an image. The library uses Opencv.
The input data, "PES Match Results," has an advantageous feature for Template Matching.
the detail is right below.
Does not rotate
Do not distort
The position of each data is unified
Unified font
Unified size
・ Preparation
- Now that you know the features, let's move on!
Collect "numerical data images" from PES match results as template matching data.
(It's troublesome !!)
・ Demo play
- Execute Template Matching with the collected image data as Template.
The function used is cv2.matchTemplate ().
Write and execute the demo code that is surrounded by a green frame for smaller numbers and a red frame for larger numbers.
It definitely exceeds the performance of the Google Vision API! !!
Save the data as if there is a corresponding number in the place where the frame is displayed.
At that time, it is a mechanism to extract data from the image by letting you judge "which parameter number" depending on the location.
·Detection sensitivity
- In this section, I will write a technical story for engineers.
For Template Matching, it is "detection sensitivity" that bothers engineers.
False positive and false negative changes depending on the "detection sensitivity".
Taking this software as an example
False positives: Judging that there is a number in a place where there is no number
False negative: In fact, where the number is "yes", it is judged that the number is "no".
"Detection sensitivity" can adjust this.
"You should adjust the detection sensitivity so that both false positives and false negatives disappear!"
You might think, but it's usually not possible!
The reason is that "there are often false-positive and false-negative detection sensitivity sections."
- So what do you do?
** Set the detection sensitivity to "only one of false positives and false negatives". ** **
After that, write software to find and eliminate fake from the result.
If the sensitivity is such that false positives and false negatives coexist, the code becomes complicated.
By the way, this software is swinging to the false positive side.
This is because false negatives cannot be detected because the number of objects (number of numbers) is indefinite.
▼ Logic for detecting false negatives ▼
Number of numbers that should be-Number of numbers that could be detected = False negative if there is a difference
■ Results
- I tried running the software using the following 5 images as input images.
▼ Result: Input image
▼ Result: Output result
- Output data is output in CSV.
To make the difference easy to understand, we will also compare the results of "improved OCR software".
Compared to the correct answer data, the incorrect cells are displayed in color.
** Achieved 100% correct answer rate! !! Wow (holiday) **
■ Conclusion
- Thank you for reading this far!
Since the character recognition accuracy was 100%, we were able to fully automate everything from data extraction to graph creation!
With this, you can analyze any number of games ♪
Also, software design that considers false positives / false negatives may have been a learning experience for engineers as well.
I hope it helps someone.
We will continue to add functions and improve performance to the software.
Stay tuned for part4!
■ Reference URL
- Play with Python + OpenCV (OCR)
- Get the number of dimensions, shape and size (total number of elements) of NumPy array ndarray
- Try Google Cloud Vision API
- Template matching
- Remove list (array) elements in Python
- How to use the abs function to find the absolute value