- I have been personally tracking the number of downloads and the number of positive registrations of the contact confirmation app (COCOA) in a graph. Every day around 18:00 I visited the official website, entered the data into Google Sheets, and updated the graph.
- This routine became a burden, so I wondered whether such a simple task could be automated.
- The information posted by the Ministry of Health, Labor and Welfare is an image of text. I figured that if this image could be analyzed automatically, the whole workflow could be automated, so I built a trial system.
Ministry of Health, Labor and Welfare COCOA special site (as of 8/11)
- Because the information is not posted as text data, I had to fetch the image and run character recognition (OCR) on it. I considered two OCR tools: "GCP Cloud Vision" and "**Tesseract**".
- I had read that Tesseract is easy to use from Python through the PyOCR library, so I adopted it this time and checked its recognition accuracy. In the future, I would like to try Cloud Vision as well and compare the recognition accuracy of the two.
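As a rough illustration of the OCR step, here is a minimal sketch of calling Tesseract through PyOCR. It is not the exact code of this system: the helper names `pick_tool` and `ocr_image`, the `jpn` language pack, and the layout mode 6 are my assumptions, and Tesseract itself must be installed separately.

```python
def pick_tool(tools):
    """Return the first available OCR tool, or fail with a clear message."""
    if not tools:
        raise RuntimeError("No OCR tool found; is Tesseract installed and on PATH?")
    return tools[0]


def ocr_image(image_path, lang="jpn"):
    """Run OCR on one image file and return the recognized text."""
    # Imported lazily so this module still loads where the OCR stack is absent.
    import pyocr
    import pyocr.builders
    from PIL import Image

    tool = pick_tool(pyocr.get_available_tools())
    return tool.image_to_string(
        Image.open(image_path),
        lang=lang,  # assumption: the Japanese language data is installed
        # assumption: layout 6 ("a single uniform block of text") suits this image
        builder=pyocr.builders.TextBuilder(tesseract_layout=6),
    )
```

Calling `ocr_image("cocoa.png")` (a hypothetical file name) would return the recognized text as a single string, which can then be passed to the data extraction step.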
- Scraping function
- OCR function
- Data extraction function
- Function to write data to Google Sheets
- Function to fetch an image of the graph created in Google Sheets (material for the tweet)
- Function to post a graph of COCOA usage status to Twitter (operation not yet confirmed, because I have not obtained API access)
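The scraping function at the top of this list can be sketched with the standard library alone. This is only an outline under my own assumptions: the helper names are hypothetical, and deciding which of the returned images actually contains the usage figures still has to be done by inspecting the page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class _ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_urls(html, base_url):
    """Return absolute URLs of all images referenced in the HTML."""
    parser = _ImgCollector()
    parser.feed(html)
    return [urljoin(base_url, src) for src in parser.sources]


def fetch_html(url):
    """Download a page and decode it with the charset the server declares."""
    with urlopen(url, timeout=30) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

In practice, `extract_image_urls(fetch_html(page_url), page_url)` would give the candidate image URLs, and the selected image would then be downloaded and handed to the OCR function.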
Tesseract is open source software that runs on a variety of operating systems and is distributed under the Apache License 2.0. It provides a character recognition library and a command-line interface built on it. Since version 4.0, it includes a recognition engine based on an LSTM neural network in addition to the conventional recognition engine. Developer: Google (from Wikipedia)
- Before OCR processing (image obtained from the site)
- After OCR processing
The contact confirmation app is currently distributed as version "1.1.2" for both iOS and Android.
If you are using an older version of the app, please go to the App Store or Google Play.
Please search for "Approved App" and update.
The number of downloads is about 12.05 million in total as of 17:00 on August 7.
・ It is the total number of both iOS and Android.
・ If you delete it after downloading and download it again, it will be counted multiple times.
There is a match.
The number of positive registrations is 165 in total as of 17:00 on August 7.
The only recognition error is that the "app" on the second line was recognized as "appli". This error does not affect the extraction of the number of downloads or the number of positive registrations. I ran OCR on several images, and the results were fairly stable, with the data extracted accurately each time.
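To show how the two figures can be pulled out of the OCR text, here is a minimal regular-expression sketch. The Japanese sentence patterns are my reconstruction of the wording on the site, not a confirmed format, and `extract_counts` is a hypothetical helper name, so the patterns would need adjusting if the actual text differs.

```python
import re

# Assumed Japanese wording of the two sentences; the real OCR text may differ.
DOWNLOADS_RE = re.compile(r"ダウンロード数は.*?約([\d,]+)万件")
POSITIVES_RE = re.compile(r"陽性登録件数は.*?合計で([\d,]+)件")


def extract_counts(ocr_text):
    """Return the download and positive-registration counts found in the text."""
    # Tesseract often inserts spaces between Japanese characters,
    # so remove all whitespace before matching.
    text = re.sub(r"\s+", "", ocr_text)
    downloads = DOWNLOADS_RE.search(text)
    positives = POSITIVES_RE.search(text)
    if not downloads or not positives:
        raise ValueError("expected sentences not found in OCR text")
    return {
        # The site reports downloads in units of 万 (10,000).
        "downloads": int(downloads.group(1).replace(",", "")) * 10_000,
        "positives": int(positives.group(1).replace(",", "")),
    }


sample = (
    "ダウンロード数は、8月7日17:00現在、合計で約1,205万件です。\n"
    "陽性登録件数は、8月7日17:00現在、合計で165件です。"
)
print(extract_counts(sample))  # {'downloads': 12050000, 'positives': 165}
```

Because the patterns key on the fixed phrases around the numbers, a misrecognized character elsewhere in the text does not break the extraction; raising an error when a sentence is missing avoids silently writing a wrong value to the spreadsheet.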
- The system runs normally except for the Twitter posting function, and automating everything from data extraction to graph creation has simplified the update work.
- Currently, only posting the tweet is done manually.
- As a remaining task, I would like to enable the tweet function once Twitter API access is granted, so that the whole process from analysis to publication is automated.
Graph of changes in the number of downloads and the number of positive registrations automatically acquired from Google Sheet
- Ministry of Health, Labor and Welfare special site "New Coronavirus Contact Confirmation Application (COCOA) COVID-19 Contact-Confirming Application" https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/cocoa_00138.html
- About the introduction and usage of Tesseract https://rightcode.co.jp/blog/information-technology/python-tesseract-image-processing-ocr