Introduction

The page below lists the institutions that offer online medical care in response to the spread of the new coronavirus infection.

https://www.mhlw.go.jp/stf/seisakunitsuite/bunya/kenkou_iryou/iryou/rinsyo/index_00014.html

Let's use this list to build a tool that shows which online medical care services are available nearby. Along the way, we will also examine how usable the PDF data provided by the government really is.

Deliverables

https://needtec.sakura.ne.jp/yakusyopdf/

https://github.com/mima3/yakusyopdf

If you enter the longitude and latitude and search ...

A list of nearby hospitals is displayed. Click a row.

Then detailed information will be displayed on the map.

Description

Process flow

(1) Obtain a PDF from the homepage of the list of institutions that support online medical care based on the spread of the new coronavirus infection.

(2) Extract the table information from each PDF and convert it to JSON. This step is described on the following page: [Convert PDF of Ministry of Health, Labor and Welfare to CSV or JSON](https://needtec.sakura.ne.jp/wod07672/2020/04/29/%e5%8e%9a%e7%94%9f%e5%8a%b4%e5%83%8d%e7%9c%81%e3%81%aepdf%e3%82%92csv%e3%82%84json%e3%81%ab%e5%a4%89%e6%8f%9b%e3%81%99%e3%82%8b/)

(3) Combine the JSON of each prefecture into one JSON.

(4) Calculate and record the longitude and latitude from the address in JSON using the Yahoo! Geocoder API.

(5) Store it in the database and display it on the screen based on that information.
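Step (3) can be sketched as follows. This is a minimal illustration, not the code used in the repository: it assumes each prefecture's JSON file holds a list of record objects, and the function name `combine_prefecture_json` and the `source_file` field are hypothetical.

```python
import json
from pathlib import Path


def combine_prefecture_json(input_dir, output_path):
    """Merge every per-prefecture JSON file in input_dir into one JSON list."""
    combined = []
    # Sort the file names so the output order is reproducible.
    for path in sorted(Path(input_dir).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            records = json.load(f)  # assumed: a list of dicts, one per institution
        for record in records:
            merged = dict(record)
            # Hypothetical extra field: remember which file the record came from.
            merged["source_file"] = path.name
            combined.append(merged)
    with open(output_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Japanese text readable in the output file.
        json.dump(combined, f, ensure_ascii=False, indent=2)
    return combined
```

Write the combined file outside `input_dir`, otherwise a second run would read its own output as if it were another prefecture.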

Problems when extracting tables from PDF.

To tell the truth, extracting table data from a PDF is quite troublesome. There are libraries such as tabula and camelot, but they are not enough on their own. This section describes the problems that occurred when using camelot.

Cells containing long text are not separated correctly.

For example, suppose the text in a cell extends beyond the cell boundary, as shown below.

In this case, the phone number and URL are detected as a single cell. As a workaround, when a required item is missing, we check the cells to its left and right to see whether it has been merged into one of them. However, if the target is an unformatted data column such as a zip code or phone number, it cannot be restored.
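The neighbor check described above might look like the sketch below. It is an illustration under assumptions, not the repository's code: a row is assumed to be a list of cell strings, the URL is the required item, and `split_merged_url` and `URL_PATTERN` are hypothetical names.

```python
import re

# A URL has a recognizable shape, so it can be pulled back out of a merged cell.
URL_PATTERN = re.compile(r"https?://\S+")


def split_merged_url(row, url_index):
    """If the URL cell is empty, try to recover the URL from a neighboring cell."""
    cells = list(row)
    if cells[url_index].strip():
        return cells  # the required item is present, nothing to repair
    for neighbor in (url_index - 1, url_index + 1):
        if not 0 <= neighbor < len(cells):
            continue
        text = cells[neighbor]
        match = URL_PATTERN.search(text)
        if match:
            # Move the URL into its own column and keep the rest in place.
            cells[url_index] = match.group(0)
            cells[neighbor] = (text[:match.start()] + text[match.end():]).strip()
            break
    return cells
```

For a made-up row such as `["Clinic", "03-1234-5678 https://example.com/", ""]` with `url_index=2`, the URL moves to the third cell and the phone number stays in the second.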

Memory usage issues

camelot consumes a lot of memory, so it is better to process a PDF page by page. The following issue is a useful reference: https://github.com/camelot-dev/camelot/issues/28

Also, because of the page size of the PDFs handled this time, a 32-bit process runs out of memory even when processing one page at a time, so it is better to run in a 64-bit process.
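A minimal sketch of page-by-page processing is shown below. It assumes the caller already knows the number of pages, and the helper name `extract_tables_page_by_page` and its `read_pdf` parameter are hypothetical; only `camelot.read_pdf` with its `pages` and `flavor` arguments and the `df` attribute of each table come from camelot itself.

```python
def extract_tables_page_by_page(pdf_path, page_count, read_pdf=None):
    """Yield (page_number, DataFrame) pairs, reading one page per call."""
    if read_pdf is None:
        import camelot  # imported lazily so the helper can be tested without it
        read_pdf = camelot.read_pdf
    for page in range(1, page_count + 1):
        # Passing a single page number means only that page is parsed in this call.
        tables = read_pdf(pdf_path, pages=str(page), flavor="lattice")
        for table in tables:
            yield page, table.df
        del tables  # drop the reference before the next page is parsed
```

Because the function is a generator, each page's tables can be converted and written out before the next page is read, instead of holding the whole document in memory.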

Dotted lines cannot be detected.

As mentioned in the following issue, camelot does not recognize a table well when its ruled lines are dotted.

Detect dotted line #370 https://github.com/atlanhq/camelot/issues/370

Here's how to solve this problem: [Treat dotted lines as solid lines with camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)

Problems with PDF of the list of supported organizations

These PDFs are extremely troublesome, and I think it is impossible to analyze them fully automatically.

Paper size

Processing takes time because some pages are about 1 m long on one side. The common approach of converting the PDF to Word first is also impossible because the pages are too large.

The PDF of the list of supported medical institutions is updated daily, and the URL also changes.

The PDF of the list of supported medical institutions is updated daily, and the URL seems to change with each update. For example, the URL for Tokyo changed as follows.

As of April 28, 2020 https://www.mhlw.go.jp/content/000625693.pdf

As of April 29, 2020 https://www.mhlw.go.jp/content/000626106.pdf

For this reason, the PDF URL must be obtained each time from the links on the page for online medical care based on the spread of the new coronavirus infection.
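Collecting the current PDF links from that page could be sketched with the standard library as follows. The markup of the real page is not shown here, so this assumes each PDF is linked by an ordinary `<a href="....pdf">` element whose link text identifies the prefecture; `PdfLinkParser` and `find_pdf_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect (link text, absolute URL) pairs for every link to a PDF."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".pdf"):
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def find_pdf_links(html, base_url):
    """Return all PDF links found in the given HTML text."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

The HTML itself can be fetched with `urllib.request.urlopen` before each run, so that the latest URL for each prefecture is picked up instead of a hard-coded one.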

Presence or absence of legend

The first line of data may or may not contain a legend. There is a legend in the case of Tokyo, but not in Hokkaido.

In other words, it is necessary to adjust the acquisition position of the data row on the first page for each prefecture.

The handling of page headers differs for each prefecture.

For example, compare the PDFs of Tokyo and Ibaraki prefectures. The header is included in the second and subsequent pages of Tokyo, but not in Ibaraki.

In other words, it is necessary to adjust the acquisition position of the data row on the second and subsequent pages for each prefecture. Moreover, this is not always the case even in the same prefecture.

In fact, until April, Hokkaido had header rows on the second and subsequent pages. This is probably because the settings used when the file is exported vary from time to time.
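One way to manage the legend and header differences is a small settings table that records how many rows to skip on the first page and on later pages. The sketch below is only an illustration: `LAYOUTS`, `DEFAULT_LAYOUT`, and `data_rows` are hypothetical names, and the row counts are placeholders that must be checked against the current PDFs.

```python
# Placeholder values for illustration only; verify them against each PDF.
DEFAULT_LAYOUT = {"first_page_skip": 1, "other_pages_skip": 0}
LAYOUTS = {
    # Assumed: legend row plus header row on page 1, header repeated afterwards.
    "Tokyo": {"first_page_skip": 2, "other_pages_skip": 1},
    # Assumed: header row only on page 1, no header on later pages.
    "Hokkaido": {"first_page_skip": 1, "other_pages_skip": 0},
}


def data_rows(rows, prefecture, page):
    """Drop the legend/header rows for the given prefecture and page number."""
    layout = LAYOUTS.get(prefecture, DEFAULT_LAYOUT)
    if page == 1:
        skip = layout["first_page_skip"]
    else:
        skip = layout["other_pages_skip"]
    return rows[skip:]
```

Since the layout can change even within one prefecture, a table like this has to be reviewed whenever the PDFs are updated.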

The handling of items is different for each prefecture.

For example, compare Hokkaido with Aichi and Yamanashi prefectures.

Hokkaido

Aichi Prefecture

Yamanashi Prefecture

The columns can vary from prefecture to prefecture, and even when an item is common to all of them, its position differs and must be adjusted. Furthermore, the items are not always the same even within one prefecture. In fact, until April, Yamanashi Prefecture did not separate telephone consultation from online consultation.
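The column differences can be absorbed by mapping each prefecture's column positions onto common field names. This is a hedged sketch: `COLUMN_MAPS` and `normalize_row` are hypothetical names, and the column indexes are placeholders, not the positions in the real PDFs.

```python
# Placeholder column positions for illustration only; read the real ones off each PDF.
COLUMN_MAPS = {
    "Hokkaido": {"name": 1, "address": 3, "phone": 4},
    "Yamanashi": {"name": 0, "address": 2, "phone": 3},
}


def normalize_row(cells, prefecture):
    """Convert one table row into a dict that uses common field names."""
    column_map = COLUMN_MAPS[prefecture]
    record = {}
    for field, index in column_map.items():
        # A short row yields an empty string instead of raising IndexError.
        record[field] = cells[index].strip() if index < len(cells) else ""
    return record
```

Looking up an unknown prefecture raises `KeyError` on purpose, so that a PDF with an unreviewed layout is not converted silently.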

Notation variation

There is an item "whether or not medical treatment is carried out using the telephone etc. for the first visit". In many cases 〇 or × (or blank) is written, but the notation is inconsistent. For example, annotations such as the following may be added.

○
* Scheduled to be done in the future

You might think it is enough to take only the first character, but the characters used are themselves diverse. At least at the moment, the following notation variations exist.

**Notations indicating that medical treatment using the telephone etc. is provided for the first visit**

| Character | UTF-8 bytes (hex) |
| --- | --- |
| 〇 | E38087 |
| ○ | E2978B |
| ◯ | E297AF |
| △ | E296B3 |
| 可 ("Yes") | E58FAF |
| ● | E2978F |
| ▲ | E296B2 |

**Notations indicating that medical treatment using the telephone etc. is not provided for the first visit**

| Character | UTF-8 bytes (hex) |
| --- | --- |
| (blank) | |
| × | C397 |
| ｘ | EFBD98 |
| ☓ | E29893 |
| ✕ | E29C95 |
| X | 58 |
| - | 2D |
| － | EFBC8D |
| Ｘ | EFBCB8 |
| ✖ | E29C96 |
| 否 ("No") | E590A6 |
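The two tables above can be turned into a small normalization function. The sketch below looks only at the first character of the cell, so trailing annotations are ignored, and compares it against both character sets; the byte sequences E58FAF and E590A6 correspond to the characters 可 and 否. The names `AVAILABLE_MARKS`, `UNAVAILABLE_MARKS`, and `first_visit_available` are hypothetical, and returning `None` for unknown marks is a design assumption.

```python
# Characters taken from the two tables above.
AVAILABLE_MARKS = set("〇○◯△可●▲")
UNAVAILABLE_MARKS = set("×ｘ☓✕X-－Ｘ✖否")


def first_visit_available(cell):
    """Return True/False for known marks, or None when the mark is unknown."""
    text = cell.strip()
    if not text:
        return False  # a blank cell is treated as "not provided"
    mark = text[0]  # annotations after the mark are ignored
    if mark in AVAILABLE_MARKS:
        return True
    if mark in UNAVAILABLE_MARKS:
        return False
    return None  # unknown notation: leave it for manual review
```

Returning `None` instead of guessing makes it easy to list the cells that use a notation not yet covered, which matters because new variations can appear with any update.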

Summary

This time, I converted the PDFs published by the Ministry of Health, Labor and Welfare into a form that a computer can process easily, and created a web application using the result.

The data can be converted automatically up to a certain point, but as long as the source is PDF, full automation is impossible. Fixing the results by hand is also hard to sustain because the data is updated so frequently.

**At the very least, if you need accurate data and it is updated frequently, it is safer to avoid extracting data from PDFs as was done this time.**

Also, if you are in a position to publish data, I would appreciate it if you could consider the following points.

- Can the data be published in a format other than PDF?
  - For anyone who wants to use the data, Excel is better. A PDF may look the same, but it is much harder to process.
- For data that may be updated, include the last update date.
- Avoid changing the format for each prefecture.
  - I think the differences are made in good faith to make the data easier to read, but they make machine processing harder.
- Unify the notation as much as possible.

That's all.

I tried using PDF data of online medical care based on the spread of the new coronavirus infection