Extracting data from a PDF file by image analysis

Introduction

Open data is a hot topic right now, but it is not always published as raw data such as CSV, so it can be subtly difficult to handle. Of course, releasing data that was not available before deserves praise in itself, and "I really want raw data" is only a wish on top of that. Publishing as PDF is therefore not a bad thing, but to actually use the data you need to convert it into something more machine-readable than PDF. In this article I introduce the procedure for creating CSV data from the PDF "About the congestion situation in the car during the morning rush hour" released by the Sapporo City Transportation Bureau.

Approach

The PDF file obtained from the above website has the following format (screenshot: スクリーンショット 2020-03-18 21.38.35.png).

The data is at least structured, and it looks like an Excel sheet exported as PDF. The first idea that comes to mind is to parse the PDF structure directly, but that looked laborious: for example, the Japanese text stored in the table came out garbled during parsing. So I devised the procedure shown in the image below.

スクリーンショット 2020-03-18 21.38.35.JPG

Using image-editing software was too much trouble, so I drew it by hand on my iPad. In short, after looking at several of the PDFs I found that the top-left cell is always at the same position in both the left and right tables, and that all cells are the same size. So if you read the RGB value (the color) at each red dot in the image and convert it to a congestion level, you get the desired data.

Conclusion

The implementation is available here: https://github.com/Kanahiro/sapporo_subway_analyze/

The output CSV looks like this (screenshot: スクリーンショット 2020-03-18 22.07.30.png).

You can see that red cells become 4, white cells become 0, and blue cells become 2. Now let's follow the steps in order: read the PDF → convert it to an image → get the color of the specified pixels → derive each cell's congestion level from its color → write the result to a CSV file.

Convert from PDF to image

Reference: Python PDF processing summary (combination / division, image conversion, password cancellation)

I use pdf2image; see the article above for how to use it. The article says to run pip install poppler, but poppler currently cannot be installed with pip. On Linux, install it as follows (other operating systems are omitted here):

sudo apt install poppler-utils

Get the RGB value of the specified pixel from the image

The page images returned by pdf2image can be converted to a NumPy array. The result is a three-dimensional array: a two-dimensional grid that matches the pixel layout, with an RGB triple stored at each position.

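import numpy as np
from pdf2image import convert_from_path

# convert_from_path is a pdf2image function that returns one image per page
pdf_images = convert_from_path(pdffile)  # pdffile: path to the PDF
img_array = np.asarray(pdf_images[0])  # first page, shape (height, width, 3)
'''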
img_array sample

[[[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

(Omitted)

 ...

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]

 [[255 255 255]
  [255 255 255]
  [255 255 255]
  ...
  [255 255 255]
  [255 255 255]
  [255 255 255]]]
'''

The margins of the PDF are white, so it is no surprise that every value shown is [255 255 255], that is, white. The image is now stored in an array, pixel by pixel.

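A specific pixel (data_of_pixel) is accessed as follows. Here START_CELL is the pixel position of the top-left cell, CELL_SIZE is the size of one cell in pixels, and c and r are the column and row index of the cell.

# c, r: column and row index of the cell
x = START_CELL[0] + c * CELL_SIZE[0]  # x coordinate
y = START_CELL[1] + r * CELL_SIZE[1]  # y coordinate
data_of_pixel = img_array[y][x]  # rows come first, so y precedes x

Now you can get the RGB value at any position in the converted PDF image.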
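To make the coordinate arithmetic concrete, here is a minimal sketch that samples one pixel per cell from a synthetic image. The layout values (START_CELL, CELL_SIZE, N_ROWS, N_COLS) and the helper sample_cells are hypothetical and chosen only for illustration; for the real PDF they would have to be measured on the rendered page.

```python
import numpy as np

# Hypothetical layout values for illustration only; the real ones must be
# measured on the rendered page.
START_CELL = (10, 20)  # (x, y) pixel position of the top-left cell
CELL_SIZE = (30, 15)   # (width, height) of one cell in pixels
N_ROWS, N_COLS = 3, 4  # hypothetical table size


def sample_cells(img_array, start=START_CELL, size=CELL_SIZE,
                 n_rows=N_ROWS, n_cols=N_COLS):
    """Return an (n_rows, n_cols, 3) array holding one RGB sample per cell."""
    xs = start[0] + np.arange(n_cols) * size[0]  # x coordinate of each column
    ys = start[1] + np.arange(n_rows) * size[1]  # y coordinate of each row
    # The image array is indexed row first, so ys come before xs.
    return img_array[np.ix_(ys, xs)]


# Synthetic white "page" with a single red cell at row 1, column 2.
page = np.full((100, 200, 3), 255, dtype=np.uint8)
page[35:50, 70:100] = (237, 32, 36)

samples = sample_cells(page)
print(samples.shape)            # (3, 4, 3)
print(samples[1, 2].tolist())   # [237, 32, 36]
```

Sampling all cells in one indexing operation avoids a nested Python loop, but a plain loop over r and c gives the same result.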

Determining the congestion level

As the image above shows, congestion increases in the order white, light blue, blue, yellow, red. However, the PDF turned out to use slightly different RGB values in the legend at the upper right and in the data area. Fortunately, the colors of the individual stages differ widely, so I classify each cell by the size of the difference between its RGB value and the RGB values of the legend.

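from typing import Optional

import numpy as np

# RGB values in the congestion legend, from least (0) to most (4) crowded
CROWD_RGBs = [
    [255, 255, 255],  # white
    [112, 200, 241],  # light blue
    [57, 83, 164],    # blue
    [246, 235, 20],   # yellow
    [237, 32, 36],    # red
]


def rgb_to_type(rgb_list) -> Optional[int]:
    # Color difference threshold
    threshold = 50
    # Cast to int so that uint8 pixel values cannot wrap around on subtraction
    color_array = np.asarray(rgb_list, dtype=int)
    for i, crowd_rgb in enumerate(CROWD_RGBs):
        # Color difference: sum of the absolute per-channel differences
        sum_dist = np.abs(color_array - np.asarray(crowd_rgb)).sum()
        if sum_dist < threshold:
            return i  # congestion level, 0 to 4
    return None  # no legend color is close enough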

Passing a list of RGB values to this function returns the congestion level as an integer from 0 to 4. The function compares the RGB value of the pixel obtained earlier with each RGB value in the congestion legend. The color difference is defined as the sum of the absolute differences of the R, G, and B values, and the first legend color whose difference is below 50 determines the congestion level. If no legend color is that close, the function returns None.
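As a quick check of the claim that the stages are far apart, the sketch below computes the same color difference between every pair of legend colors. The helper color_diff, the constant THRESHOLD, and the sample pixel [240, 30, 40] are made up for this illustration; the legend values are the ones listed above.

```python
from itertools import combinations

import numpy as np

# Legend colors as listed in the article
CROWD_RGBs = np.array([
    [255, 255, 255],
    [112, 200, 241],
    [57, 83, 164],
    [246, 235, 20],
    [237, 32, 36],
])
THRESHOLD = 50


def color_diff(a, b):
    """Sum of absolute per-channel differences between two RGB values."""
    return int(np.abs(np.asarray(a, dtype=int) - np.asarray(b, dtype=int)).sum())


# A hypothetical slightly-off red pixel: |240-237| + |30-32| + |40-36| = 9
print(color_diff([240, 30, 40], [237, 32, 36]))  # 9

# Smallest difference between any two legend colors
min_gap = min(color_diff(a, b) for a, b in combinations(CROWD_RGBs, 2))
print(min_gap)                   # 212 (white vs. light blue)
print(min_gap >= 2 * THRESHOLD)  # True
```

Because the smallest gap between two legend colors (212) is at least twice the threshold, a pixel can be within 50 of at most one legend color, so the order in which the legend is scanned does not affect the result.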

Applying this to every cell gives the congestion level of the whole table, and writing those values out as CSV produces the CSV data shown at the beginning.
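The final step can be sketched with the standard csv module. This is a minimal illustration, not the code from the repository: the helper levels_to_csv and the sample values are hypothetical, and any header rows or label columns in the real output are left out.

```python
import csv
import io


def levels_to_csv(levels):
    """Serialize a 2D list of congestion levels (0 to 4) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows(levels)
    return buffer.getvalue()


# Hypothetical result for a table with 2 rows and 3 columns
levels = [
    [0, 1, 2],
    [4, 3, 0],
]
csv_text = levels_to_csv(levels)
print(csv_text)
```

To write to a file instead, the same csv.writer can be given a file opened with open(path, "w", newline="").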

In closing

This kind of processing is a familiar story in the open data world. Someone once described it as "making rice out of rice cakes", so perhaps I can call myself a rice maker? Still, without the primary data (the rice cake) there would be no rice to make, so I have nothing but gratitude for it. Thank you, as always. Now that the amount of data has grown, I wonder whether data quality will be the next stage...