Treating dotted lines as solid lines in camelot (Hough transform)

Introduction

camelot handles dotted lines poorly and often fails to detect them as table borders. When I looked into it, I found the reference article listed below.

camelot extracts ruling lines with OpenCV, so the page image can be rewritten before detection. I detected the dotted lines with the Hough transform, drew solid lines over them, and the table extraction then worked.

Reference

Using Python makes it easy to parse PDFs containing text ... there was a time when I thought so too

[Process the dotted line as a solid line with camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)

I will use the PDF with dotted lines from the article above:

https://github.com/mima3/yakusyopdf/blob/master/20200502/%E5%85%B5%E5%BA%AB%E7%9C%8C.pdf

Hough transform

Line detection with OpenCV's Hough transform

data1.png

Straight lines extracted with the Hough transform

houghline.png
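The reason a dotted line can come back as one long segment is that the probabilistic Hough transform links collinear points whose gap is at most `maxLineGap` and rejects segments shorter than `minLineLength`. The sketch below is only a simplified one-dimensional illustration of that idea on a single pixel row. It is not OpenCV's algorithm, and `bridge_dashes` is a hypothetical helper introduced for this example.

```python
def bridge_dashes(row, max_gap, min_length):
    """Merge runs of ink pixels in one row into segments.

    row: sequence of 0/1 values (1 = ink).
    Returns a list of (start, end) index pairs, end inclusive.
    """
    segments = []
    start = None  # first pixel of the segment being built
    last = None   # last ink pixel seen in that segment
    for x, value in enumerate(row):
        if not value:
            continue
        if start is None:
            start = last = x
        elif x - last - 1 <= max_gap:
            last = x  # gap is small enough: extend the segment
        else:
            segments.append((start, last))
            start = last = x  # gap is too large: start a new segment
    if start is not None:
        segments.append((start, last))
    # Reject segments that are too short to be a ruling line
    return [(s, e) for s, e in segments if e - s + 1 >= min_length]


# A dotted line: 3 ink pixels followed by 2 blank pixels, repeated 4 times
dotted = ([1] * 3 + [0] * 2) * 4
print(bridge_dashes(dotted, max_gap=2, min_length=10))  # [(0, 17)]
print(bridge_dashes(dotted, max_gap=1, min_length=10))  # []
```

With a gap allowance at least as large as the space between dots, the dots merge into one long segment. With a smaller allowance each dot stays separate and is discarded as too short.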

PDF listing the member stores of Go To Eat in Chiba

data1(1).png

Only the horizontal lines extracted with the Hough transform

houghline(1).png
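Keeping only horizontal (or only vertical) segments comes down to comparing the endpoints of each detected segment. The following is a minimal sketch in plain Python. The helper name `split_by_orientation` and the `tol` parameter are assumptions made for illustration, not part of OpenCV or camelot.

```python
def split_by_orientation(segments, tol=0):
    """Split (x1, y1, x2, y2) segments into horizontal and vertical lists.

    tol is the largest endpoint difference, in pixels, still treated
    as "the same" coordinate.
    """
    horizontal, vertical = [], []
    for x1, y1, x2, y2 in segments:
        if abs(y1 - y2) <= tol:
            horizontal.append((x1, y1, x2, y2))
        elif abs(x1 - x2) <= tol:
            vertical.append((x1, y1, x2, y2))
        # Anything else is slanted and is dropped
    return horizontal, vertical


# Example segments: horizontal, vertical, diagonal, almost horizontal
segments = [(0, 10, 500, 10), (40, 0, 40, 300), (0, 0, 100, 100), (5, 20, 400, 21)]
horizontal, vertical = split_by_orientation(segments, tol=1)
print(horizontal)  # [(0, 10, 500, 10), (5, 20, 400, 21)]
print(vertical)    # [(40, 0, 40, 300)]
```

With the result of `cv2.HoughLinesP`, where each entry wraps one segment, the input could be passed as `(line[0] for line in lines)`.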

Program

import cv2
import numpy as np

import camelot

# Create the patch: a drop-in replacement for camelot's adaptive_threshold

def my_threshold(imagename, process_background=False, blocksize=15, c=-2):

    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    lines = cv2.HoughLinesP(
        edges, rho=1, theta=np.pi / 180, threshold=80, minLineLength=3000, maxLineGap=50
    )

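    # HoughLinesP returns None when no line is found
    if lines is not None:
        for line in lines:
            x1, y1, x2, y2 = line[0]
            # To keep only horizontal lines, filter with `if y1 == y2`;
            # to keep only vertical lines, filter with `if x1 == x2`
            cv2.line(img, (x1, y1), (x2, y2), (0, 0, 0), 1)
            # The threshold below is computed from gray, so draw there as well
            cv2.line(gray, (x1, y1), (x2, y2), 0, 1)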

    if process_background:
        threshold = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, blocksize, c
        )
    else:
        threshold = cv2.adaptiveThreshold(
            np.invert(gray),
            255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blocksize,
            c,
        )
    return img, threshold

camelot.parsers.lattice.adaptive_threshold = my_threshold

tables = camelot.read_pdf("data.pdf", pages="all")

tables[0].df
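Assigning to `camelot.parsers.lattice.adaptive_threshold` replaces the function for the rest of the session. If the patch should apply to a single call only, one option is a small context manager that restores the original afterwards. This is a sketch under that assumption: `patched` is a hypothetical helper, and the demo uses a stand-in namespace instead of camelot so that it runs on its own.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched(owner, name, replacement):
    """Temporarily replace owner.<name> and restore it on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        # Runs even if the body raises, so the patch never leaks
        setattr(owner, name, original)


# Stand-in for a module such as camelot.parsers.lattice (illustration only)
fake_module = SimpleNamespace(adaptive_threshold=lambda: "original")

with patched(fake_module, "adaptive_threshold", lambda: "patched"):
    inside = fake_module.adaptive_threshold()
outside = fake_module.adaptive_threshold()

print(inside)   # patched
print(outside)  # original
```

With camelot this would be used as `with patched(camelot.parsers.lattice, "adaptive_threshold", my_threshold):` around the `camelot.read_pdf` call.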

Before applying the patch

The dotted lines are not detected, so the rows they should separate are merged vertically into single cells.

Screenshot_2020-11-04 Google Colaboratory(1).png

After applying the patch

Screenshot_2020-11-04 Google Colaboratory.png
