Introduction

camelot struggles with tables drawn with dotted lines and often fails to extract them. When I looked into it, I found the reference article listed below.

camelot detects table lines with OpenCV, so it seemed possible to redraw the dotted lines before detection. I detected the dotted lines with a Hough transform, drew solid lines over them, and the extraction worked.
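To give a feel for why a probabilistic Hough transform can treat a dotted rule as one line, here is a conceptual one-dimensional sketch. It is not OpenCV's algorithm; `merge_dashes` is a hypothetical helper, and its `max_gap` and `min_length` arguments only mimic the roles of `maxLineGap` and `minLineLength` in `cv2.HoughLinesP`.

```python
def merge_dashes(dashes, max_gap, min_length):
    """Join 1-D dashes (start, end) separated by gaps <= max_gap,
    then keep only merged runs that are at least min_length long."""
    merged = []
    for start, end in sorted(dashes):
        if merged and start - merged[-1][1] <= max_gap:
            # The gap is small enough: extend the previous run.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # The gap is too large (or this is the first dash): start a new run.
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_length]


# A dotted rule: 10-pixel dashes separated by 5-pixel gaps.
dotted = [(x, x + 10) for x in range(0, 300, 15)]

print(merge_dashes(dotted, max_gap=5, min_length=200))  # [(0, 295)]
print(merge_dashes(dotted, max_gap=4, min_length=200))  # []
```

With a gap allowance that covers the spacing between dashes, the whole rule survives as one long segment. With a smaller allowance, every dash stays short and is discarded.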

References

Parsing PDFs that contain text is easy with Python ... or so I once thought

[Process the dotted line as a solid line with camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)

I will use the PDF with dotted lines from that article:

https://github.com/mima3/yakusyopdf/blob/master/20200502/%E5%85%B5%E5%BA%AB%E7%9C%8C.pdf

Hough transform

Line detection with OpenCV's Hough transform

Straight line extraction with Hough transform

PDF listing the Go To Eat member stores in Chiba

data1(1).png

Extract only horizontal straight lines by Hough transform

houghline(1).png
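Keeping only the horizontal segments can be sketched as below. This is a minimal example, assuming the segments use the layout that `cv2.HoughLinesP` returns (each item wraps one `x1, y1, x2, y2` tuple). `split_by_orientation` and its `tol` parameter are hypothetical names introduced here for illustration.

```python
def split_by_orientation(lines, tol=0):
    """Split segments into (horizontal, vertical) lists.

    `lines` is assumed to follow the cv2.HoughLinesP layout: a sequence
    whose items each wrap one (x1, y1, x2, y2) segment. None, which
    HoughLinesP returns when nothing is found, is treated as empty.
    """
    horizontal, vertical = [], []
    for line in (lines if lines is not None else []):
        x1, y1, x2, y2 = (int(v) for v in line[0])
        if abs(y1 - y2) <= tol:
            # Same y at both ends (within tol pixels): horizontal.
            horizontal.append((x1, y1, x2, y2))
        elif abs(x1 - x2) <= tol:
            # Same x at both ends (within tol pixels): vertical.
            vertical.append((x1, y1, x2, y2))
        # Anything else is diagonal and is dropped.
    return horizontal, vertical


sample = [[[0, 10, 500, 10]], [[20, 0, 20, 400]], [[0, 0, 300, 300]]]
h, v = split_by_orientation(sample)
print(h)  # [(0, 10, 500, 10)]
print(v)  # [(20, 0, 20, 400)]
```

A small nonzero `tol` may help on scanned pages where rules are tilted by a pixel or two.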

Program

import cv2
import numpy as np

import camelot

# Patch: replacement for camelot's adaptive_threshold

def my_threshold(imagename, process_background=False, blocksize=15, c=-2):

    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    edges = cv2.Canny(gray, 50, 150, apertureSize=3)

    lines = cv2.HoughLinesP(
        edges, rho=1, theta=np.pi / 180, threshold=80, minLineLength=3000, maxLineGap=50
    )

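    # HoughLinesP returns None when it finds no segments
    if lines is not None:
        for line in lines:
            x1, y1, x2, y2 = map(int, line[0])
            # To keep only horizontal lines, filter with `if y1 == y2`;
            # to keep only vertical lines, filter with `if x1 == x2`
            cv2.line(img, (x1, y1), (x2, y2), (0, 0, 0), 1)
            # The threshold below is computed from gray, so draw there too
            cv2.line(gray, (x1, y1), (x2, y2), 0, 1)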

    if process_background:
        threshold = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, blocksize, c
        )
    else:
        threshold = cv2.adaptiveThreshold(
            np.invert(gray),
            255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blocksize,
            c,
        )
    return img, threshold

camelot.parsers.lattice.adaptive_threshold = my_threshold

tables = camelot.read_pdf("data.pdf", pages="all")

tables[0].df
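The assignment above replaces `adaptive_threshold` for every later `camelot.read_pdf` call in the same session. If that is not wanted, one option is to scope the patch with a small context manager. This is only a sketch: `patched` is a hypothetical helper (the standard library's `unittest.mock.patch.object` does a similar job), and the demo uses a stand-in object rather than camelot itself.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def patched(module, name, replacement):
    """Temporarily set module.<name> to replacement, then restore it."""
    original = getattr(module, name)
    setattr(module, name, replacement)
    try:
        yield
    finally:
        # Restore the original even if the body raised an exception.
        setattr(module, name, original)


# Stand-in for camelot.parsers.lattice, used only for this demo.
demo = SimpleNamespace(adaptive_threshold="original")

with patched(demo, "adaptive_threshold", "patched"):
    print(demo.adaptive_threshold)  # patched
print(demo.adaptive_threshold)  # original

# Hypothetical usage with the objects from the program above:
# with patched(camelot.parsers.lattice, "adaptive_threshold", my_threshold):
#     tables = camelot.read_pdf("data.pdf", pages="all")
```

PDFs without dotted lines can then be read outside the `with` block using camelot's unmodified behavior.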

Before applying the patch

Because the dotted lines are not detected, the rows they should separate are merged vertically.

Screenshot_2020-11-04 Google Colaboratory(1).png

After applying the patch

Screenshot_2020-11-04 Google Colaboratory.png

Process the dotted line as a solid line with camelot (Hough transform)