camelot handles dotted ruled lines poorly and often fails to split such tables correctly. When I looked into it, I found the reference article below.
camelot detects table lines with OpenCV, so the page image can be modified before detection. I detected the dotted lines with a Hough transform, redrew them as solid lines, and the extraction worked.
[Process the dotted line as a solid line with camelot](https://needtec.sakura.ne.jp/wod07672/2020/05/03/camelot%e3%81%a7%e7%82%b9%e7%b7%9a%e3%82%92%e5%ae%9f%e7%b7%9a%e3%81%a8%e3%81%97%e3%81%a6%e5%87%a6%e7%90%86%e3%81%99%e3%82%8b/)
I will use the PDF with dotted lines that accompanies that article:
https://github.com/mima3/yakusyopdf/blob/master/20200502/%E5%85%B5%E5%BA%AB%E7%9C%8C.pdf
Line detection with OpenCV's Hough transform
Straight line extraction with Hough transform
Extract only horizontal straight lines by Hough transform
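The reason a Hough transform can turn a dotted line into a solid one is the `maxLineGap` parameter of `cv2.HoughLinesP`: points on the same line are linked into one segment when the gap between them is small enough, and `minLineLength` then discards segments that are too short. The sketch below is only a simplified one-dimensional model of that idea, not OpenCV's actual algorithm; `bridge_gaps` and its parameters are names made up for this illustration.

```python
def bridge_gaps(segments, max_gap, min_length=0):
    """Merge 1-D segments (start, end) on the same line when the gap is <= max_gap.

    Simplified model of the maxLineGap / minLineLength idea; hypothetical helper,
    not part of OpenCV or camelot.
    """
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] <= max_gap:
            # Gap is small enough: extend the previous span
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap is too large (or this is the first segment): start a new span
            merged.append([start, end])
    # Drop spans shorter than min_length
    return [tuple(span) for span in merged if span[1] - span[0] >= min_length]


# A dotted line: 5-pixel dots separated by 10-pixel gaps along one row
dots = [(x, x + 5) for x in range(0, 150, 15)]

print(bridge_gaps(dots, max_gap=50))       # the dots merge into a single span
print(len(bridge_gaps(dots, max_gap=5)))   # the dots stay separate
```

With a gap tolerance larger than the spacing of the dots, the whole dotted line is reported as one long segment, which is what makes it possible to redraw it as a solid line.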
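```python
import cv2
import numpy as np
import camelot


# Create the patch: a replacement for camelot's adaptive_threshold
# that redraws dotted lines as solid lines before thresholding
def my_threshold(imagename, process_background=False, blocksize=15, c=-2):
    img = cv2.imread(imagename)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Detect long straight lines; maxLineGap lets the gaps in a dotted line be bridged
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    lines = cv2.HoughLinesP(
        edges, rho=1, theta=np.pi / 180, threshold=80, minLineLength=3000, maxLineGap=50
    )
    # HoughLinesP returns None when no line is found
    if lines is not None:
        for line in lines:
            x1, y1, x2, y2 = line[0]
            # To keep only horizontal lines, filter on y1 == y2;
            # to keep only vertical lines, filter on x1 == x2
            cv2.line(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 0, 0), 1)
        # Recompute the grayscale image so the thresholding below sees the drawn lines
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    if process_background:
        threshold = cv2.adaptiveThreshold(
            gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, blocksize, c
        )
    else:
        threshold = cv2.adaptiveThreshold(
            np.invert(gray),
            255,
            cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,
            blocksize,
            c,
        )
    return img, threshold


# Patch camelot so that the lattice parser uses the function above
camelot.parsers.lattice.adaptive_threshold = my_threshold

tables = camelot.read_pdf("data.pdf", pages="all")
tables[0].df
```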
When the dotted lines are not detected, the cells they separate come out joined vertically in the extracted table.
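The code above redraws every segment that the Hough transform returns. If only horizontal (or only vertical) segments should be redrawn, the output can be filtered by comparing the endpoints of each segment. The following is a minimal sketch, assuming input shaped like the result of `cv2.HoughLinesP` (a sequence of `[[x1, y1, x2, y2]]` entries, or `None`); `split_by_orientation` and `tol` are names invented for this example.

```python
def split_by_orientation(lines, tol=0):
    """Split HoughLinesP-style output into horizontal and vertical segments.

    Hypothetical helper for illustration; `tol` is the allowed deviation in pixels.
    """
    horizontal, vertical = [], []
    if lines is None:
        # Nothing was detected
        return horizontal, vertical
    for line in lines:
        x1, y1, x2, y2 = (int(v) for v in line[0])
        if abs(y1 - y2) <= tol:
            # Same y at both ends: horizontal
            horizontal.append((x1, y1, x2, y2))
        elif abs(x1 - x2) <= tol:
            # Same x at both ends: vertical
            vertical.append((x1, y1, x2, y2))
        # Anything else is diagonal and is ignored
    return horizontal, vertical


sample = [[[0, 10, 3000, 10]], [[50, 0, 50, 2000]], [[0, 0, 100, 100]]]
horizontal, vertical = split_by_orientation(sample)
print(horizontal)  # [(0, 10, 3000, 10)]
print(vertical)    # [(50, 0, 50, 2000)]
```

Inside `my_threshold`, the loop could then iterate over `horizontal` only and pass each tuple's endpoints to `cv2.line`. A small nonzero `tol` may help if the scanned page is slightly skewed.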