Related post: Processing dotted lines as solid lines with camelot (Hough transform) https://qiita.com/barobaro/items/af850ac29dbc983eb39b
After all, camelot is not good at extracting tables whose ruling lines are not solid. pdfplumber seems to extract them easily.
Go To EAT official site, Shiga Prefecture: the characters are not recognized, but the table can be extracted with camelot.
wget https://www.mhlw.go.jp/content/000691131.pdf -O data.pdf
pip install pdfplumber
import pdfplumber
import pandas as pd

with pdfplumber.open("data.pdf") as pdf:
    dfs = []
    for page in pdf.pages:
        data = page.extract_table()
        # Row 1 holds the column names; data rows start at row 2
        df_tmp = pd.DataFrame(data[2:], columns=data[1])
        dfs.append(df_tmp)

df = pd.concat(dfs)
df.to_csv("hyogo.csv", encoding="utf_8_sig")
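The loop above assumes that every page yields a table and that its second row is the header. `page.extract_table()` returns `None` when it finds no table on a page, so a slightly more defensive version of the same slicing may be useful. The following is only a sketch in plain Python: the helper name `split_tables` and the sample rows are hypothetical and do not come from the PDF.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]


def split_tables(
    tables: List[Optional[List[Row]]], header_index: int = 1
) -> Tuple[Row, List[Row]]:
    """Collect data rows from per-page tables that repeat the same header.

    Each item in `tables` stands for one page's extract_table() result.
    Pages with no table (None) or with too few rows are skipped.
    """
    header: Row = []
    rows: List[Row] = []
    for table in tables:
        if not table or len(table) <= header_index:
            continue
        if not header:
            # Take the column names from the first usable page
            header = table[header_index]
        # Everything after the header row is data
        rows.extend(table[header_index + 1:])
    return header, rows


# Hypothetical sample: two pages with tables and one page without
pages = [
    [["Title", None], ["name", "tel"], ["A", "1"], ["B", "2"]],
    None,
    [["Title", None], ["name", "tel"], ["C", "3"]],
]
header, rows = split_tables(pages)
```

With real data, `pages` would be `[page.extract_table() for page in pdf.pages]`, and `pd.DataFrame(rows, columns=header)` would build the combined frame.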
https://www.chiba-gte.jp/downloads/store_list.pdf
wget https://www.chiba-gte.jp/downloads/store_list.pdf -O data.pdf
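import pdfplumber
import pandas as pd

with pdfplumber.open("data.pdf") as pdf:
    dfs = []
    for page in pdf.pages:
        data = page.extract_table()
        df_tmp = pd.DataFrame(data)
        dfs.append(df_tmp)

df = pd.concat(dfs)

# Treat empty strings as missing, then keep rows with at least 4 filled cells
df1 = df.mask(df.isna() | (df == "")).dropna(thresh=4)
# Drop the header row repeated on each page (match the header text as it appears in the PDF)
df2 = df1[df1[0] != "paper"].reset_index(drop=True)
# set_axis returns a new DataFrame; its inplace argument was removed in pandas 2.0
df2 = df2.set_axis(["paper", "Electronic", "Store name", "Street address", "TEL"], axis=1)
# Number the rows from 1
df2.index += 1
df2.to_csv("data.csv")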
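To make the filtering step easier to follow, here is a rough plain-Python equivalent of the `mask(...)`, `dropna(thresh=4)` and header-removal lines. It is a sketch under assumptions: the function name `keep_filled_rows` and the sample rows are hypothetical, and `"paper"` stands for whatever text the first header cell actually contains in the PDF.

```python
from typing import List, Optional

Row = List[Optional[str]]


def keep_filled_rows(
    rows: List[Row], thresh: int = 4, header_label: str = "paper"
) -> List[Row]:
    """Keep rows with at least `thresh` non-empty cells, minus repeated headers."""
    kept: List[Row] = []
    for row in rows:
        # Treat None and "" alike, as the mask() call does
        cleaned = [cell if cell not in (None, "") else None for cell in row]
        filled = sum(cell is not None for cell in cleaned)
        # dropna(thresh=4) keeps rows with 4 or more non-missing values
        if filled >= thresh and cleaned[0] != header_label:
            kept.append(cleaned)
    return kept


# Hypothetical sample rows, not taken from the PDF
sample = [
    ["paper", "Electronic", "Store name", "Street address", "TEL"],
    ["o", "o", "Cafe A", "1-1 Example", "000-000-0000"],
    ["", None, "Section heading", "", ""],
    ["o", "", "Shop B", "2-2 Example", "000-000-0001"],
]
result = keep_filled_rows(sample)
```

Here the header row and the sparse "Section heading" row are dropped, while the row for "Shop B" survives because four of its five cells are filled.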