Oops! The bite is good and it seems that 2 wolves survive, so I will CO the black cat pillar!
When it comes to saving data, I lose track of how to handle commas, so isn't JSON more stable than CSV?! (I don't really know...) I wonder if someday I'll be able to say "You don't need spreadsheets or Excel at all!"...?
So I made a follow-up to last time, and here it is.
- Part1: https://qiita.com/Jessica_nao_/items/b9f38a4413e424e3e585 Practice fetching and cleaning up the source of a single page!
- Part2: https://qiita.com/Jessica_nao_/items/76efc0b99ff18c0e6bd7 I used Chrome WebDriver to click an element in the page and move between pages! The data was saved as CSV.
- Part3: This page! I figured JSON would cope even if the data contains ":" or ",", so I saved it as a JSON list of lists! I think it turned out nicer because I also tidied the code up a little.
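To make the reason for the switch concrete, here is a minimal sketch showing that a value containing "," or ":" breaks a hand-joined comma-separated line but survives a JSON round trip. The sample row is made up for illustration and is not data from the scraped site.

```python
import json

# Hypothetical row: the second value contains both a comma and a colon.
row = ["Alice", "time: 10:30, room A", "42"]

# Naive CSV: joining with commas makes the embedded comma look like a separator.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")

# JSON: strings are quoted, so the value comes back unchanged.
json_line = json.dumps(row, ensure_ascii=False)
restored = json.loads(json_line)

print(len(naive_fields))  # 4 fields instead of the original 3
print(restored == row)    # True
```

Python's `csv` module also quotes such values correctly, so the problem is with hand-written comma joining rather than with CSV as a format.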
I will explain the functions in order from the top!
- `mkFN`: A function that builds a file name!
- `gettxt2`: Saves the page source to text files while navigating the web pages! Running it once visits every page I want, in order. This site has an element called "Next", so I repeat "click it → get the source of the page that appears"!
- `trimming`: Cuts the part I need out of each saved file and saves it to another txt file!
- `removeTagForJSON`: Formats the text as JSON! It removes the tags and adds the brackets and quotes for each row. Making the fine adjustments so that the last list doesn't end with a comma was a pain (crying).
- `addBlackets`: Adds the [ ] brackets at the beginning and the end! I redid this quite a few times together with the fine adjustments above! lol
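As a rough alternative to the manual comma and bracket adjustments described for `removeTagForJSON` and `addBlackets`, the rows could be collected into Python lists and handed to `json.dumps`, which places the commas and brackets itself. This is only a sketch: `rows_from_html` and the sample HTML are hypothetical, and it assumes each row is a simple `<tr>` of `<td>` cells like the table in this article.

```python
import json
import re


def rows_from_html(html):
    """Return a list of rows, each a list of cell texts (hypothetical helper)."""
    rows = []
    # Each <tr>...</tr> is one row; DOTALL lets a row span several lines.
    for tr in re.findall(r"<tr.*?>(.*?)</tr>", html, flags=re.DOTALL):
        cells = re.findall(r"<td.*?>(.*?)</td>", tr, flags=re.DOTALL)
        # Drop any tags left inside a cell.
        rows.append([re.sub(r"<.*?>", "", cell) for cell in cells])
    return rows


# Made-up sample standing in for one trimmed page.
sample = (
    '<tr class="jsgrid-row"><td class="a">1</td><td>say "hi", ok</td></tr>'
    '<tr class="jsgrid-row"><td>2</td><td><b>x:y</b></td></tr>'
)
rows = rows_from_html(sample)
print(json.dumps(rows, ensure_ascii=False, indent=1))
```

A side effect of going through `json.dumps` is that double quotes inside a cell are escaped automatically, which plain string replacement does not do.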
sample.py
import re
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# I thought it would be nicer to open the result in Excel, so I tried Shift_JIS once,
# but I got an error (maybe some characters couldn't be encoded?), so I gave up.
mojicode = "utf8"


def mkFN(cnt, typeindex):
    # [prefix, extension] for each kind of file
    types = [
        ["sample_", ".txt"],
        ["trimmed_", ".txt"],
        ["fin_", ".csv"],
        ["JSON_fin_", ".json"],
    ]
    cntstr = str(cnt)
    # Zero-pad single-digit page numbers so the files sort in order
    if len(cntstr) == 1:
        cntstr = "0" + cntstr
    ans = "data/"
    ans += types[typeindex][0] + cntstr + types[typeindex][1]
    return ans


def gettxt2(cnt):
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"
    # Selenium 4 takes the driver path through a Service object
    driver = webdriver.Chrome(service=Service(path))
    driver.get(url)
    sleep(3)
    output = driver.page_source
    filename = mkFN(0, 0)
    with open(filename, "w", encoding=mojicode) as f:
        f.write(output)
    print(filename + ": done.")
    # I don't know why, but it seems I have to read it once more here?
    output = driver.page_source
    sleep(3)
    for i in range(1, cnt):
        # Selenium 4 replaced find_element_by_link_text with find_element(By.LINK_TEXT, ...)
        element = driver.find_element(By.LINK_TEXT, "Next")
        element.click()
        sleep(3)
        output = driver.page_source
        filename = mkFN(i, 0)
        with open(filename, "w", encoding=mojicode) as f:
            f.write(output)
        print(filename + ": done.")
    # Close the browser once every page has been saved
    driver.quit()


def trimming(cnt):
    filename = mkFN(cnt, 0)
    filename2 = mkFN(cnt, 1)
    # Read with the same encoding the file was written with
    with open(filename, encoding=mojicode) as f:
        contents = f.read()
    # Start and end markers of the part to keep
    regexen = [
        r'<tbody><tr class="jsgrid-row">',
        r'</table></div><div class="sample"',
    ]
    # The plural of index is indices
    indices = [0, 0]
    for i in range(0, 2):
        matchObj = re.search(regexen[i], contents)
        indices[i] = matchObj.start()
    rslt = contents[indices[0]:indices[1]]
    with open(filename2, "w", encoding=mojicode) as f2:
        f2.write(rslt)


def removeTagForJSON(cnt):
    # [pattern, replacement] pairs, applied in order
    beforeAfter = [
        [r'<tr.*?><td.*?>', '\t["'],   # start of a row
        [r'</td><td.*?>', '","'],      # boundary between cells
        [r'</td></tr>', '"],\n'],      # end of a row
        [r'<.*?>', ""],                # any remaining tags
    ]
    with open(mkFN(cnt, 1), encoding=mojicode) as f:
        contents = f.read()
    for i in range(0, 4):
        contents = re.sub(beforeAfter[i][0], beforeAfter[i][1], contents)
    # Overwrite on the first page, append on the following pages
    option = "a"
    if cnt == 0:
        option = "w"
    with open(mkFN("all", 1), option, encoding=mojicode) as f:
        f.write(contents)


def addBlackets():
    with open(mkFN("all", 1), encoding=mojicode) as f:
        contents = f.read()
    # Drop the trailing ",\n" after the last row, then wrap everything in [ ]
    contents = "[\n" + contents[0:-2] + '\n]'
    option = "w"
    with open(mkFN("all2", 3), option, encoding=mojicode) as f:
        f.write(contents)


cnt = 20
gettxt2(cnt)
sleep(2)
for i in range(0, cnt):
    trimming(i)
print("trimming: done!")
for i in range(0, cnt):
    removeTagForJSON(i)
sleep(1)
addBlackets()
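
For reference, the zero-padding inside `mkFN` can also be written with `str.zfill`. The following is a sketch of an equivalent helper, not something the script above depends on; the name `make_filename` is hypothetical, while the prefixes and extensions are copied from `mkFN`.

```python
# [prefix, extension] pairs copied from mkFN.
TYPES = [
    ("sample_", ".txt"),
    ("trimmed_", ".txt"),
    ("fin_", ".csv"),
    ("JSON_fin_", ".json"),
]


def make_filename(cnt, typeindex):
    """Build the same path as mkFN (hypothetical helper)."""
    prefix, ext = TYPES[typeindex]
    # zfill(2) pads "3" to "03" and leaves "12" or "all" unchanged.
    return "data/" + prefix + str(cnt).zfill(2) + ext


print(make_filename(3, 0))      # data/sample_03.txt
print(make_filename("all", 1))  # data/trimmed_all.txt
```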
This time it was only about 20 pages, so I created a file for each page at each step, but if I move on to something like 8000 pages, it looks like it will get tough unless I shape the data properly before writing it out! lol But as long as I don't go to 100100 pages, maybe it's okay to leave it as it is?
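
One way to avoid a pile of intermediate files when the page count grows might be to turn each page into rows right away and append them to a single file, one JSON list per line. This is a sketch under assumptions: `append_rows` and `read_rows` are hypothetical helpers, the page data is a stand-in, and the one-list-per-line layout (often called JSON Lines) is a different output format from the single JSON array built above.

```python
import json
import os
import tempfile


def append_rows(path, rows):
    """Append each row as one JSON list per line (hypothetical helper)."""
    with open(path, "a", encoding="utf8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def read_rows(path):
    """Read the file back into a list of rows, skipping blank lines."""
    with open(path, encoding="utf8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in for the rows extracted from two pages.
pages = [[["1", "a,b"]], [["2", "c:d"], ["3", "e"]]]
path = os.path.join(tempfile.mkdtemp(), "rows.jsonl")
for page_rows in pages:
    append_rows(path, page_rows)

all_rows = read_rows(path)
print(len(all_rows))
```

Because every line is complete on its own, there is no trailing comma to trim at the end and no closing bracket to add.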