Hello! I'm Jesse (from Human Wolf J ~ ♪), and lately it's been nothing but 15 reverse cats for me! The continuation of my last post is finished, so I'm making it public! If you like, start with the previous one (https://qiita.com/Jessica_nao_/items/b9f38a4413e424e3e585)!
There is only one URL, but I wanted to extract all the data from a table spread over 20 pages of 100 items each, in one go. So I had the Selenium WebDriver do the hard work for me! Oh, and as the tag says, I'm using Python3!
--Part2: This page! I used the Chrome WebDriver to click an element on the page and move to the next page! The data was saved as csv ~.
--First, I prepared a function mkFN that creates a file name.
--Then I saved the web page source to a text file. This is the only step that behaves differently for the first page versus the second and later pages, so I wrote the whole loop in one function, gettxt2! On the page I wanted to gather information from, pressing the link that says "Next" takes you to the next page!
--Next, trimming.
--Finally, I saved everything to a csv file! I had to be careful to start a new line after each item and after each loaded file!
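To make the trimming and csv steps a bit more concrete, here is a small sketch that applies the same three regex replacements as the script to a made-up table fragment. The sample HTML string and the function name `rows_to_csv` are assumptions for illustration only; the real page source looks different.

```python
import re

# The same three replacements as in the script, applied in this order
REPLACEMENTS = [
    (r"</tr>", "\n"),  # end of a row  -> line break
    (r"</td>", ","),   # end of a cell -> comma
    (r"<.*?>", ""),    # any other tag -> removed
]


def rows_to_csv(html):
    """Turn a fragment of table rows into comma-separated lines."""
    for pattern, replacement in REPLACEMENTS:
        html = re.sub(pattern, replacement, html)
    return html


# Hypothetical fragment shaped like two table rows
sample = (
    '<tr class="jsgrid-row"><td>1</td><td>apple</td></tr>'
    "<tr><td>2</td><td>banana</td></tr>"
)
print(rows_to_csv(sample))
```

Each row ends with a trailing comma because every closing `</td>` becomes a comma, including the last one in the row.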
I think it was pretty smart to have a function to create a file name! Lol
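For reference, the zero padding could also be left to a format spec. This is only a sketch of an alternative, not the function used in this post; the name `make_filename` and the `directory` parameter are made up for the example.

```python
# Hypothetical variant of mkFN that uses an f-string format spec for zero padding
TYPES = [
    ("sample_", ".txt"),   # 0: raw page source
    ("trimmed_", ".txt"),  # 1: table part only
    ("fin_", ".csv"),      # 2: final csv
]


def make_filename(cnt, typeindex, directory="data"):
    prefix, suffix = TYPES[typeindex]
    # Pad integers to two digits; leave strings such as "all" untouched
    label = f"{cnt:02d}" if isinstance(cnt, int) else str(cnt)
    return f"{directory}/{prefix}{label}{suffix}"


print(make_filename(3, 0))
print(make_filename(12, 1))
print(make_filename("all", 2))
```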
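The part that writes a file also shows up many times, so I think this should have been a function too:
sample.py
def mkFile():
    ...  # the file-writing process from the script below would go here
It would have been better to move the write process from the script below into it and split the script into functions.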
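One way such a helper might look is sketched below. The name `write_text`, its parameters, and the step that creates the folder are my own assumptions for the example, not part of the script in this post.

```python
import os
import tempfile


def write_text(filename, contents, mode="w", encoding="utf8"):
    """Write contents to filename, creating the parent folder if needed."""
    folder = os.path.dirname(filename)
    if folder:
        os.makedirs(folder, exist_ok=True)
    with open(filename, mode, encoding=encoding) as f:
        f.write(contents)
    print(filename + ": done.")


# Usage in a throwaway folder: overwrite first, then append
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, "data", "sample_00.txt")
write_text(demo_file, "first page\n")
write_text(demo_file, "second page\n", mode="a")
```

With a helper like this, each `with open(...)` block in the script could shrink to a single call.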
Also, this is probably where people stumble the most: the page takes some time to load, so be sure to give it a break! That is this ↓↓
sample.py
sleep(3)
Since you're taking a break, don't forget to import sleep from time first!
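A fixed sleep(3) is what I used, but a different technique is to poll until the page looks ready, which avoids waiting longer than needed. The sketch below uses only the standard library; `wait_until` and `page_ready` are hypothetical names, and the "page" here is simulated by a counter rather than a real browser.

```python
from time import monotonic, sleep


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        sleep(interval)


# Stand-in for "is the page ready?": becomes true on the third check
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


wait_until(page_ready, timeout=5.0, interval=0.1)
print(checks["count"])
```

With a real driver, the condition might be something like `lambda: "jsgrid-row" in driver.page_source`, assuming that string only appears once the table has loaded. Selenium also ships its own WebDriverWait class for this purpose.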
sample.py
import re
from selenium import webdriver
from time import sleep

# I thought the csv would be easier to open in Excel, so I tried Shift_JIS once,
# but I got an error (maybe some characters cannot be encoded?), so I gave up.
mojicode = "utf8"

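def mkFN(cnt, typeindex):
    # Build a file name such as data/sample_03.txt from a page number and a file type
    types = [
        ["sample_", ".txt"],   # 0: raw page source
        ["trimmed_", ".txt"],  # 1: table part only
        ["fin_", ".csv"],      # 2: final csv
    ]
    cntstr = str(cnt)
    # Zero-pad single-digit page numbers
    if len(cntstr) == 1:
        cntstr = "0" + cntstr
    ans = "data/"
    ans += types[typeindex][0] + cntstr + types[typeindex][1]
    return ans
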
def gettxt2(cnt):
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"
    # Selenium 3 style calls; Selenium 4 passes the driver path through a Service
    # object and uses find_element(By.LINK_TEXT, "Next") instead
    driver = webdriver.Chrome(path)
    driver.get(url)
    # Wait for the first page to finish loading
    sleep(3)
    output = driver.page_source
    filename = mkFN(0, 0)
    with open(filename, "w", encoding=mojicode) as f:
        f.write(output)
    print(filename + ": done.")
    # Does it seem like you have to initialize it again?
    output = driver.page_source
    sleep(3)
    # From the second page on: click "Next", wait, then save the source
    for i in range(1, cnt):
        element = driver.find_element_by_link_text("Next")
        element.click()
        sleep(3)
        output = driver.page_source
        filename = mkFN(i, 0)
        with open(filename, "w", encoding=mojicode) as f:
            f.write(output)
        print(filename + ": done.")
    # Close the browser when all pages are saved
    driver.quit()

def trimming(cnt):
    # Cut the saved page source down to the table rows
    filename = mkFN(cnt, 0)
    filename2 = mkFN(cnt, 1)
    # Read with the same encoding the file was written in
    with open(filename, encoding=mojicode) as f:
        contents = f.read()
    # Start and end markers of the part to keep
    regexen = [
        r'<tbody><tr class="jsgrid-row">',
        r'</td></tr></tbody></table></div><div class="jsgrid-pager-container"',
    ]
    # The plural of index is indices
    indices = [0, 0]
    for i in range(0, 2):
        matchObj = re.search(regexen[i], contents)
        indices[i] = matchObj.start()
    rslt = contents[indices[0]:indices[1]]
    with open(filename2, "w", encoding=mojicode) as f2:
        f2.write(rslt)

def removeTag(cnt):
    # Applied in order: row ends, cell ends, then every remaining tag
    beforeAfter = [
        [r'</tr>', "\n"],
        [r'</td>', ","],
        [r'<.*?>', ""],
    ]
    with open(mkFN(cnt, 1), encoding=mojicode) as f:
        contents = f.read()
    for i in range(0, 3):
        contents = re.sub(beforeAfter[i][0], beforeAfter[i][1], contents)
    # Add a comma and a line break at the end of the file!
    # (trimming cuts off the closing </td></tr> of the last row)
    contents += ",\n"
    # Overwrite for the first page, append for the rest
    option = "a"
    if cnt == 0:
        option = "w"
    with open(mkFN("all", 2), option, encoding=mojicode) as f:
        f.write(contents)

cnt = 20
gettxt2(cnt)
print("gettxt: done!")
sleep(1)
for i in range(0, cnt):
    trimming(i)
print("trimming: done!")
sleep(1)
for i in range(0, cnt):
    removeTag(i)
print("removeTag: done!")
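About the Shift_JIS error mentioned in the comment at the top of the script: I never tracked down the cause, but here is a sketch of two ways one might work around it, assuming the error came from characters that Shift_JIS cannot represent. The sample text and file names are made up. Option B relies on Excel generally using the byte order mark to detect UTF-8, which I have not verified for every version.

```python
import os
import tempfile

# Made-up csv text containing a character that Shift_JIS cannot encode
text = "1,りんご,\n2,coffee ☕,\n"

# Strict Shift_JIS fails on the unencodable character
try:
    text.encode("shift_jis")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print("strict Shift_JIS worked:", strict_ok)

out_dir = tempfile.mkdtemp()

# Option A: keep Shift_JIS and replace unencodable characters with "?"
path_sjis = os.path.join(out_dir, "fin_sjis.csv")
with open(path_sjis, "w", encoding="shift_jis", errors="replace", newline="") as f:
    f.write(text)

# Option B: UTF-8 with a byte order mark at the start of the file
path_bom = os.path.join(out_dir, "fin_bom.csv")
with open(path_bom, "w", encoding="utf-8-sig", newline="") as f:
    f.write(text)
```

Option A loses the characters it replaces, so Option B is probably the safer choice when the data has to stay intact.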
After wrestling with this for a few hours, I couldn't stop watching the data pile up within just a few minutes at the end! Lol I hope I get a little more used to it and can put something like this together in about 30 minutes.
--About writing files: https://www.javadrive.jp/python/file/index3.html#section3
--Finding an element on the page and clicking it: https://www.seleniumqref.com/api/python/element_get/Python_find_element_by_link_text.html