Introduction

Hi there! It's Jesse (from Human Wolf J ~ ♪)! The continuation of my last post is finished, so I'm publishing it! If you like, start with the previous one (https://qiita.com/Jessica_nao_/items/b9f38a4413e424e3e585)!

What I wanted to do

There is only one URL, but the table behind it has 20 pages of 100 items each, and I wanted to extract all of that data in one go. So I made the Selenium WebDriver do the hard work! Oh, and as the tag says, I'm using Python 3!

Crawling x scraping with Python and Selenium!

--Part1: https://qiita.com/Jessica_nao_/items/b9f38a4413e424e3e585 Practice fetching the source of a single page and cleaning it up!

--Part2: This page! I used the Chrome WebDriver to click an element on the page and move to the next one! The data was saved as CSV.

--Part3: https://qiita.com/Jessica_nao_/items/140b435a9e13054ed78e I figured JSON would cope better when the data contains ":" or ",", so I saved it as a JSON list of lists! I also tidied the code up a little, so I think it turned out nicely.

Contents

--First, I prepared a function mkFN that creates file names.

--Next, save the web page source to text files. The first page behaves differently from the second and later pages, so I wrote the whole loop in a single function, gettxt2! On the page I wanted to gather information from, pressing the link labeled "Next" takes you to the next page!

--Then, trimming.

--Finally, I saved everything to a CSV file! I had to take care to start a new line for each item and for each loaded file!
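
To make the trimming and tag-removal steps a little more concrete, here is a minimal sketch that applies the same three substitutions used in removeTag to a tiny table fragment. The fragment itself is made up for illustration (the real page source is much longer), so treat this as a sketch under that assumption.

```python
import re

# Made-up fragment shaped like the jsgrid rows this article targets.
html = (
    '<tr class="jsgrid-row"><td class="a">1</td><td>Alice</td></tr>'
    '<tr class="jsgrid-row"><td>2</td><td>Bob</td></tr>'
)

# Same order as in removeTag: row ends become newlines, cell ends become
# commas, and then every remaining tag is dropped.
replacements = [
    (r"</tr>", "\n"),
    (r"</td>", ","),
    (r"<.*?>", ""),
]

text = html
for pattern, repl in replacements:
    text = re.sub(pattern, repl, text)

print(text)
```

Every row comes out with a trailing comma. In the real script the trimmed text stops just before the last closing cell tag, which is why removeTag appends ",\n" by hand at the end of each file.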

Reflection

I think it was pretty smart to have a function to create a file name! Lol
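
For reference, the same idea can probably be written a bit more compactly. The sketch below is a hypothetical variant, not the function used in this article: the names make_filename and FILE_TYPES and the directory parameter are my own assumptions.

```python
# Hypothetical compact variant of mkFN; make_filename and FILE_TYPES are
# illustrative names and do not appear in the original script.
FILE_TYPES = [
    ("sample_", ".txt"),   # raw page source
    ("trimmed_", ".txt"),  # table part only
    ("fin_", ".csv"),      # final CSV
]

def make_filename(cnt, typeindex, directory="data"):
    prefix, suffix = FILE_TYPES[typeindex]
    # zfill pads short numbers to two digits and leaves longer
    # strings such as "all" unchanged.
    return f"{directory}/{prefix}{str(cnt).zfill(2)}{suffix}"

print(make_filename(3, 0))
print(make_filename(12, 1))
print(make_filename("all", 2))
```

For these inputs it builds the same paths as mkFN, for example data/sample_03.txt for page 3.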

The file-writing part also shows up many times, so for that as well, I think something like

`sample.py`


 def mkFile():

would have been better: I should have moved the writing process below into it and split things up into functions.
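
As a rough sketch of that idea, a file-writing helper could look like the following. Only the name mkFile comes from this article; the parameters (filename, contents, option, encoding) are assumptions of mine, and the temporary directory is there just so the example can run anywhere.

```python
import os
import tempfile

def mkFile(filename, contents, option="w", encoding="utf8"):
    """Write contents to filename and report when it is done."""
    with open(filename, option, encoding=encoding) as f:
        f.write(contents)
    print(filename + ": done.")

# Usage: overwrite for the first page, append for the later ones.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "fin_all.csv")
    mkFile(target, "1,Alice,\n")              # first page: "w"
    mkFile(target, "2,Bob,\n", option="a")    # later pages: "a"
    with open(target, encoding="utf8") as f:
        result = f.read()
```

With a helper like this, each of the repeated open/write/print blocks in the script would shrink to a single call.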

Also, I think this is probably where people stumble the most: the page takes some time to load, so be sure to pause before reading it! Like this ↓↓

`sample.py`


 sleep(3)

Don't forget to import sleep from time first, since you need it for the pause!
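
A fixed pause works, but it waits the full three seconds even when the page is ready sooner. One alternative is to poll until a condition becomes true. The sketch below is a library-free illustration of that idea; wait_until and fake_page_ready are hypothetical names, and the fake check only stands in for a real "has the page loaded?" test. (Selenium also ships its own WebDriverWait in selenium.webdriver.support.ui for this purpose.)

```python
from time import monotonic, sleep

def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it is truthy or timeout seconds have passed."""
    deadline = monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if monotonic() >= deadline:
            raise TimeoutError("condition was not met within the timeout")
        sleep(interval)

# Stand-in for "the page has loaded": becomes true on the third check.
calls = {"n": 0}

def fake_page_ready():
    calls["n"] += 1
    return calls["n"] >= 3

wait_until(fake_page_ready, timeout=5, interval=0.01)
print(calls["n"])
```

With a real driver, the condition might be something like checking whether "jsgrid-row" appears in driver.page_source, though I have not tried that on the page used here.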

Deliverables

`sample.py`


import re
from selenium import webdriver
from time import sleep

#I wanted to open the result in Excel, so I tried Shift_JIS once,
#but I got an error (maybe some characters can't be represented?), so I gave up and used UTF-8.
mojicode = "utf8"

def mkFN(cnt,typeindex):
    types = [
        ["sample_", ".txt"],
        ["trimmed_", ".txt"],
        ["fin_", ".csv"],
    ]
    cntstr = str(cnt)
    if len(cntstr) == 1:
        cntstr = "0" + cntstr
    ans = "data/"
    ans += types[typeindex][0] + cntstr + types[typeindex][1]
    return ans

def gettxt2(cnt):
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"

    #Selenium 3 style call. Newer Selenium 4 releases expect the driver path
    #to be passed through a Service object instead.
    driver = webdriver.Chrome(path)
    driver.get(url)
    #Wait for the first page to finish loading
    sleep(3)
    output = driver.page_source
    filename = mkFN(0,0)
    with open(filename,"w",encoding=mojicode) as f:
        f.write(output)
    print(filename + ": done.")

    #From the second page on: click "Next", wait, then save the source
    for i in range(1,cnt):
        #Newer Selenium 4 releases removed this method; the equivalent there is
        #driver.find_element(By.LINK_TEXT, "Next")
        element = driver.find_element_by_link_text("Next")
        element.click()
        sleep(3)
        output = driver.page_source

        filename = mkFN(i,0)
        with open(filename,"w",encoding=mojicode) as f:
            f.write(output)
        print(filename + ": done.")

    #Close the browser once every page has been saved
    driver.quit()

def trimming(cnt):
    filename = mkFN(cnt,0)
    filename2 = mkFN(cnt,1)
    with open(filename,encoding=mojicode) as f:
        contents = f.read()
    regexen = [
        r'<tbody><tr class="jsgrid-row">',
        r'</td></tr></tbody></table></div><div class="jsgrid-pager-container"',
    ]
    #The plural of index is indices
    indices = [0,0]

    for i in range(0,2):
        matchObj = re.search(regexen[i],contents)
        indices[i] = matchObj.start()
    rslt = contents[indices[0]:indices[1]]

    with open(filename2,"w",encoding=mojicode) as f2:
        f2.write(rslt)

def removeTag(cnt):

    beforeAfter = [
        [r'</tr>', "\n"],
        [r'</td>', ","],
        [r'<.*?>', ""],
    ]

    with open(mkFN(cnt,1),encoding=mojicode) as f:
        contents = f.read()

    for i in range(0,3):
        contents = re.sub(beforeAfter[i][0],beforeAfter[i][1],contents)

    #Add commas and line breaks at the end of the file!
    contents += ",\n"
    
    option = "a"
    if cnt == 0:
        option = "w"
    with open(mkFN("all",2),option,encoding=mojicode) as f:
        f.write(contents)


cnt = 20
gettxt2(cnt)

print("gettxt: done!")
sleep(1)


for i in range(0,cnt):
    trimming(i)
print("trimming: done!")

sleep(1)

for i in range(0,cnt):
    removeTag(i)
print("removeTag: done!")
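
One thing to watch out for: removeTag simply joins cells with commas, so a cell that itself contains a comma would shift the columns. As a hedged sketch (with made-up rows, and not what this article's script does), Python's csv module quotes such cells automatically:

```python
import csv
import io

# Made-up rows for illustration only.
rows = [
    ["1", "Alice", "Tokyo, Japan"],  # cell containing a comma
    ["2", "Bob", "10:30"],           # cell containing a colon
]

# Write to an in-memory buffer so the example needs no files.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back recovers the original cells, comma included.
parsed = list(csv.reader(io.StringIO(csv_text)))
```

Using this in the real script would mean collecting the cells as lists first instead of replacing tags in one big string, which is a bigger change than the regex approach above.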

In closing

After wrestling with this for a few hours, I couldn't stop watching the data pile up in just a few minutes at the end! Lol. I hope I get a bit more used to it and can put something like this together in about 30 minutes.

References

--About writing files: https://www.javadrive.jp/python/file/index3.html#section3

--Find the element on the page and click!: https://www.seleniumqref.com/api/python/element_get/Python_find_element_by_link_text.html

[Part.2] Crawling with Python! Click the web page to move!
