Introduction

Hi! This is Jesse. I'm still a beginner, so I'm putting these notes together to refer back to later, and in case there are other people at about the same level! Oh, and I'm also writing this because I'd be lucky if someone more experienced read it and gave me some advice! Lol

Oh, and I'm using Python 3!

Crawling x scraping with Python and Selenium!

--Part1: This page! Practice fetching the source of a single page and cleaning it up!
--Part2: https://qiita.com/Jessica_nao_/items/76efc0b99ff18c0e6bd7 I used the Chrome WebDriver to click an element on the page and navigate! The data was saved as csv.
--Part3: https://qiita.com/Jessica_nao_/items/140b435a9e13054ed78e I figured JSON would cope better when the data contains ":" or ",", so I saved it as a list of JSON lists! I also tidied things up a little, so I think it turned out well.

I'll go through what I did, in order.

Get the source from the URL

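I considered using Requests, but I wasn't sure whether it only works for sites with a public API. (It can fetch the static HTML of any page, but it does not run JavaScript.) Since I didn't know at the time, I imported and used Selenium's webdriver instead.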

If you don't have Selenium installed:
--Mac: Try running "python3 -m pip install selenium" in your terminal! (I'm not sure why, but that was all it took for me!)
--Windows: This page looks helpful! → https://www.seleniumqref.com/introduction/python/Python_Sele_Ins.html

`sample.py`


import re
from selenium import webdriver
from time import sleep

def gettxt():
    # Specify the URL of the page to fetch, the path to chromedriver,
    # and the name of the file to save the source to!
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"
    filename = "data/sample.txt"

    # Selenium 3 style. Selenium 4 expects webdriver.Chrome(service=Service(path)) instead.
    driver = webdriver.Chrome(path)
    driver.get(url)
    # Give the page time to finish loading before reading the source
    sleep(5)
    output = driver.page_source

    with open(filename, "w", encoding="utf8") as f:
        f.write(output)
    sleep(3)
    driver.quit()
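
One thing that can trip up the save step above is that open() fails if the data folder does not exist yet. Here is a minimal sketch of a helper that creates the folder first. The name save_source is made up for illustration and is not part of the original script:

```python
from pathlib import Path


def save_source(text, filename):
    """Write text to filename as UTF-8, creating parent folders if needed."""
    target = Path(filename)
    # parents=True creates intermediate folders; exist_ok=True avoids an
    # error when the folder is already there.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf8")
    return target
```

In gettxt() this could stand in for the with open(...) block, for example save_source(output, filename).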

Cut away the unnecessary parts in one go!

`sample.py`


def trimming():
    # Specify the file to read and the file to write!
    filename = "data/sample.txt"
    filename2 = "data/sample2.txt"
    # The file was saved as UTF-8, so read it back with the same encoding
    with open(filename, encoding="utf8") as f:
        cntnt = f.read()

    # Look at the page source around the part you want to keep,
    # and enter the strings where it starts and where it ends!
    regexen = [
        r'<tbody><tr class="The beginning class of the information you want">',
        r'</td></tr></tbody></table></div><div class="Next class with the information you want"',
    ]
    # The plural of index is indices
    indices = [0, 0]

    for i in range(0, 2):
        matchObj = re.search(regexen[i], cntnt)
        # re.search returns None when nothing matches, so .start() would fail here
        indices[i] = matchObj.start()
    rslt = cntnt[indices[0]:indices[1]]

    with open(filename2, "w", encoding="utf8") as f2:
        f2.write(rslt)
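
To try the trimming idea without a saved page, here is a small sketch that works on a string. The function name trim_between and the sample HTML are made up for illustration. Unlike trimming() above, it looks for the end marker only after the start marker, and it raises a readable error when a marker is missing:

```python
import re


def trim_between(text, start_pattern, end_pattern):
    """Return text from start_pattern up to (not including) end_pattern."""
    start = re.search(start_pattern, text)
    if start is None:
        raise ValueError("start pattern not found")
    # Search for the end marker only after the start marker.
    end = re.compile(end_pattern).search(text, start.end())
    if end is None:
        raise ValueError("end pattern not found")
    return text[start.start():end.start()]


html = '<div><table><tbody><tr class="row"><td>a</td></tr></tbody></table></div><div class="next">'
print(trim_between(html, r'<tbody><tr class="row">', r'</tbody></table></div><div class="next"'))
# prints: <tbody><tr class="row"><td>a</td></tr>
```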

Troubleshooting

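I got the error "TypeError: expected string or bytes-like object". It was a pain, and the cause was simply that I had forgotten the trailing () in the line below! Lol. The () is required even for functions that take no arguments. Without it you get the function itself rather than its return value. I'm still not used to that.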

`sample.py`


    matchObj.start()
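
A small sketch of what a missing () does. The sample string is made up, and the second half is just one way this TypeError message can show up, not necessarily the exact line that failed in my script:

```python
import re

text = "<td>apple</td>"
match = re.search(r"apple", text)

print(match.start)    # the method object itself, not a number
print(match.start())  # 4

# Passing an uncalled method where a string is expected raises TypeError.
not_a_string = text.upper  # forgot the ()
try:
    re.search(r"APPLE", not_a_string)
except TypeError as err:
    print("TypeError:", err)
```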

Remove the tags

First, I converted the end of each row into a line break. Next, I turned the end of each cell into a comma. Finally, I erased all the remaining tags!

I wrote it as a numeric loop over two lists, but a dictionary (map) would have been better, because it makes the correspondence between each pattern and its replacement visible! Lol

The commented part is explained again further down!

`sample.py`


def removeTag():
    filename2 = "data/sample2.txt"
    filename3 = "data/sample3.csv"

    # "." is any single character and "*" repeats it zero or more times,
    # so ".*" matches any string!
    regex0 = r'<.*>'
    # But the pattern above grabs everything from the very first < to the very last >.
    # Adding ? after the asterisk makes it stop at the nearest > instead.
    regex = r'<.*?>'

    # Patterns and their replacements, paired by position
    regexen = [
        r'</tr>',
        r'</td>',
        r'<.*?>',
    ]
    after = [
        "\n",
        ",",
        "",
    ]

    with open(filename2, encoding="utf8") as f:
        contents = f.read()

    for i in range(0, 3):
        contents = re.sub(regexen[i], after[i], contents)

    contents += ","

    with open(filename3, "w", encoding="utf8") as f:
        f.write(contents)
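
The dictionary version I mentioned could look roughly like this. It is only a sketch: the names REPLACEMENTS and strip_tags and the sample row are made up for illustration, and it relies on dictionaries keeping insertion order (Python 3.7+), because the tag-removing pattern has to run last:

```python
import re

# Pattern -> replacement, applied in insertion order.
REPLACEMENTS = {
    r"</tr>": "\n",  # end of a table row becomes a line break
    r"</td>": ",",   # end of a cell becomes a comma
    r"<.*?>": "",    # every remaining tag is dropped
}


def strip_tags(html):
    for pattern, replacement in REPLACEMENTS.items():
        html = re.sub(pattern, replacement, html)
    return html


sample = '<tr class="r"><td>apple</td><td>100</td></tr><tr><td>pear</td><td>80</td></tr>'
print(strip_tags(sample))
# prints:
# apple,100,
# pear,80,
```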

Regular expression shortest match

I had no idea, but a regular expression does not stop at the first place where the match could end! Instead, by default it grabs the longest match it can. "Default" behavior can be tricky ...

`sample.py`


    # "." is any single character and "*" repeats it zero or more times,
    # so ".*" matches any string!
    regex0 = r'<.*>'
    # But the pattern above grabs everything from the very first < to the very last >.
    # Adding ? after the asterisk makes it stop at the nearest > instead.
    regex = r'<.*?>'
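
To see the difference between the two patterns, here is a short sketch run on a made-up table row:

```python
import re

row = "<td>apple</td><td>100</td>"

greedy = re.findall(r"<.*>", row)  # longest match: the whole string
lazy = re.findall(r"<.*?>", row)   # shortest match: each tag on its own

print(greedy)  # ['<td>apple</td><td>100</td>']
print(lazy)    # ['<td>', '</td>', '<td>', '</td>']

# With re.sub, the greedy pattern deletes the data along with the tags.
print(repr(re.sub(r"<.*>", "", row)))  # ''
print(re.sub(r"<.*?>", "", row))       # apple100
```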

Reference sites

--I checked the plural form of regex: https://ejje.weblio.jp/content/regexen
--Basic usage of Python regular expressions: https://uxmilk.jp/41416
--Regular expression: Match with the shortest match: http://www-creators.com/archives/1804
--Selenium API (reverse lookup): https://www.seleniumqref.com/api/webdriver_gyaku.html

In closing

I'll update this post as I get more done!

[Part1] Scraping with Python → Organize to csv!
