Hi! This is Jesse. I've just barely made it out of the mansion, and I'm still a beginner. I'm writing this up so that I can refer back to it, and in case there are other people at about the same level! Oh, and I'd be lucky if someone experienced opened this and gave me some advice! Lol
Oh, and I'm using python3!
--Part1: This page! Practice fetching the source of a single page and cleaning it up!
--Part2: https://qiita.com/Jessica_nao_/items/76efc0b99ff18c0e6bd7 I used the Chrome WebDriver to click an element on the page and move to another page! The data was saved as csv.
--Part3: https://qiita.com/Jessica_nao_/items/140b435a9e13054ed78e I figured JSON would be better when the data contains ":" or ",", so I saved it as a list of JSON lists! I think it turned out nicely because I also fixed things up a little.
I'll write down what I did, in order.
I wondered whether to do this with Requests, but I wasn't sure whether it only works for sites with a public API. I didn't know, so I imported selenium's webdriver and used that.
If you don't have selenium installed:
Mac: Try running "python3 -m pip install selenium" in your terminal! (I don't know why, but that was all it took for me!)
Windows: This page should help! → https://www.seleniumqref.com/introduction/python/Python_Sele_Ins.html
sample.py
import re
from selenium import webdriver
from time import sleep

def gettxt():
    # Specify the URL of the web page you want the source of, the path to chromedriver, and the name of the file to save to!
    url = "https://www.sample.com"
    path = "/Users/sample/Downloads/chromedriver"
    filename = "data/sample.txt"
    # Note: recent Selenium 4 releases no longer accept the driver path as a positional argument;
    # there, use webdriver.Chrome(service=Service(path)) with Service from selenium.webdriver.chrome.service.
    driver = webdriver.Chrome(path)
    driver.get(url)
    # Give the page time to finish loading
    sleep(5)
    output = driver.page_source
    with open(filename, "w", encoding="utf8") as f:
        f.write(output)
    sleep(3)
    driver.quit()
sample.py
def trimming():
    # Specify the name of the file to read and the file to write!
    filename = "data/sample.txt"
    filename2 = "data/sample2.txt"
    # Read with the same encoding the file was written with
    with open(filename, encoding="utf8") as f:
        cntnt = f.read()
    # Look through the source for the part you want to keep and enter its start and end strings!
    regexen = [
        r'<tbody><tr class="The beginning class of the information you want">',
        r'</td></tr></tbody></table></div><div class="Next class with the information you want"',
    ]
    # The plural of index is indices
    indices = [0, 0]
    for i in range(0, 2):
        # re.search returns None when nothing matches, in which case .start() fails
        matchObj = re.search(regexen[i], cntnt)
        indices[i] = matchObj.start()
    rslt = cntnt[indices[0]:indices[1]]
    with open(filename2, "w", encoding="utf8") as f2:
        f2.write(rslt)
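The start/end slicing in trimming() can also be tried without a saved file. What follows is only a sketch: the HTML string, the class names "item" and "next", and the helper name slice_between are all made up for illustration. Unlike the code above, it returns None instead of crashing when a marker is not found, and it looks for the end marker only after the start marker.

```python
import re

def slice_between(text, start_pattern, end_pattern):
    """Return text from the start of start_pattern up to (not including) end_pattern."""
    start_match = re.search(start_pattern, text)
    if start_match is None:
        return None
    # Search for the end marker only after the start marker
    end_match = re.compile(end_pattern).search(text, start_match.end())
    if end_match is None:
        return None
    return text[start_match.start():end_match.start()]

# Made-up page source standing in for data/sample.txt
html = (
    '<div class="header">menu</div>'
    '<table><tbody><tr class="item"><td>Apple</td><td>100</td></tr></tbody></table>'
    '<div class="next">footer</div>'
)
trimmed = slice_between(
    html,
    r'<tr class="item">',
    r'</td></tr></tbody></table><div class="next"',
)
print(trimmed)
```

Because the end pattern starts at the last </td>, the trimmed text ends without a closing tag for the final cell.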
I got the error "TypeError: expected string or bytes-like object". It turned out I had simply forgotten the trailing () on the line below, which was a pain to track down! Lol. The () is required even for functions that take no arguments, and I'm not used to that yet.
sample.py
matchObj.start()
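As an illustration of how a missing () can lead to this message, here is a small sketch. It is not the exact code from this page: the in-memory file and its contents are made up, and it shows just one way the error can arise, namely handing re.search a method instead of the string the method would return.

```python
import io
import re

# A made-up in-memory "file" standing in for data/sample.txt
f = io.StringIO('<tr class="item"><td>Apple</td></tr>')

try:
    # Missing (): f.read is the method itself, not the text it would return
    re.search(r'<td>', f.read)
    message = "no error"
except TypeError as e:
    message = str(e)
print(message)

# With (): f.read() returns a str, so the search works
matchObj = re.search(r'<td>', f.read())
print(matchObj.start())
```

Without the parentheses Python does not complain at the line where they are missing; the error only shows up later, when something tries to use the method as if it were a value.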
First, convert the row endings into line breaks. Next, add the commas. Finally, erase all the remaining tags!
I wrote it as a loop over a number and two lists, but a Dictionary or Map would have been better, because you can see which pattern goes with which replacement! Lol
I've written out the commented part again further down!
sample.py
def removeTag():
    filename2 = "data/sample2.txt"
    filename3 = "data/sample3.csv"
    # "." is any single character and "*" repeats it zero or more times, so ".*" is an arbitrary string!
    regex0 = r'<.*>'
    # But the pattern above grabs everything from the very first < to the very last > (on the same line).
    # Adding ? after the asterisk makes it stop at the nearest > instead.
    regex = r'<.*?>'
    regexen = [
        r'</tr>',
        r'</td>',
        r'<.*?>',
    ]
    after = [
        "\n",
        ",",
        "",
    ]
    with open(filename2, encoding="utf8") as f:
        contents = f.read()
    for i in range(0, 3):
        contents = re.sub(regexen[i], after[i], contents)
    # The trimmed text ends without a closing </td>, so add the last comma by hand
    contents += ","
    with open(filename3, "w", encoding="utf8") as f:
        f.write(contents)
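The "Dictionary or Map was better" idea could look something like the sketch below. The function name tags_to_csv and the sample string are made up for illustration; the three patterns and replacements are the same ones used in removeTag(). It relies on dictionaries keeping their insertion order (Python 3.7+), since the tag-erasing pattern has to run last.

```python
import re

# Pattern -> replacement, applied in this order
replacements = {
    r'</tr>': "\n",  # the end of a row becomes a line break
    r'</td>': ",",   # the end of a cell becomes a comma
    r'<.*?>': "",    # every remaining tag is erased
}

def tags_to_csv(contents):
    for pattern, replacement in replacements.items():
        contents = re.sub(pattern, replacement, contents)
    # The input ends without a closing </td>, so add the last comma by hand
    return contents + ","

# Made-up trimmed source standing in for data/sample2.txt
sample = '<tr class="a"><td>Apple</td><td>100</td></tr><tr><td>Banana</td><td>200'
csv_text = tags_to_csv(sample)
print(csv_text)
```

Each pattern now sits right next to its replacement, so there is no index to keep in sync between two lists.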
I didn't know this at all, but a regular expression doesn't stop at the first place where the match could end! Rather, by default it fetches the longest match it can. What counts as "normal" is difficult ...
sample.py
# "." is any single character and "*" repeats it zero or more times, so ".*" is an arbitrary string!
regex0 = r'<.*>'
# But the pattern above grabs everything from the very first < to the very last > (on the same line).
# Adding ? after the asterisk makes it stop at the nearest > instead.
regex = r'<.*?>'
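The difference between the two patterns is easy to see on a tiny string. This is just a sketch with a made-up sample text; the two patterns are the ones from the snippet above.

```python
import re

text = "<b>bold</b> text"

# Greedy: runs from the first < to the last >
greedy = re.search(r'<.*>', text).group()
# Non-greedy: stops at the nearest >
lazy = re.search(r'<.*?>', text).group()
print(greedy)
print(lazy)

# The same difference when erasing tags
greedy_removed = re.sub(r'<.*>', "", text)
lazy_removed = re.sub(r'<.*?>', "", text)
print(repr(greedy_removed))
print(repr(lazy_removed))
```

With the greedy pattern the word "bold" disappears along with the tags, which is why removeTag() uses the version with the ?.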
--I checked the plural form of regex: https://ejje.weblio.jp/content/regexen
--Basic usage of Python regular expressions: https://uxmilk.jp/41416
--Regular expression: Match with the shortest match: http://www-creators.com/archives/1804
--Selenium API (reverse lookup): https://www.seleniumqref.com/api/webdriver_gyaku.html
I'll update this as I learn to do more!