This time I wrote code to collect text from a website using Python and Selenium, so I will summarize it here.
Selenium was originally built for automated testing of web applications, but it can also drive a real web browser, which lets you operate a website programmatically.
To explain why we decided to scrape the web with Python and Selenium this time: for the reason above (Selenium can drive an actual browser), we use not only urlopen from urllib.request, which is commonly used for web scraping, but also Selenium.
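For comparison, a minimal sketch of the urlopen approach is shown below. The URL is only a placeholder, and this only retrieves text that is already present in the raw HTML (it cannot handle pages that need browser interaction):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.yahoo.co.jp/")            # placeholder URL; fetch the raw HTML
bs_obj = BeautifulSoup(html.read(), "html.parser")   # parse the downloaded HTML
print(bs_obj.title)                                  # e.g. print the <title> tag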
Basic web scraping flow with Selenium and Python
from selenium import webdriver
from bs4 import BeautifulSoup


class Crawler(object):
    def main(self, url):
        if url is not None:
            # Exception handling: stop if the browser cannot start or the URL cannot be reached
            try:
                browser = webdriver.PhantomJS()  # Create an object to operate the browser
                browser.get(url)                 # Access the URL
            except Exception as e:
                print(e)
                return
            html_source = browser.page_source                    # Page source of the visited site
            bs_obj = BeautifulSoup(html_source, "html.parser")   # Create a BeautifulSoup object from the page source
            print(url)
            print(html_source)
            print(bs_obj)
            browser.quit()


if __name__ == "__main__":
    cw = Crawler()
    cw.main("http://www.yahoo.co.jp/")
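Running the script prints the URL, the raw page source, and the parsed BeautifulSoup object for the Yahoo top page. Note that newer Selenium releases have removed PhantomJS support; if that applies to your environment, the same flow can be run with headless Chrome, roughly as in the sketch below (this assumes chromedriver is installed and on your PATH):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")            # run Chrome without a visible window
browser = webdriver.Chrome(options=options)   # assumes chromedriver is available on PATH
browser.get("http://www.yahoo.co.jp/")
bs_obj = BeautifulSoup(browser.page_source, "html.parser")
print(bs_obj.title)
browser.quit()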
References (Selenium/BeautifulSoup):
- Basic usage of Selenium
- Basic usage of BeautifulSoup