I am trying web scraping with urllib and BeautifulSoup in Python 3. Last time, I dealt with a communication error caused by a proxy: "What to do if there is no response due to proxy settings in Python web scraping". Communication over http worked well with that method, but as soon as the target was an https site, the connection failed with an error. That is a real problem, since most websites these days use https. Adding an "https" entry to proxies, as shown below, did not solve it.

proxies={"http":"http://proxy.-----.co.jp/proxy.pac", "https":"http://proxy.-----.co.jp/proxy.pac"}
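For context, the urllib setup from the previous article presumably looked something like the following minimal sketch (the proxy host is masked with "-----" just as above, and the target URL is only an example):

import urllib.request

# Proxy settings for both schemes (placeholder host)
proxies = {
    "http": "http://proxy.-----.co.jp/proxy.pac",
    "https": "http://proxy.-----.co.jp/proxy.pac",
}

# Register the proxy with urllib's global opener
proxy_handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)

# http:// URLs worked this way; https:// URLs failed with a connection error
with urllib.request.urlopen("http://example.com/") as response:
    print(response.read()[:200])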
While researching the problem, I came across a library called requests. I tried using it instead of urllib, and it solved the problem surprisingly easily.
An example of how to use it is as follows.
requests_sample.py
import requests

# Set the proxy for both http and https (the proxy host is masked here)
proxies = {
    "http": "http://proxy.-----.co.jp/proxy.pac",
    "https": "http://proxy.-----.co.jp/proxy.pac"
}

# Unlike the urllib approach, https works with no extra configuration
r = requests.get('https://github.com/timeline.json', proxies=proxies)
print(r.text)
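As a side note, it can help to verify the response before using it. A small sketch, assuming the same placeholder proxy settings as above:

import requests

proxies = {
    "http": "http://proxy.-----.co.jp/proxy.pac",
    "https": "http://proxy.-----.co.jp/proxy.pac"
}

r = requests.get('https://github.com/timeline.json', proxies=proxies)
r.raise_for_status()  # raises requests.exceptions.HTTPError on 4xx/5xx responses
print(r.status_code, r.headers.get('Content-Type'))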
When using BeautifulSoup, it seems you just pass the content attribute of the Response object returned by requests.get to the BeautifulSoup constructor. Here is a simple sample.
requests_beautifulsoup_sample.py
import requests
from bs4 import BeautifulSoup

# Same proxy settings as before (the proxy host is masked here)
proxies = {
    'http': 'http://proxy.-----.co.jp/proxy.pac',
    'https': 'http://proxy.-----.co.jp/proxy.pac'
}

def getBS(url):
    # Fetch the page through the proxy and parse the HTML
    html = requests.get(url, proxies=proxies)
    bsObj = BeautifulSoup(html.content, "html.parser")
    return bsObj

htmlSource = getBS("https://en.wikipedia.org/wiki/Kevin_Bacon")

# Show the links that exist on the page
for link in htmlSource.findAll("a"):
    if 'href' in link.attrs:
        print(link.attrs['href'])
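A small variation, in case it is useful: find_all (the BeautifulSoup 4 name for findAll) can filter on the attribute directly, which makes the loop shorter. This snippet continues from the sample above and assumes the same htmlSource object:

# Only <a> tags that actually have an href attribute are returned,
# so the explicit 'href' in link.attrs check is no longer needed
for link in htmlSource.find_all("a", href=True):
    print(link["href"])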
The requests library was already included when I installed Python 3.5.2 with Anaconda. You can check the installed packages with Anaconda Navigator; if you installed the GUI version on Windows, you can find it under Windows -> All Programs -> Anaconda3 -> Anaconda Navigator.
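If you would rather check from code than from the Navigator, a quick sanity check is simply to import the library and print its version (just a sketch; this works in any Python interpreter, not only Anaconda's):

# Confirms that requests is importable and shows which version is installed
import requests
print(requests.__version__)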
See the Quickstart of the requests library for more details.