Warning: This article does not recommend scraping with Tor.
Scraping itself is generally acceptable, but you may be held legally liable if the target site's terms of use prohibit it or if you overload the target site's server.
Tor is a technology that anonymizes the connection route. In theory, when a site is accessed through Tor, it is difficult to determine who accessed it.
Homebrew 2.2.4
pip 20.0.2
Python 3.7.3
First, let's check the global IP address without Tor. The global IP address can be read from http://checkip.dyndns.com/, and whether you are using Tor can be checked by fetching the HTML from https://check.torproject.org/.
The script uses Beautiful Soup to parse the HTML, so install it first.
#Install beautifulsoup4 with pip
$ pip install beautifulsoup4
#Verification
$ pip list | grep beautifulsoup4
beautifulsoup4 4.7.1
import urllib.request, urllib.error
from bs4 import BeautifulSoup

# Returns the parsed HTML for a URL
def fetch_html(url):
    res = urllib.request.urlopen(url)
    return BeautifulSoup(res, 'html.parser')

# Returns the current global IP address
def get_ip_addr():
    html = fetch_html('http://checkip.dyndns.com/')
    return html.body.text.split(': ')[1]

# Returns True if the connection goes through Tor
def check_use_tor():
    html = fetch_html('https://check.torproject.org/')
    return html.find('h1')['class'][0] != 'off'

print('You are using tor.' if check_use_tor() else 'You are not using tor.')
print('Current IP address is ' + get_ip_addr())
Execution result
You are not using tor.
Current IP address is XXX.XXX.XX.XXX
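The get_ip_addr function above depends on the page body containing exactly one ': ' separator. As a rough sketch of a more tolerant alternative, the address could be pulled out with a regular expression instead. The helper name extract_ip and the pattern are my own assumptions, not part of the original script, and the pattern only matches the dotted IPv4 shape without validating the range of each octet.

```python
import re

# Hypothetical helper: matches four dot-separated groups of 1-3 digits.
# It does not check that each group is <= 255.
IPV4_PATTERN = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')

def extract_ip(text):
    """Return the first IPv4-looking string in text, or None if none is found."""
    match = IPV4_PATTERN.search(text)
    return match.group(0) if match else None
```

With this in place, get_ip_addr could return extract_ip(html.body.text), which also drops any trailing whitespace around the address.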
If you're using macOS, you can install Tor with Homebrew. I also use brew services start to run it as a daemon.
$ brew install tor
$ brew services start tor
#Verification
$ tor --version
Tor version 0.4.2.6.
$ brew services list | grep tor
tor started your_name /Users/your_name/Library/LaunchAgents/homebrew.mxcl.tor.plist
To stop or restart Tor, run the following commands.
$ brew services stop tor
$ brew services reload tor
Also, although it is not covered in this article, the config file is /usr/local/etc/tor/torrc.
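Besides brew services list, it can be handy to confirm from Python that the daemon is actually accepting connections. The following is a minimal sketch using only the standard library, assuming Tor listens on its default SOCKS port 9050 on localhost. The helper name is_port_open is hypothetical.

```python
import socket

def is_port_open(host='127.0.0.1', port=9050, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print('Tor SOCKS port is open.' if is_port_open() else 'Tor SOCKS port is not reachable.')
```

Note that a successful connection only shows that something is listening on the port. It does not prove that the listener is Tor.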
Routing the requests through Tor uses PySocks, so install it as well.
$ pip install PySocks
#Verification
$ pip list | grep PySocks
PySocks 1.7.1
Tor exposes a SOCKS5 proxy at socks5://localhost:9050, so add the following to the code from section 1, before any request is made:
import socks, socket
socks.set_default_proxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
socket.socket = socks.socksocket
Execution result
You are using tor.
Current IP address is YY.YYY.YYY.YY
Make sure that the global IP address displayed differs from the one shown when you ran the code in section 1. The IP address seen when using Tor changes at regular intervals.
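Assigning to socket.socket sends every connection made by the process through the proxy. If only some requests should go through Tor, one possible approach is to apply the replacement temporarily and restore the original class afterwards. The sketch below is an assumption on my part rather than something from the steps above, and the name patched_socket is hypothetical.

```python
import contextlib
import socket

@contextlib.contextmanager
def patched_socket(socket_class):
    """Temporarily replace socket.socket with socket_class, then restore it."""
    original = socket.socket
    socket.socket = socket_class
    try:
        yield
    finally:
        # Always restore the original class, even if the body raises.
        socket.socket = original

# Usage sketch (assumes PySocks is installed and Tor is running on port 9050):
#   import socks
#   socks.set_default_proxy(socks.PROXY_TYPE_SOCKS5, '127.0.0.1', 9050)
#   with patched_socket(socks.socksocket):
#       print(get_ip_addr())   # goes through Tor
#   print(get_ip_addr())       # direct connection again
```

This is not thread-safe, because the replacement is visible to every thread while the with block is active.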