I wrote this article because I wanted to scrape a website with Python. Some sites temporarily deny access when they keep receiving requests from the same IP address for a certain period of time. Against a site like that, scraping may not work well, so I tried scraping while disguising the IP address.
I have only confirmed this on macOS, so the procedure is probably slightly different on other platforms, especially Windows.
By the way, "disguising" sounds bad, but it is not bad in itself. Of course, when scraping, please pace your program's requests so that you do not put a load on the target server.
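As a rough illustration of pacing requests, the sketch below pauses between consecutive fetches. The function name `fetch_politely`, the `fetch` callable and the one-second default are assumptions made for this example, not something taken from a particular library.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns a result
    (hypothetical here; for example a thin wrapper around requests.get).
    `sleep` is injectable so the pacing can be checked without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With requests installed, something like `fetch_politely(url_list, lambda u: requests.get(u, timeout=30).text, delay_seconds=2.0)` would fetch each page with a two-second gap. A suitable delay depends on the target site.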
Please install Python 3. (I think it will also work with Python 2, but I have not confirmed that.)
requests is a library for calling external URLs (APIs) from Python. It is similar to Ajax in JavaScript.
Install it with the following command.
pip install requests
beautifulsoup4 is a library that lets you extract content under more detailed conditions after fetching the text with requests.
pip install beautifulsoup4
tor
tor enables anonymous communication. It is used here to disguise the IP address. https://www.torproject.org/
Install it with the following command.
brew install tor
After the installation is complete, run the following command.
tor
Various startup messages will be printed. Startup is complete when the following line appears.
Jan 28 00:29:59.000 [notice] Bootstrapped 100% (done): Done
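If you want a script to detect that line instead of watching the terminal, a minimal sketch could pull the percentage out of each log line. The helper name `bootstrap_percent` is made up for this example, and the pattern assumes the log wording shown above.

```python
import re

# Matches the "Bootstrapped NN%" part of a tor notice line.
BOOTSTRAP_PATTERN = re.compile(r"Bootstrapped (\d+)%")


def bootstrap_percent(log_line):
    """Return the bootstrap percentage found in a log line, or None."""
    match = BOOTSTRAP_PATTERN.search(log_line)
    return int(match.group(1)) if match else None
```

A caller could read tor's output line by line and treat a return value of 100 as "ready".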
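Then start tor as a background service. (If the tor process from the previous step is still running in the foreground, stop it with Ctrl+C first, since both would try to listen on the same port.)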
brew services start tor
It is fine if the output contains the word **Successfully**.
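To double-check from Python that tor is accepting connections, a small sketch like the following tries to open a TCP connection to the SOCKS port. It assumes the port is 9050 on 127.0.0.1, which is the address used in the script later in this article. The helper name `is_port_open` is made up for this example.

```python
import socket


def is_port_open(host="127.0.0.1", port=9050, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused, timeout)
        # when nothing is listening on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

This only shows that something is listening on the port. It does not prove that the listener is tor or that the tor circuit is ready.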
Now let's write the Python code. This time, I access a URL that reports my own IP address and look at the result.
You can check your IP address at the following site. https://grupo.jp/myip/
test.py
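# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
# Fetch the page that reports the caller's IP address.
get = requests.get('https://grupo.jp/myip/').text
soup = BeautifulSoup(get, 'html.parser')
# The table with class "pubwaku" holds the IP address and remote host rows.
ip = soup.find('table', class_='pubwaku')
print(ip)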
First, run it as ordinary scraping, without tor.
python test.py
Results like the following are returned. A lot of HTML comes back, so look for the place where the IP address and remote host are written, as shown below.
<tr><th>IP address</th><td style="font-size:18px;font-weight:bold;">153.999.999.99</td><td class="commentary">The IP address you are currently connecting from</td></tr>
<tr><th>Remote host</th><td>p554999-************.*****.ne.jp</td><td class="commentary">Host name associated with an IP address</td></tr>
**IP address** 153.999.999.99
**Remote host** p554999-************.*****.ne.jp
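Instead of reading the raw HTML by eye, the rows can be turned into a dictionary. The following is a minimal sketch: it assumes the rows shown above sit inside the `pubwaku` table that test.py selects, and it uses a made-up sample (`SAMPLE_HTML`, with a documentation-range IP address and an example host name) rather than real output. The function name `extract_rows` is also made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample modeled on the rows shown above (not real output).
SAMPLE_HTML = """
<table class="pubwaku">
<tr><th>IP address</th><td>203.0.113.7</td><td class="commentary">note</td></tr>
<tr><th>Remote host</th><td>host.example.net</td><td class="commentary">note</td></tr>
</table>
"""


def extract_rows(html, table_class="pubwaku"):
    """Map each row's <th> label to the text of its first <td>."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_=table_class)
    if table is None:
        return {}
    rows = {}
    for row in table.find_all("tr"):
        label = row.find("th")
        value = row.find("td")
        # Skip rows that do not have both a label and a value cell.
        if label is not None and value is not None:
            rows[label.get_text(strip=True)] = value.get_text(strip=True)
    return rows


print(extract_rows(SAMPLE_HTML))
```

If the real page is laid out differently, the selector and the row handling would need to be adjusted.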
test.py
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
# Route both HTTP and HTTPS traffic through the local tor SOCKS5 proxy.
proxies = dict(http='socks5://127.0.0.1:9050',
               https='socks5://127.0.0.1:9050')
get = requests.get('https://grupo.jp/myip/', proxies=proxies).text
soup = BeautifulSoup(get, 'html.parser')
ip = soup.find('table', class_='pubwaku')
print(ip)
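The only change is the proxies argument passed to requests.get, which sends both HTTP and HTTPS requests through the tor SOCKS5 proxy listening on 127.0.0.1:9050. Note that requests needs its optional SOCKS dependency for this, which can be installed with pip install "requests[socks]".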
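If the same proxy settings are needed in several places, they can be built by a small helper. This is only a sketch: the name `tor_proxies` and its parameters are made up for this example. The `socks5h` scheme is the one requests documents for resolving host names on the proxy side. The script above uses plain `socks5`, which resolves them locally.

```python
def tor_proxies(host="127.0.0.1", port=9050, remote_dns=True):
    """Build a proxies dict for requests that points at a local tor SOCKS port.

    remote_dns=True uses the socks5h scheme (DNS lookups happen on the proxy);
    remote_dns=False uses socks5 (DNS lookups happen on this machine).
    """
    scheme = "socks5h" if remote_dns else "socks5"
    proxy_url = f"{scheme}://{host}:{port}"
    # The same proxy is used for both plain HTTP and HTTPS requests.
    return {"http": proxy_url, "https": proxy_url}
```

It would be used as `requests.get(url, proxies=tor_proxies())`. Passing `remote_dns=False` reproduces the settings in test.py.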
Run
python test.py
Let's look at the result. Again, look for the place where the IP address and remote host are written.
<tr><th>IP address</th><td style="font-size:18px;font-weight:bold;">82.223.99.999</td><td class="commentary">The IP address you are currently connecting from</td></tr>
<tr><th>Remote host</th><td>tornode3.*******.net</td><td class="commentary">Host name associated with an IP address</td></tr>
**IP address** 82.223.99.999
**Remote host** tornode3.*******.net
As you can see, not only the IP address but also the remote host has changed.
Restart tor
brew services restart tor
Run test.py
python test.py
Check the result.
<tr><th>IP address</th><td style="font-size:18px;font-weight:bold;">109.70.999.99</td><td class="commentary">The IP address you are currently connecting from</td></tr>
<tr><th>Remote host</th><td>tor-exit-anonymizer.********.net</td><td class="commentary">Host name associated with an IP address</td></tr>
**IP address** 109.70.999.99
**Remote host** tor-exit-anonymizer.********.net
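Because tor needs a little time after a restart, a script that wants the new address may have to retry. The sketch below polls until the reported IP address differs from the previous one. `wait_for_new_ip` and the `get_ip` callable are hypothetical names (for example, `get_ip` could wrap the request from test.py), and the attempt count and delay are arbitrary assumptions.

```python
import time


def wait_for_new_ip(get_ip, old_ip, attempts=5, delay_seconds=5.0, sleep=time.sleep):
    """Poll get_ip() until it returns an address other than old_ip.

    Returns the new address, or None if it never changed within `attempts`.
    """
    for attempt in range(attempts):
        try:
            current = get_ip()
        except Exception:
            # Right after a restart the proxy may not accept connections yet,
            # so a failed lookup is treated as "not ready" and retried.
            current = None
        if current and current != old_ip:
            return current
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return None
```

Catching every exception keeps the sketch short. A real script would likely catch only the connection errors raised by its HTTP client.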
What do you think? As shown above, the IP address can be disguised quite easily. That does not mean, however, that IP checks are useless against DoS attacks. To change the IP address you have to restart tor, which takes some time, so it is difficult to attack from different IP addresses hundreds of times per second. A program that temporarily rejects the same IP address once it exceeds a certain number of accesses is therefore effective to some extent. **However, it is not effective against DDoS attacks.**
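To make the server-side idea concrete, here is a minimal sketch of a per-IP limiter that blocks an address for a while once it exceeds a request count within a time window. The class name `IpRateLimiter` and all thresholds are assumptions chosen for illustration. It is not how any particular site implements its check.

```python
import time
from collections import defaultdict, deque


class IpRateLimiter:
    """Temporarily block an IP that sends too many requests in a short window."""

    def __init__(self, max_requests=10, window_seconds=1.0,
                 block_seconds=60.0, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.block_seconds = block_seconds
        self.clock = clock                 # injectable so behavior can be tested
        self._hits = defaultdict(deque)    # ip -> timestamps of recent requests
        self._blocked_until = {}           # ip -> time at which the block ends

    def allow(self, ip):
        """Return True if the request from `ip` should be served."""
        now = self.clock()
        blocked_until = self._blocked_until.get(ip)
        if blocked_until is not None and now < blocked_until:
            return False
        hits = self._hits[ip]
        # Forget requests that have fallen out of the counting window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        hits.append(now)
        if len(hits) > self.max_requests:
            # Too many requests in the window: start a temporary block.
            self._blocked_until[ip] = now + self.block_seconds
            hits.clear()
            return False
        return True
```

Each IP address has its own counter, so a client that comes back with a different address starts from zero. That is why this kind of check slows down a single source but does little against many sources at once.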
Please do not use scraping for wasteful access or mischief.