I want to scrape a web page, specifically the HTML after JavaScript has run.
I first planned to scrape with curl or PHP, but got stuck when I realized that curl only returns the source as the server sends it, before any JavaScript is executed.
After some investigation, I narrowed it down to the following two candidates.
PhantomJS
There was plenty of information and it looked usable as is, but I found that development ended in June 2018 and it is no longer supported.
Selenium + WebDriver
When I looked it up, there was plenty of information, including many recent articles, so I decided to try Selenium first.
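To illustrate why curl alone falls short, here is a minimal sketch, assuming Python 3 and a made-up page (`RAW_HTML`, `DivTextCollector` and `static_div_text` are hypothetical names, not taken from any real site or library). It parses the HTML exactly as a server would send it and shows that the `<div>` the script is supposed to fill is still empty, because nothing has executed the JavaScript.

```python
from html.parser import HTMLParser

# Hypothetical page: the server sends an empty <div>, and the script fills it in the browser.
RAW_HTML = """
<html><body>
<div id="content"></div>
<script>document.getElementById("content").textContent = "Hello from JS";</script>
</body></html>
"""


class DivTextCollector(HTMLParser):
    """Collects the text found inside <div id="content"> (no nested divs expected)."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data)


def static_div_text(html):
    """Return the text of the content div as seen without running any JavaScript."""
    parser = DivTextCollector()
    parser.feed(html)
    parser.close()
    return "".join(parser.text).strip()


# The script is never executed by a plain parser, so the div text is an empty string.
print(repr(static_div_text(RAW_HTML)))
```

A real browser driven by Selenium would run the script first, which is the reason for the setup that follows.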
python
pip
chromedriver
selenium
Since I am using a Mac, which ships with Python as standard, I will skip the Python installation.
$ curl -kL https://bootstrap.pypa.io/get-pip.py | python
Execution result
$ curl -kL https://bootstrap.pypa.io/get-pip.py | python
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1841k  100 1841k    0     0  4630k      0 --:--:-- --:--:-- --:--:-- 4649k
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting pip
Downloading pip-20.2.3-py2.py3-none-any.whl (1.5 MB)
|████████████████████████████████| 1.5 MB 4.0 MB/s
Collecting wheel
Downloading wheel-0.35.1-py2.py3-none-any.whl (33 kB)
Installing collected packages: pip, wheel
WARNING: The scripts pip, pip2 and pip2.7 are installed in '/Users/xxx/Library/Python/2.7/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The script wheel is installed in '/Users/xxx/Library/Python/2.7/bin' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed pip-20.2.3 wheel-0.35.1
The warnings say that the install directory is not on PATH, so add it to PATH.
$ export PATH="$HOME/Library/Python/2.7/bin:$PATH"
$ echo 'export PATH="$HOME/Library/Python/2.7/bin:$PATH"' >> ~/.bash_profile
Check that the PATH has been updated.
$ echo $PATH
/Users/xxx/Library/Python/2.7/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin
$ cat ~/.bash_profile
export PATH="$HOME/Library/Python/2.7/bin:$PATH"
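The same check can be scripted. The following is a minimal sketch, assuming Python 3 on macOS or another POSIX system; `is_on_path` is a hypothetical helper name, and `example_path` simply copies the `echo $PATH` output shown above.

```python
import os


def is_on_path(directory, path_value=None):
    """Return True if `directory` is one of the entries of a PATH-style string."""
    if path_value is None:
        # Fall back to the PATH of the current process
        path_value = os.environ.get("PATH", "")
    wanted = os.path.normpath(os.path.expanduser(directory))
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in path_value.split(os.pathsep)
        if entry
    ]
    return wanted in entries


# PATH value copied from the output of `echo $PATH` in this article
example_path = "/Users/xxx/Library/Python/2.7/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin"
print(is_on_path("/Users/xxx/Library/Python/2.7/bin", example_path))
print(is_on_path("/opt/missing/bin", example_path))
```

Calling `is_on_path("~/Library/Python/2.7/bin")` with no second argument checks the PATH of the running process instead.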
The pip command should now be available, so check it.
$ pip -V
pip 20.2.3 from /Users/xxx/Library/Python/2.7/lib/python/site-packages/pip (python 2.7)
First, check the version of Chrome you are currently using on your computer.
Version: 85.0.4181.101
The major version is 85, so install the matching chromedriver with the following command.
pip install chromedriver-binary==85.*
Execution result
$ pip install chromedriver-binary==85.*
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting chromedriver-binary==85.*
Downloading chromedriver-binary-85.0.4183.87.0.tar.gz (3.6 kB)
Building wheels for collected packages: chromedriver-binary
Building wheel for chromedriver-binary (setup.py) ... done
Created wheel for chromedriver-binary: filename=chromedriver_binary-85.0.4183.87.0-py2-none-any.whl size=7722067 sha256=901454e21156aef8f8bf4b0e302098747ea378a435c801330ea46d03ed
Stored in directory: /Users/xxx/Library/Caches/pip/wheels/12/27/b7/69d38bfd65642b45a64e7e97e3160aba20f20be91cd5a
Successfully built chromedriver-binary
Installing collected packages: chromedriver-binary
Successfully installed chromedriver-binary-85.0.4183.87.0
$
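The step from the Chrome version to the pip requirement can be sketched as follows. This is only an illustration of the rule used above (pin chromedriver-binary to Chrome's major version); `chromedriver_requirement` is a hypothetical helper, not part of pip or chromedriver-binary.

```python
def chromedriver_requirement(chrome_version):
    """Build a pip requirement that pins chromedriver-binary to Chrome's major version."""
    major = chrome_version.strip().split(".")[0]
    if not major.isdigit():
        raise ValueError("unexpected Chrome version: %r" % chrome_version)
    return "chromedriver-binary==%s.*" % major


# Chrome version taken from this article
print(chromedriver_requirement("85.0.4181.101"))  # prints chromedriver-binary==85.*
```

When passing the result to pip on the command line, quoting it (`pip install "chromedriver-binary==85.*"`) keeps shells such as zsh from trying to expand the `*`.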
Command used
pip install selenium
Execution result
$ pip install selenium
DEPRECATION: Python 2.7 reached the end of its life on January 1st, 2020. Please upgrade your Python as Python 2.7 is no longer maintained. pip 21.0 will drop support for Python 2.7 in January 2021. More details about Python 2 support in pip can be found at https://pip.pypa.io/en/latest/development/release-process/#python-2-support pip 21.0 will remove support for this functionality.
Defaulting to user installation because normal site-packages is not writeable
Collecting selenium
Downloading selenium-3.141.0-py2.py3-none-any.whl (904 kB)
|████████████████████████████████| 904 kB 5.2 MB/s
Collecting urllib3
Downloading urllib3-1.25.10-py2.py3-none-any.whl (127 kB)
|████████████████████████████████| 127 kB 10.7 MB/s
Installing collected packages: urllib3, selenium
Successfully installed selenium-3.141.0 urllib3-1.25.10
$
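To confirm what was actually installed, a short check like the one below can help. It is a sketch that assumes Python 3.8 or later (where `importlib.metadata` is in the standard library), so it will not run on the bundled Python 2.7; `installed_version` is a hypothetical helper name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Package names as installed in this article
for name in ("selenium", "chromedriver-binary"):
    print(name, installed_version(name))
```

A `None` in the output means the package is not visible to the Python interpreter that ran the script, which usually points to pip and python belonging to different installations.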
Now you are ready to go.
test.py
# Importing chromedriver_binary adds the bundled chromedriver to PATH
import chromedriver_binary
from selenium import webdriver
options = webdriver.ChromeOptions()
# options.add_argument('--incognito')
# options.add_argument('--headless')
print('connect...try...connect...try...')
driver = webdriver.Chrome(options=options)
driver.get('https://qiita.com')
print(driver.current_url)
# Uncomment to close the browser when the script finishes
# driver.quit()
Run
$ python test.py
This will bring up the Chrome browser. I'm happy.
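Since the original goal was the HTML after JavaScript has run, the next step could look like the sketch below. It assumes Python 3 and a Selenium-style `driver` object with `get`, `execute_script` and `page_source`; `fetch_rendered_html` is a hypothetical helper. Waiting for `document.readyState` is only a rough signal: pages that keep loading data afterwards may need a more specific wait.

```python
import time


def fetch_rendered_html(driver, url, timeout=10.0, poll_interval=0.5):
    """Load `url` and return the page source once the document reports it has loaded."""
    driver.get(url)
    deadline = time.monotonic() + timeout
    # document.readyState becomes "complete" once the page and its resources have loaded
    while driver.execute_script("return document.readyState") != "complete":
        if time.monotonic() >= deadline:
            raise RuntimeError("page did not finish loading within %s seconds" % timeout)
        time.sleep(poll_interval)
    # With a real browser this is the current DOM, including changes made by JavaScript
    return driver.page_source


# Usage with the driver from test.py (requires Chrome and chromedriver):
# html = fetch_rendered_html(driver, 'https://qiita.com')
# print(len(html))
```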
To launch Chrome in an incognito window, uncomment the following line.
options.add_argument('--incognito')
To run Chrome headless (without a visible window), uncomment the following line.
options.add_argument('--headless')
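Instead of commenting lines in and out, the two switches could be driven by flags. The following is a minimal sketch; `chrome_arguments` and `apply_arguments` are hypothetical helpers, and the only Selenium call assumed is `add_argument` on the options object, as used in test.py.

```python
def chrome_arguments(incognito=False, headless=False):
    """Return the list of Chrome command-line switches for the requested modes."""
    arguments = []
    if incognito:
        arguments.append("--incognito")
    if headless:
        arguments.append("--headless")
    return arguments


def apply_arguments(options, arguments):
    """Add each switch to a ChromeOptions-style object and return it."""
    for argument in arguments:
        options.add_argument(argument)
    return options


print(chrome_arguments(headless=True))  # prints ['--headless']

# Usage with Selenium (requires selenium to be installed):
# from selenium import webdriver
# options = apply_arguments(webdriver.ChromeOptions(), chrome_arguments(headless=True))
```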
From here, you should be able to scrape whatever you need by reading up on Selenium and XPath.
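As a starting point, selecting elements by XPath could look like the sketch below. It assumes a Selenium driver whose `find_elements(by, value)` method accepts the locator strategy as a string ("xpath" is the value of Selenium's `By.XPATH` constant); `collect_link_urls` and the `//a[@href]` expression are illustrative choices, not something the target site requires.

```python
def collect_link_urls(driver, xpath="//a[@href]"):
    """Return the href attribute of every element matched by `xpath` on the current page."""
    # "xpath" is the locator strategy string behind selenium's By.XPATH
    elements = driver.find_elements("xpath", xpath)
    return [element.get_attribute("href") for element in elements]


def main():
    """Open the page from test.py headless and print every link URL found on it."""
    import chromedriver_binary  # adds the bundled chromedriver to PATH
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get('https://qiita.com')
        for url in collect_link_urls(driver):
            print(url)
    finally:
        # Always close the browser, even if scraping fails
        driver.quit()


# main()  # uncomment to run against a real browser
```

Wrapping the work in `try`/`finally` is a design choice so that a headless Chrome process is not left running when an error occurs.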
The version of Python bundled with the Mac this time was 2.7, which is old: its support ended in January 2020. I don't usually use Python, so I left it as it is, but that is why a DEPRECATION message about 2.7 appears in the output of each command. Please forgive me m(_ _)m
pip installation https://qiita.com/suzuki_y/items/3261ffa9b67410803443 https://qiita.com/tom-u/items/134e2b8d4e11feea8e12
Selenium setup https://qiita.com/Chanmoro/items/9a3c86bb465c1cce738a
Summary of how to select elements in Selenium https://qiita.com/VA_nakatsu/items/0095755dc48ad7e86e2f
Scraping Xpath https://qiita.com/rllllho/items/cb1187cec0fb17fc650a