There is some data on the web whose value changes in real time. I wanted a program that checks that value at regular intervals, but writing plain scraping code was a hassle because the page requires a login. So I decided to drive a web browser with selenium and scrape through it. I will summarize the process as a memorandum.
I could have run the browser as a batch job on the PC at hand, but having a browser pop up on the machine I use every day would get in the way, so instead everything runs unattended on a rental server (Ubuntu 16.04).
More specifically, the plan looks like this.
(1) Launch a web browser from python → explained in Part 1
(2) Operate the web browser with selenium and process the web data → explained in Part 2
(3) Store the processed data in MongoDB → Part 3 (this post)
(4) Run the python program that does (1) to (3) automatically with cron → Part 3 (this post)
(5) If the value fluctuates beyond a certain amount, send a notification by e-mail → bonus
The program that automatically fetches specific data from the web was finished in Part 1 and Part 2, so now let's configure cron to run it automatically.
OS: Ubuntu 16.04 (Sakura VPS)
python: version 3.5
MongoDB: version 2.6.10
PhantomJS: version 2.1.1
#Checking the operation of cron
sudo service cron status
#Edit cron config file
crontab -e
Add the following entry to the crontab:
*/5 * * * * <path_to_python>/python3 <path_to_file>/test.py >> <path_to_log>/test.log 2>&1
Register the python program created in Part 1 and Part 2 as the job, as above. With this, the browser starts every 5 minutes and fetches the target data from the target site.
From here on is a bonus. Since I built this fixed-point observation program on a mostly default Ubuntu setup, let me also note how the data is stored in a DB.
If writing the output to a plain txt file is enough, there is no need to bother with a database;
File output
f = open("test.txt", "a+")
f.write(data)
f.close()
is all it takes.
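If you do go the txt-file route, a with block is slightly safer, since the file is closed even if the write fails. A minimal sketch, assuming data is already a string:

#Append one observation and close the file automatically
with open("test.txt", "a+") as f:
    f.write(data + "\n")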
I'm actually storing it in MongoDB.
Follow the steps below to install.
1) Register the public key
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 7F0CEB10
2) Create mongodb.list
echo 'deb http://downloads-distro.mongodb.org/repo/ubuntu-upstart dist 10gen' | sudo tee /etc/apt/sources.list.d/mongodb.list
3) Install the package
sudo apt-get update
sudo apt-get install mongodb-10gen
4) Create mongod.service
sudo vim /lib/systemd/system/mongod.service
▼Contents of mongod.service
[Unit]
Description=MongoDB Database Service
Wants=network.target
After=network.target
[Service]
ExecStart=/usr/bin/mongod --config /etc/mongod.conf
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
User=mongodb
Group=mongodb
StandardOutput=syslog
StandardError=syslog
[Install]
WantedBy=multi-user.target
(Reference URL) http://qiita.com/pelican/items/bb9b5290bb73acedc282
Install the pymongo package for operating MongoDB from python, then start mongod.
pip3 install pymongo
sudo systemctl start mongod
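Before wiring MongoDB into the scraper, it is worth a quick check that pymongo can actually reach the local mongod. A minimal sketch (the qiita database and the test collection here are just placeholder names):

from pymongo import MongoClient

#Connect to the mongod started above
client = MongoClient('localhost', 27017)
db = client["qiita"]

#Insert a dummy document and read it back to confirm the connection works
db["test"].insert_one({"check": "ok"})
print(db["test"].find_one({"check": "ok"}))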
Now let's write a simple fixed-point observation program that combines Part 1, Part 2, and this post. As in Part 2, the data I'm actually observing can't be published, so this time the program will automatically fetch the posts in the top feed of Qiita.
Here is a summary of what the program does.
(1) Launch the PhantomJS browser
(2) Log in to Qiita automatically and fetch the 20 post names at the top of the feed
(3) Store the post names obtained in (2) in a list and write them to MongoDB
The actual program code is as follows.
import time
import datetime
from selenium import webdriver
from bs4 import BeautifulSoup
from pymongo import MongoClient
URL = "https://qiita.com/"
USERID = "<YOUR_USER_ID>"
PASS = "<YOUR_PASSWORD>"
#Automatic startup of PhantomJS and access to Qiita
browser = webdriver.PhantomJS(executable_path='<path/to/phantomjs>')
browser.get(URL)
time.sleep(3)
#Login page
browser.find_element_by_id("identity").send_keys(USERID)
browser.find_element_by_id("password").send_keys(PASS)
browser.find_element_by_xpath('//input[@name="commit"]').click()
time.sleep(5)
#Get a list of posts on the home screen
html = browser.page_source.encode('utf-8')
soup = BeautifulSoup(html, "lxml")
posts_source = soup.select(".item-box-title > h1 > a")
#Organize the post names into a list
posts = []
for post in posts_source:
    posts.append(post.text.strip())
#Get the time of fixed point observation
output = {}
output["date"] = str(datetime.date.today())
output["datetime"] = str(datetime.datetime.today().strftime("%H:%M:%S"))
output["content"] = posts
#Store in MongoDB
mongo = MongoClient('localhost:27017')
db = mongo["qiita"]
new_posts = db["new_posts"]
new_posts.insert_one(output)
#Close browser
browser.close()
That's all there is to it. Run this program regularly with cron, and the 20 latest post names in the logged-in Qiita feed are recorded each time. (It's only a test program written for this post, so treat it as a toy example ^^;)
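To check what cron has been accumulating, you can read the documents back with pymongo. A minimal sketch, assuming the qiita database and new_posts collection used in the program above:

from pymongo import MongoClient

client = MongoClient('localhost:27017')
new_posts = client["qiita"]["new_posts"]

#Print each observation's date, time, and the number of post names captured
for doc in new_posts.find().sort("date", -1):
    print(doc["date"], doc["datetime"], len(doc["content"]))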
By adapting this program, you can do fixed-point observation of data on all kinds of web pages, whether or not they need GET/POST interaction such as logging in.