・ Because I suffer from tinnitus myself, I wanted a bot that regularly tweets information about "tinnitus".
・ Mac
・ Python3
Specifically, I created an application that does the following two things.
[1] Scrape the search results for "tinnitus" and "dizziness" from Yahoo News and tweet them regularly
[2] Regularly retweet (and like) tweets containing phrases such as "improvement of tinnitus" and "cause of tinnitus"
Create a directory called miminari on your desktop and create scraping.py inside it. Then build and activate a virtual environment as follows.
python3 -m venv .
source bin/activate
Install the required modules.
pip install requests
pip install beautifulsoup4
pip install lxml
Directory structure
miminari
├scraping.py
├date_list.txt
├source_list.txt
├text_list.txt
├title_list.txt
├url_list.txt
├twitter.py
├Procfile
├requirements.txt
└runtime.txt
Search for "tinnitus" and "dizziness" from Yahoo News and copy the url. The site shows 10 news items. Find a likely location for the title and URL. If you look at the "verification" of Google Chrome, you can see that it is in class = t of the h2 tag. Based on this, I will write the code.
.py:scraping.py
from bs4 import BeautifulSoup
import lxml
import requests
URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b=1"
res = requests.get(URL)
res.encoding = res.apparent_encoding  # decode with the encoding guessed from the content to avoid garbled text
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")
# The title and URL of each article are inside <h2 class="t"> tags
_list = soup.find_all("h2",class_="t")
print(_list)
Running this prints a list like the following.
[<h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200320-00000019-nkgendai-hlth">Misono is also fighting against Meniere's disease No radical cure has been found, but how do you deal with it?</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200316-00010000-flash-ent">Shoko Aida, the past suffering from sudden hearing loss and Meniere's disease<em>Tinnitus</em>But…"</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200315-00000004-nikkeisty-hlth">Gluten upset, treatment is dangerous without guidance</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200313-00000243-spnannex-ent">Shoko Aida confesses her illness to retire from the entertainment world for the first time. Thanks to the doctor's "mental care"</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200310-00000011-pseven-life">"Hearing loss" is a risk factor for dementia Depression risk 2.Data with 4 times</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200309-00010009-nishispo-spo">Olympic representative Ono's classmate overcomes illness and goes to the big stage 81 kg class indiscriminate challenge</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200226-00010011-newsweek-int">Iran's mysterious shock wave that hit the U.S. military takes several years to unravel</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200223-00986372-jspa-life">Chronic condition and fertility make me sick ... Tears at the words my husband gave to Alafor's wife</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/hl?a=20200215-00001569-fujinjp-life">Insufficient thermal energy? Blood circulation stagnation? Know the type of "cold" and become a body that can withstand the cold</a></h2>, <h2 class="t"><a href="https://headlines.yahoo.co.jp/article?a=20200214-00010009-jisin-soci">Recommended by a doctor! Insomnia, menstrual cramps, headaches ... "Normal heat 36."5 degrees" prevents upset</a></h2>]
・ encoding is the character encoding of the response returned from the server; the content is decoded according to it.
・ apparent_encoding is the encoding guessed from the content itself; assigning it to encoding lets you read the content without garbled characters.
・ lxml is one of the HTML parsers that parse HTML, determine tags, and expose the result as a data structure. Normally html.parser is used, but this time we use lxml because it is faster. lxml must be installed and imported separately.
・ From the Chrome developer tools we saw that the title and URL are inside h2 tags with class="t", so find_all("h2", class_="t") was used.
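As a minimal sketch of why apparent_encoding matters (the Yahoo News top page is used here purely as an example):
import requests
res = requests.get("https://news.yahoo.co.jp/")
print(res.encoding)           # encoding declared by the server (may be missing or wrong)
print(res.apparent_encoding)  # encoding guessed from the response body itself
res.encoding = res.apparent_encoding  # decode the body with the guessed encoding
print(res.text[:100])         # the text should now be free of garbled characters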
.py:scraping.py
from bs4 import BeautifulSoup
import lxml
import requests
URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b=1"
res = requests.get(URL)
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")
#Get title and url------------------------------
_list = soup.find_all("h2",class_="t")
title_list = []
url_list = []
for i in _list:
    a_tag = i.find_all('a')
    for _tag in a_tag:
        # get_text() extracts the string enclosed in the tag (the title)
        href_text = _tag.get_text()
        # Build a list of the extracted titles
        title_list.append(href_text)
        # get("href") extracts the URL held in the tag's href attribute
        url_text = _tag.get("href")
        # Build a list of the extracted URLs
        url_list.append(url_text)
# Save in text format
with open('title_data'+'.txt','a',encoding='utf-8') as f:
    for i in title_list:
        f.write(i + '\n')
with open('url_data'+'.txt','a',encoding='utf-8') as f:
    for i in url_list:
        f.write(i + '\n')
・ get_text() extracts the character string enclosed in a tag.
・ get("href") gets the value of the href attribute.
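A minimal, self-contained illustration of these two calls (the HTML string here is made up for the example):
from bs4 import BeautifulSoup
html = '<h2 class="t"><a href="https://example.com/news1">Sample title</a></h2>'
soup = BeautifulSoup(html, "lxml")
a_tag = soup.find("a")
print(a_tag.get_text())   # -> Sample title
print(a_tag.get("href"))  # -> https://example.com/news1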
Scrape the news summary, date and time, and source in the same way, and save each one to its own file.
.py:scraping.py
from bs4 import BeautifulSoup
import lxml
import requests
URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b=1"
res = requests.get(URL)
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")
#Get title and url------------------------------
_list = soup.find_all("h2",class_="t")
title_list = []
url_list = []
for i in _list:
    a_tag = i.find_all('a')
    for _tag in a_tag:
        # get_text() extracts the string enclosed in the tag (the title)
        href_text = _tag.get_text()
        # Build a list of the extracted titles
        title_list.append(href_text)
        # get("href") extracts the URL held in the tag's href attribute
        url_text = _tag.get("href")
        # Build a list of the extracted URLs
        url_list.append(url_text)
with open('title_data'+'.txt','a',encoding='utf-8') as f:
    for i in title_list:
        f.write(i + '\n')
with open('url_data'+'.txt','a',encoding='utf-8') as f:
    for i in url_list:
        f.write(i + '\n')
#Get the summary text-----------------------------------------
_list2 = soup.find_all("p",class_="a")
text_list = []
for i in _list2:
    text_text = i.get_text()
    text_list.append(text_text)
with open('text_list'+'.txt','a',encoding='utf-8') as f:
    for i in text_list:
        f.write(i + '\n')
#Get the date and time---------------------------------------------------------------
_list3 = soup.find_all("span",class_="d")
date_list = []
for i in _list3:
    _date_text = i.get_text()
    _date_text = _date_text.replace('\xa0','')
    date_list.append(_date_text)
with open('date_list'+'.txt','a',encoding='utf-8') as f:
    for i in date_list:
        f.write(i + '\n')
#Get the source---------------------------------------------------------------
_list4 = soup.find_all("span",class_="ct1")
source_list = []
for i in _list4:
    _source_text = i.get_text()
    source_list.append(_source_text)
with open('source_list'+'.txt','a',encoding='utf-8') as f:
    for i in source_list:
        f.write(i + '\n')
・ When the date and time are extracted as-is, an extra non-breaking space comes along with them: "&nbsp;" in the HTML shows up as '\xa0' in the scraped text. It is removed with replace('\xa0','').
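A quick check of that replace, using a made-up string:
s = '2020/3/20(Fri)\xa018:45'
print(s.replace('\xa0', ''))  # -> 2020/3/20(Fri)18:45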
As it stands, only the 10 news items on a single page are scraped, so loop over 4 pages (the search results for "tinnitus" and "dizziness" only ran to 4 pages). Modify the code as follows.
.py:scraping.py
from bs4 import BeautifulSoup
import lxml
import requests
mm = 0
for i in range(4):
URL = "https://news.yahoo.co.jp/search/?p=%E8%80%B3%E9%B3%B4%E3%82%8A+%E3%82%81%E3%81%BE%E3%81%84&oq=&ei=UTF-8&b={}".format(mm*10 + 1)
res = requests.get(URL)
html_doc = res.text
soup = BeautifulSoup(html_doc,"lxml")
#Get title and url------------------------------
_list = soup.find_all("h2",class_="t")
title_list = []
url_list = []
for i in _list:
a_tag = i.find_all('a')
for _tag in a_tag:
#Extract title, get_text()Extracts the string enclosed in tags
href_text = _tag.get_text()
#Create a list with extracted titles
title_list.append(href_text)
#get("href")Extracts urls enclosed in tags
url_text = _tag.get("href")
#Create a list with extracted titles
url_list.append(url_text)
with open('title_list'+'.txt','a',encoding='utf-8') as f:
for i in title_list:
f.write(i + '\n')
with open('url_list'+'.txt','a',encoding='utf-8') as f:
for i in url_list:
f.write(i + '\n')
#Get text-----------------------------------------
_list2 = soup.find_all("p",class_="a")
text_list = []
for i in _list2:
text_text = i.get_text()
text_list.append(text_text)
with open('text_list'+'.txt','a',encoding='utf-8')as f:
for i in text_list:
f.write(i + '\n')
#Date and time,---------------------------------------------------------------
_list3 = soup.find_all("span",class_="d")
date_list = []
for i in _list3:
_date_text = i.get_text()
_date_text = _date_text.replace('\xa0','')
date_list.append(_date_text)
with open('date_list'+'.txt','a',encoding='utf-8') as f:
for i in date_list:
f.write(i + '\n')
#Source---------------------------------------------------------------
_list4 = soup.find_all("span",class_="ct1")
source_list = []
for i in _list4:
_source_text = i.get_text()
source_list.append(_source_text)
with open('source_list'+'.txt','a',encoding='utf-8') as f:
for i in source_list:
f.write(i + '\n')
#mm-------------------------------------------------------------------
mm += 1
The following parts have been added. The b parameter at the end of the URL is 1, 11, 21, 31 on successive pages, so the pages are looped over with a for statement and format().
mm = 0
for i in range(4): 〜〜〜〜
〜〜〜〜 q=&ei=UTF-8&b={}".format(mm*10 + 1)
〜〜〜〜
mm += 1
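A quick check of the page parameter that format() generates (the fixed part of the URL is abbreviated here):
mm = 0
for i in range(4):
    print("...&ei=UTF-8&b={}".format(mm*10 + 1))
    mm += 1
# ...&ei=UTF-8&b=1
# ...&ei=UTF-8&b=11
# ...&ei=UTF-8&b=21
# ...&ei=UTF-8&b=31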
Up to this point, scraping has produced the title (title_list), URL (url_list), summary (text_list), date and time (date_list), and source (source_list) of each news item. The next step is posting to Twitter, for which only the date and time (date_list), source (source_list), and URL (url_list) are used.
This time I will omit the details of creating a Twitter bot. When creating the bot, I referred to the following articles for registering with the Twitter API and tweeting from Python:
・ Summary of steps from Twitter API registration (account application method) to approval
・ Post to Twitter with Tweepy
・ Search, like, and retweet on Twitter with Tweepy
Create twitter.py in the directory miminari and install Tweepy.
pip install tweepy
Create twitter.py as follows.
.py:twitter.py
import tweepy
from random import randint
import os
auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"],os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"],os.environ["ACCESS_TOKEN_SECERET"])
api = tweepy.API(auth)
twitter_source =[]
twitter_url = []
twitter_date = []
with open('source_list.txt','r') as f:
    for i in f:
        twitter_source.append(i.rstrip('\n'))
with open('url_list.txt','r') as f:
    for i in f:
        twitter_url.append(i.rstrip('\n'))
with open('date_list.txt','r') as f:
    for i in f:
        twitter_date.append(i.rstrip('\n'))
#Randomly extract articles from the 0th to n-1st range of the list with the randint and len functions
i = randint(0,len(twitter_source)-1)
api.update_status("<News related to tinnitus>" + '\n' + twitter_date[i] + twitter_source[i] + twitter_url[i])
・ CONSUMER_KEY and the other credentials are read from environment variables, with deployment to Heroku in mind.
・ An article is now picked and tweeted at random.
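On Heroku these variables are set with heroku config:set (for example heroku config:set CONSUMER_KEY=xxxx). Since os.environ[...] raises a KeyError when a variable is missing, a quick check like the following sketch can make configuration mistakes easier to spot (the variable names match the ones used above):
import os
for name in ["CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECERET"]:
    if name not in os.environ:
        print("missing environment variable:", name)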
Next, add the retweet and like processing to twitter.py.
.py:twitter.py
import tweepy
from random import randint
import os
#auth = tweepy.OAuthHandler(config.CONSUMER_KEY,config.CONSUMER_SECRET)
#auth.set_access_token(config.ACCESS_TOKEN,config.ACCESS_TOKEN_SECERET)
auth = tweepy.OAuthHandler(os.environ["CONSUMER_KEY"],os.environ["CONSUMER_SECRET"])
auth.set_access_token(os.environ["ACCESS_TOKEN"],os.environ["ACCESS_TOKEN_SECERET"])
api = tweepy.API(auth)
#-Yahoo_news (tinnitus, dizziness) Tweet processing----------------------------------------------
twitter_source =[]
twitter_url = []
twitter_date = []
with open('source_list.txt','r') as f:
    for i in f:
        twitter_source.append(i.rstrip('\n'))
with open('url_list.txt','r') as f:
    for i in f:
        twitter_url.append(i.rstrip('\n'))
with open('date_list.txt','r') as f:
    for i in f:
        twitter_date.append(i.rstrip('\n'))
#Randomly extract articles from the 0th to n-1st range of the list with the randint and len functions
i = randint(0,len(twitter_source)-1)
api.update_status("<News related to tinnitus>" + '\n' + twitter_date[i] + twitter_source[i] + twitter_url[i])
#-(The following is added) Retweet processing----------------------------------------------------------------------
search_results_1 = api.search(q="Improvement of tinnitus", count=10)
search_results_2 = api.search(q="Tinnitus is terrible", count=10)
search_results_3 = api.search(q="Tinnitus", count=10)
search_results_4 = api.search(q="Tinnitus medicine", count=10)
search_results_5 = api.search(q="What is tinnitus?", count=10)
search_results_6 = api.search(q="Cause of tinnitus", count=10)
search_results_7 = api.search(q="Tinnitus Chinese medicine", count=10)
search_results_8 = api.search(q="Tinnitus acupoints", count=10)
search_results_9 = api.search(q="Tinnitus headache", count=10)
search_results_10 = api.search(q="#Tinnitus", count=10)
search_results_11 = api.search(q="Tinnitus", count=10)
the_list = [search_results_1,
search_results_2,
search_results_3,
search_results_4,
search_results_5,
search_results_6,
search_results_7,
search_results_8,
search_results_9,
search_results_10,
search_results_11
]
for results in the_list:
    for result in results:
        tweet_id = result.id
        # Handle exceptions: retweeting or liking the same tweet twice raises an error,
        # and the exception handling keeps the program from stopping partway through.
        try:
            api.retweet(tweet_id)          # Retweet
            api.create_favorite(tweet_id)  # Like
        except Exception as e:
            print(e)
Create the Procfile, runtime.txt, and requirements.txt required for deployment. Create runtime.txt after checking your own Python version.
.txt:runtime.txt
python-3.8.0
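If you are not sure which version you are running, one quick way to check it from Python itself:
import platform
print(platform.python_version())  # e.g. 3.8.0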
The Procfile contains the following.
Procfile
web: python twitter.py
Generate requirements.txt by entering the following in the terminal.
pip freeze > requirements.txt
Next, deploy as follows: initialize git, associate the Heroku app with git, add and commit with the message the-first, and finally push to Heroku.
git init
heroku git:remote -a testlinebot0319
git add .
git commit -m'the-first'
git push heroku master
If you run the app once from the terminal before setting up regular execution and the post appears on Twitter, the deployment has succeeded (presumably a one-off run such as the following).
heroku run python twitter.py
Execute the following in the terminal to set the regular execution of Heroku directly on the browser.
heroku addons:add scheduler:standard
heroku addons:open scheduler
Once the scheduler is configured as above, the bot is complete (the setting above tweets every 10 minutes).
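In the Heroku Scheduler screen that opens in the browser, you add a job consisting of a command and a frequency; for this bot that presumably means the same command as in the Procfile, run every 10 minutes:
python twitter.py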
If you like, please follow the bot on Twitter: @MiminariBot