Hello, this is @tfujitani, a Sakamichi-idol nerd. I wrote a Python script that fully automatically saves the blog images of a specified Nogizaka46 member, so I'm publishing it here. Incidentally, the "trigger" (kikkake) for writing this program was the graduation of Sayuri Inoue, whom I supported. (Kikkake is a good song, isn't it?) Here is the program I made for that purpose.
Environment:
・Python 3
・Beautiful Soup
pip install requests
pip install beautifulsoup4
I'm scraping with Python 3 and Beautiful Soup. Specify the member you want by the ID that appears in their blog URL: for Manatsu Akimoto (http://blog.nogizaka46.com/manatsu.akimoto/) it is "manatsu.akimoto", and for Riria Ito (http://blog.nogizaka46.com/riria.itou/) it is "riria.itou". You can also specify the start and end of the period you want to save.
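For reference, the monthly archive pages that the script walks through are addressed by a `?d=` query parameter holding the year and the zero-padded month (YYYYMM), as seen in the script below. A minimal sketch of building such a URL:

```python
# Build the monthly archive URL for a member's blog.
# The ?d= parameter is the year followed by the zero-padded month (YYYYMM).
domain = "http://blog.nogizaka46.com/"
member = "manatsu.akimoto"

def archive_url(year, month):
    return domain + member + "/?d=" + str(year) + str(month).zfill(2)

print(archive_url(2013, 1))
# http://blog.nogizaka46.com/manatsu.akimoto/?d=201301
```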
nogiblog.py
# coding:utf-8
from time import sleep
import os
import sys
import urllib.request

import requests
from bs4 import BeautifulSoup

domain = "http://blog.nogizaka46.com/"
member = "manatsu.akimoto"  # member designation
url = domain + member + "/"


def getImages(soup, cnt):
    """Save every image found in the entry bodies of one page."""
    member_path = "./" + member
    os.makedirs(member_path, exist_ok=True)
    for entry in soup.find_all("div", class_="entrybody"):  # get all entry bodies
        for img in entry.find_all("img"):  # get all img tags
            cnt += 1
            imgurl = img.attrs["src"]
            imgurlnon = imgurl.replace('https', 'http')
            try:
                urllib.request.urlretrieve(
                    imgurlnon,
                    member_path + "/" + str(year) + str(month).zfill(2)
                    + "-" + str(cnt) + ".jpeg")
            except OSError:
                print("error", imgurlnon)
    return cnt


if __name__ == "__main__":
    # the beginning of the period to save
    year = 2012
    month = 12
    # the end of the period to save
    endyear = 2020
    endmonth = 6
    headers = {"User-Agent": "Mozilla/5.0"}
    while True:
        BlogPageURL = url + "?d=" + str(year) + str(month).zfill(2)
        soup = BeautifulSoup(requests.get(BlogPageURL, headers=headers).content,
                             'html.parser')  # get the HTML
        print(year, month)
        sleep(3)
        cnt = 0
        ht = soup.find_all("div", class_="paginate")
        print("ht", ht)
        cnt = getImages(soup, cnt)  # save the images on the first page
        if len(ht) > 0:  # if the month has multiple pages, save those pages too
            url_all = ht[0].find_all("a")
            for i, hturl in enumerate(url_all):
                if (i + 1) == len(url_all):  # the last link is ">", skip it
                    break
                link = hturl.get("href")
                print("url", url + link)
                soup = BeautifulSoup(requests.get(url + link, headers=headers).content,
                                     'html.parser')
                sleep(3)
                cnt = getImages(soup, cnt)  # save the images on this page
        if year == endyear and month == endmonth:
            print("Finish")
            sys.exit()  # the end of the program
        if month == 12:
            month = 1
            year = year + 1
        else:
            month = month + 1
        print("update", year, month)
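The core of getImages, pulling every img src out of the entrybody divs, can be tried in isolation. The HTML fragment below is made up to mimic the blog's markup, purely for illustration:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the blog's markup: entry bodies containing images.
html = '''
<div class="entrybody"><p>hello</p><img src="https://img.example/a.jpeg"></div>
<div class="entrybody"><img src="https://img.example/b.jpeg"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

srcs = []
for entry in soup.find_all("div", class_="entrybody"):
    for img in entry.find_all("img"):
        # the same https -> http rewrite the script applies before downloading
        srcs.append(img.attrs["src"].replace('https', 'http'))

print(srcs)
# ['http://img.example/a.jpeg', 'http://img.example/b.jpeg']
```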
By the way, the handling of a month whose blog has multiple pages works like this: in the example of Manatsu Akimoto's blog for January 2013, after saving the images on the first page, the script follows the links to pages 2, 3, and 4 and saves the images on each of those pages as well.
When I ran it on Manatsu Akimoto's blog, I confirmed that the images were saved in the expected form.
By the way, since I thought the ht variable in the program above was hard to understand, here is the execution result of that part. It's a little confusing, but each monthly page is scraped like this.
ht
[<div class="paginate"> 1 | <a href="?p=2&d=201301"> 2 </a> | <a href="?p=3&d=201301"> 3 </a> | <a href="?p=4&d=201301"> 4 </a> | <a href="?p=2&d=201301">></a></div>, <div class="paginate"> 1 | <a href="?p=2&d=201301"> 2 </a> | <a href="?p=3&d=201301"> 3 </a> | <a href="?p=4&d=201301"> 4 </a> | <a href="?p=2&d=201301">></a></div>]
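From that ht output, the script takes the first paginate div, collects its links, and skips the final ">" link, which just points back to page 2. A sketch of that step, with the paginate div hard-coded for illustration:

```python
from bs4 import BeautifulSoup

# The first "paginate" div from the ht output, hard-coded for illustration.
html = ('<div class="paginate"> 1 | <a href="?p=2&d=201301"> 2 </a> | '
        '<a href="?p=3&d=201301"> 3 </a> | <a href="?p=4&d=201301"> 4 </a> | '
        '<a href="?p=2&d=201301">&gt;</a></div>')
soup = BeautifulSoup(html, 'html.parser')

ht = soup.find_all("div", class_="paginate")
url_all = ht[0].find_all("a")
links = []
for i, a in enumerate(url_all):
    if (i + 1) == len(url_all):  # the last link is ">", which duplicates page 2
        break
    links.append(a.get("href"))

print(links)
# ['?p=2&d=201301', '?p=3&d=201301', '?p=4&d=201301']
```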
After that, you can see that once the first page has been scraped, pages 2 through 4 are scraped in turn, as below.
url http://blog.nogizaka46.com/manatsu.akimoto/?p=2&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=3&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=4&d=201301
As an aside, when I saved Sayuri Inoue's blog, there were a few thousand images (2,385 after deleting the unnecessary ones). You can see what a hard worker Sayu was.
The article at https://qiita.com/xxPowderxx/items/e9726b8b8a114655d796 was insanely helpful.