Hello, this is @tfujitani, a Sakamichi-idol nerd. I wrote a Python script that fully automatically saves the blog images of a specified Nogizaka46 member, so I'm publishing it here. Incidentally, the "trigger" (kikkake) for writing this program was the graduation of Sayuri Inoue, whom I supported. (Kikkake is a good song, isn't it?) Here is the program I made for that purpose.
Environment:
・Python 3
・Beautiful Soup
pip install requests
pip install beautifulsoup4
I'm scraping with Python 3 and Beautiful Soup. Specify the member you want by the ID that appears in their blog URL: for Manatsu Akimoto (http://blog.nogizaka46.com/manatsu.akimoto/) it is "manatsu.akimoto", and for Riria Ito (http://blog.nogizaka46.com/riria.itou/) it is "riria.itou". You can also specify the start and end of the period you want to save.
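For reference, the monthly archive pages that the script walks through are addressed by a `?d=` query parameter holding the year and the zero-padded month (YYYYMM), as seen in the script below. A minimal sketch of building such a URL:

```python
# Build the monthly archive URL for a member's blog.
# The ?d= parameter is the year followed by the zero-padded month (YYYYMM).
domain = "http://blog.nogizaka46.com/"
member = "manatsu.akimoto"

def archive_url(year, month):
    return domain + member + "/?d=" + str(year) + str(month).zfill(2)

print(archive_url(2013, 1))
# http://blog.nogizaka46.com/manatsu.akimoto/?d=201301
```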
nogiblog.py
# coding:utf-8
from time import sleep
import os
import sys
import urllib.request

import requests
from bs4 import BeautifulSoup

domain = "http://blog.nogizaka46.com/"
member = "manatsu.akimoto"  # member designation
url = domain + member + "/"


def getImages(soup, cnt):
    """Save every image found in the entry bodies of one page."""
    member_path = "./" + member
    os.makedirs(member_path, exist_ok=True)
    for entry in soup.find_all("div", class_="entrybody"):  # get all entry bodies
        for img in entry.find_all("img"):  # get all img tags
            cnt += 1
            imgurl = img.attrs["src"]
            imgurlnon = imgurl.replace('https', 'http')
            try:
                urllib.request.urlretrieve(
                    imgurlnon,
                    member_path + "/" + str(year) + str(month).zfill(2)
                    + "-" + str(cnt) + ".jpeg")
            except OSError:
                print("error", imgurlnon)
    return cnt


if __name__ == "__main__":
    # the beginning of the period to save
    year = 2012
    month = 12
    # the end of the period to save
    endyear = 2020
    endmonth = 6
    headers = {"User-Agent": "Mozilla/5.0"}
    while True:
        BlogPageURL = url + "?d=" + str(year) + str(month).zfill(2)
        soup = BeautifulSoup(requests.get(BlogPageURL, headers=headers).content,
                             'html.parser')  # get the HTML
        print(year, month)
        sleep(3)
        cnt = 0
        ht = soup.find_all("div", class_="paginate")
        print("ht", ht)
        cnt = getImages(soup, cnt)  # save the images on the first page
        if len(ht) > 0:  # if the month has multiple pages, save those pages too
            url_all = ht[0].find_all("a")
            for i, hturl in enumerate(url_all):
                if (i + 1) == len(url_all):  # the last link is ">", skip it
                    break
                link = hturl.get("href")
                print("url", url + link)
                soup = BeautifulSoup(requests.get(url + link, headers=headers).content,
                                     'html.parser')
                sleep(3)
                cnt = getImages(soup, cnt)  # save the images on this page
        if year == endyear and month == endmonth:
            print("Finish")
            sys.exit()  # the end of the program
        if month == 12:
            month = 1
            year = year + 1
        else:
            month = month + 1
        print("update", year, month)
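The core of getImages, pulling every img src out of the entrybody divs, can be tried in isolation. The HTML fragment below is made up to mimic the blog's markup, purely for illustration:

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the blog's markup: entry bodies containing images.
html = '''
<div class="entrybody"><p>hello</p><img src="https://img.example/a.jpeg"></div>
<div class="entrybody"><img src="https://img.example/b.jpeg"></div>
'''
soup = BeautifulSoup(html, 'html.parser')

srcs = []
for entry in soup.find_all("div", class_="entrybody"):
    for img in entry.find_all("img"):
        # the same https -> http rewrite the script applies before downloading
        srcs.append(img.attrs["src"].replace('https', 'http'))

print(srcs)
# ['http://img.example/a.jpeg', 'http://img.example/b.jpeg']
```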
By the way, the handling of a month whose blog has multiple pages works like this: in the example of Manatsu Akimoto's blog for January 2013, after saving the images on the first page, the script follows the links to pages 2, 3, and 4 and saves the images on each of those pages as well.
When I ran it on Manatsu Akimoto's blog, I confirmed that the images were saved in the expected form.
By the way, since I thought the ht variable in the program above was hard to understand, here is the execution result of that part. It's a little confusing, but each monthly page is scraped like this.
ht
[<div class="paginate"> 1 | <a href="?p=2&d=201301"> 2 </a> | <a href="?p=3&d=201301"> 3 </a> | <a href="?p=4&d=201301"> 4 </a> | <a href="?p=2&d=201301">></a></div>, <div class="paginate"> 1 | <a href="?p=2&d=201301"> 2 </a> | <a href="?p=3&d=201301"> 3 </a> | <a href="?p=4&d=201301"> 4 </a> | <a href="?p=2&d=201301">></a></div>]
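From that ht output, the script takes the first paginate div, collects its links, and skips the final ">" link, which just points back to page 2. A sketch of that step, with the paginate div hard-coded for illustration:

```python
from bs4 import BeautifulSoup

# The first "paginate" div from the ht output, hard-coded for illustration.
html = ('<div class="paginate"> 1 | <a href="?p=2&d=201301"> 2 </a> | '
        '<a href="?p=3&d=201301"> 3 </a> | <a href="?p=4&d=201301"> 4 </a> | '
        '<a href="?p=2&d=201301">&gt;</a></div>')
soup = BeautifulSoup(html, 'html.parser')

ht = soup.find_all("div", class_="paginate")
url_all = ht[0].find_all("a")
links = []
for i, a in enumerate(url_all):
    if (i + 1) == len(url_all):  # the last link is ">", which duplicates page 2
        break
    links.append(a.get("href"))

print(links)
# ['?p=2&d=201301', '?p=3&d=201301', '?p=4&d=201301']
```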
After that, you can see that once the first page has been scraped, pages 2 through 4 are scraped in turn, as below.
url http://blog.nogizaka46.com/manatsu.akimoto/?p=2&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=3&d=201301
url http://blog.nogizaka46.com/manatsu.akimoto/?p=4&d=201301
As an aside, when I saved Sayuri Inoue's blog, there were a few thousand images (2,385 after deleting the unnecessary ones). You can see what a hard worker Sayu was.
The article at https://qiita.com/xxPowderxx/items/e9726b8b8a114655d796 was insanely helpful.