Web scraping with BeautifulSoup4 (layered page)

Following Web scraping with BeautifulSoup4 (serial number page), I wrote code for a layered page, so here is a note.

Point

If you build lists in the order category → page → target file and process them one line at a time, it is easy to resume even if the run is interrupted partway through.

Code

scraper.py


# -*- coding: utf-8 -*-

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

try:
    # Python 3
    from urllib import request
    from urllib.error import URLError
except ImportError:
    # Python 2
    import urllib2 as request
    from urllib2 import URLError

from bs4 import BeautifulSoup
import time, os, codecs, string, json

domain = 'http://hoge.com'
wait_sec = 3
headers = { 'User-Agent' : 'Mozilla/5.0' }
cwd = os.getcwd()
result_file = cwd + '/result_url.txt'
category_file = cwd + '/category.txt'
page_file = cwd + '/page.txt'

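# Wait wait_sec seconds, fetch url, and return the parsed BeautifulSoup tree (None on failure)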
def fetchSoup(url):
    time.sleep(wait_sec)

    req = request.Request(url, headers = headers)
    try:
        print('open {url}'.format(url = url))
        response = request.urlopen(req)
        print('ok')
        body = response.read()
        return BeautifulSoup(body, 'lxml')
    except URLError as e:
        print('error: {reason}'.format(reason = e.reason))
        return None

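# Turn a relative href/src into an absolute URL on the target domain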
def getUrl(src):
    return '{domain}{src}'.format(domain = domain, src = src)

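# Collect absolute URLs from <a href="..."> and <img src="..."> tags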
def extractUrlFromTags(tags):
    result = []
    for tag in tags:
        if tag.name == 'a':
            result.append(getUrl(tag['href']))
        elif tag.name == 'img':
            result.append(getUrl(tag['src']))
    return result

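# Append each URL in url_list to file_name, one per line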
def saveUrl(file_name, url_list):
    with codecs.open(file_name, 'a', 'utf-8') as f:
        f.write('{list}\n'.format(list = '\n'.join(url_list)))

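# Drop the first line of file_name (the URL that has just been processed)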
def deleteFirstLine(file_name):
    with codecs.open(file_name, 'r', 'utf-8') as f:
        content = f.read()
        content = content[content.find('\n') + 1:]
    with codecs.open(file_name, 'w', 'utf-8') as f:
        f.write(content)

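# Walk the paginated category index and save every category URL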
def fetchAllCategories():
    page = 1
    while True:
        url = '{domain}/category_{page}/'.format(domain = domain, page = page)
        soup = fetchSoup(url)
        categories = soup.find('div', id = 'list').find_all('a')
        url_list = extractUrlFromTags(categories)
        if len(url_list):
            saveUrl(category_file, url_list)
        page_list_last = soup.find('div', class_ = 'pagenation').find_all('a')[-1].string
        if page_list_last not in ['>', '>>']:
            break
        page += 1

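# Return the next unprocessed category URL ('' when the list is exhausted)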
def fetchCategory():
    if not os.path.exists(category_file):
        fetchAllCategories()
    with codecs.open(category_file, 'r', 'utf-8') as f:
        result = f.readline().rstrip('\n')
    return result

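# Consume the category list, saving the page URLs found in each category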
def fetchAllPages():
    category = fetchCategory()
    while category != '':
        soup = fetchSoup(category)
        pages = soup.find_all('a', class_ = 'page')
        url_list = extractUrlFromTags(pages)
        if len(url_list):
            saveUrl(page_file, url_list)
        deleteFirstLine(category_file)
        category = fetchCategory()

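# Return the next unprocessed page URL, building the page list first if needed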
def fetchPage():
    if not os.path.exists(page_file) or fetchCategory() != '':
        fetchAllPages()
    with codecs.open(page_file, 'r', 'utf-8') as f:
        result = f.readline().rstrip('\n')
    return result

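# Consume the page list and save the URL of every target image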
def fetchTargets():
    page = fetchPage()
    while page != '':
        soup = fetchSoup(page)
        targets = soup.find_all('img', class_ = 'target')
        url_list = extractUrlFromTags(targets)
        if len(url_list):
            saveUrl(result_file, url_list)
        deleteFirstLine(page_file)
        page = fetchPage()

fetchTargets()
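
Because category.txt and page.txt are consumed one line at a time with deleteFirstLine(), rerunning scraper.py after an interruption simply picks up from the first unprocessed URL.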

Convenient techniques

When names etc. are grouped alphabetically

alphabet_l = list(string.ascii_lowercase)
alphabet_u = list(string.ascii_uppercase)
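
For example, a category index that is split by first letter can be walked like this (the /category_a/-style URL pattern is a hypothetical stand-in):

for letter in alphabet_l:
    # hypothetical URL pattern for a per-letter category page
    url = '{domain}/category_{letter}/'.format(domain = domain, letter = letter)
    soup = fetchSoup(url)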

To process variables etc. extracted from a script tag

data = json.loads(json_string)
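
For example, when a page embeds its data as JSON inside a script tag (the script id and the 'title' key below are hypothetical):

soup = fetchSoup(url)
# assumes a <script id="data"> tag whose body is pure JSON
json_string = soup.find('script', id = 'data').string
data = json.loads(json_string)
print(data['title'])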

To run the script in the background on a VPS etc. so that it keeps running even after you log off

$ nohup python scraper.py < /dev/null &
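
Output from the print() calls is appended to nohup.out in the current directory, so progress can be checked there.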

To check that the process is still running

$ ps x
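
To narrow the output down to the scraper:

$ ps x | grep scraper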
