I tried scraping the ads off a pirated manga site

Background

A friend who reads manga on a pirated manga site called manga1001.com told me, "It's full of explicit ads that I can't look at in public, and I get a warning when I use Adblock." So I thought: let's get rid of them!

Caution

This article is not intended to encourage the use of pirated manga sites.
I trust that Qiita users, with their high internet literacy, do not use pirated manga sites.
I did this purely out of technical interest, and I have no intention of using pirated manga sites in the future.
Under the law in force at the time of writing (2020/01), merely browsing a pirated manga site is not in itself illegal.

Also, if you try something similar to what this article does, please take care of the following:

Do not overload the server by scraping many pages.
If you download and save images, do not pass them on to third parties (the scraping in this article does not download any images).

Otherwise, you may be breaking the law.
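
As a rough illustration of the first point above, a throttle that enforces a minimum interval between page loads might look like the following. This is only a sketch: the `Throttle` class and its `min_interval` parameter are hypothetical names introduced here, and they are not part of the script in this article, which loads a single page.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        # min_interval: seconds that must pass between two calls (assumed value)
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call; None before the first

    def wait(self):
        """Sleep just long enough that calls are min_interval seconds apart."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling `wait()` on a single shared `Throttle` instance before every page load would keep requests at least `min_interval` seconds apart. The clock and sleep functions are injectable only so that the behavior can be checked without actually waiting.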

Method

Enter the URL of any page on manga1001.com
Open it in Chrome
Get the `src` of each `img` element
Create an HTML file
Write each `src` obtained into the file as an `img` element
Open the HTML file

Source code

I'm using Chrome Canary so that nothing important is affected if something breaks.

from time import sleep

import chromedriver_binary  # Adds the bundled chromedriver to PATH
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Directory where the generated HTML file is written
output_path = '/Users/hoge/fuga/'

# WebDriver options
options = Options()
# Path to the Google Chrome Canary binary
options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'
# Window size
options.add_argument('--window-size=1600,900')

# Ask for the URL of the page to remove ads from
url = input('enter url: ')

# Launch Chrome and open the page
driver = webdriver.Chrome(options=options)
driver.get(url)

# Wait a moment for the page's JavaScript to run
sleep(3)

# Get the title
title = driver.find_element(By.CLASS_NAME, 'entry-title').text

# Get the img elements as a list of WebElements
contents = driver.find_elements(By.CSS_SELECTOR, '.entry-content figure img')

# Build the HTML to be written out in the variable output
output = '''
<!DOCTYPE html>
<html>
<head>
<style>
body{
  background-color:#333;
}
img{
  display: block;
  margin: 10px auto;
  width: 100%;
  max-width: 600px;
  box-shadow: 0 0 10px black;
}
</style>
</head>
<body>
'''

# Append the src attribute of each img element to output as an image
for content in contents:
    output += '<img src="{}"/>'.format(content.get_attribute('src'))

# Append the closing tags
output += '</body></html>'

# Write output to an HTML file named after the title
# ('/' is replaced because it cannot appear in a file name)
safe_title = title.replace('/', '_')
html_path = '{0}{1}.html'.format(output_path, safe_title)
with open(html_path, 'w', encoding='utf-8') as f:
    f.write(output)

# Open the generated HTML file in the browser
driver.get('file://' + html_path)
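
The HTML-building step can also be pulled out into a small function that is easy to check without a browser. The sketch below is one possible refactoring, not the original code: `build_page` is a hypothetical helper name, the stylesheet is left out for brevity, and escaping the `src` values with `html.escape` is an added precaution that the script above does not take.

```python
import html


def build_page(srcs, title=''):
    """Return a minimal HTML page that shows each URL in srcs as an <img>."""
    # Escape every URL so characters such as & or " cannot break the markup
    imgs = '\n'.join(
        '<img src="{}"/>'.format(html.escape(src, quote=True)) for src in srcs
    )
    return (
        '<!DOCTYPE html>\n<html>\n<head>\n'
        '<meta charset="utf-8">\n'
        '<title>{}</title>\n'
        '</head>\n<body>\n{}\n</body>\n</html>\n'
    ).format(html.escape(title), imgs)
```

With this in place, the loop in the script could be replaced by something like `output = build_page([c.get_attribute('src') for c in contents], title)`, assuming the same `contents` and `title` variables.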

Result

manga1001.com_scraping.gif

Impressions

I was able to neatly scrape the content out of a cluttered site. To repeat, I am not going to use this myself, and I have not given the program to my friend. I just wanted to try scraping! lol
