I read web manga now and then, but there are so many that I never know which one to pick. I wondered whether comments could serve as an index for choosing what to read. Popular manga have many comments, but there are also plenty of interesting manga with only a few. So I'm planning to analyze the comments in various ways, and as a first step I turned them into word clouds: seeing the comments at a glance made it possible to judge intuitively whether a manga looked intriguing.
By helping readers pick manga from this new angle, I hope word clouds can become a gateway to the works their authors poured effort into, and contribute a little to revitalizing the manga world. Perhaps that's a bit grandiose.
- Python 3.7.6
- selenium 3.141.0
- ChromeDriver 80.0.3987.16
- wordcloud 1.6.0
- BeautifulSoup 4.8.2
- mecab-python-windows 0.996.3
I have created a site where you can see the results (linked at the end of this article). Click a word cloud to jump to that manga.
Here is an example of the output. Doesn't it make you wonder what kind of manga it is? With words like "beautiful woman" and "like" standing out, it makes me want to take a look.
Since we will be scraping, let's check the terms of service first.
Excerpt from niconico Terms of Service
**5. Prohibitions**
The following acts are prohibited regarding the use of "niconico" by users.
- Acts listed in paragraphs 3 and 4 of the Nico Nico Activity Guidelines, or acts equivalent to these (including acts performed through means other than writing comments and posting videos, etc.)
- Acts that violate the provisions of these Terms of Use
- Acts that violate the Public Offices Election Act
- **Acts that put an excessive burden on the "niconico" server**
- Acts that interfere with the operation of "niconico"
- Links to child prostitution/pornography, uncensored video download sites, etc.
- Selling, auctioning, soliciting monetary payments, and other similar acts without the permission of the operating company
- Advertising products without the permission of the operating company, publishing profile contents for promotional purposes, and other acts aimed at soliciting spam mail, chain mail, etc.
- Use of "niconico" by a minor over 13 years old without the consent of a legal representative (parental guardian)
- Acts that the operating company considers inappropriate
- Other acts similar to the above
The point to be careful about is not putting an excessive load on the server: run the script intermittently rather than continuously, and sandwich sleeps between requests.
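As a concrete image of "sandwiching sleep", something like the sketch below works; `polite_get` is a hypothetical helper of mine, not code from this article (`driver` is the WebDriver built in the next section):

import time

def polite_get(driver, url, wait=2.0):
    # Fetch a page, then pause so consecutive requests are spaced out
    # and the server is not hit continuously.
    driver.get(url)
    time.sleep(wait)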
The process is executed in the following flow: log in, get the manga list, get each manga's episodes, get the comments, analyze them, and generate the word clouds.
You need to log in to view Nico Nico Seiga, so here we use selenium to log in to niconico in the background. It is assumed that selenium and a ChromeDriver matching your Chrome version are already installed.
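If the Python packages are not installed yet, they can be added with pip (the version pins below simply mirror the environment listed above); ChromeDriver itself is a separate download that must match your Chrome version:

pip install selenium==3.141.0 beautifulsoup4==4.8.2 wordcloud==1.6.0 mecab-python-windows==0.996.3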
Import the required libraries below.
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import urllib.parse
Set the options and build the driver. The `--headless` option is specified so that the browser runs in the background. Also, `set_page_load_timeout` sets the page-load timeout to 30 seconds.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1024,768')
driver = webdriver.Chrome(options=options)
driver.set_page_load_timeout(30)
First, access https://account.nicovideo.jp/login?site=seiga&next_url=%2F. Next, find the e-mail address and password inputs by their IDs and set each value. Finally, click the login button. Replace `[mail address]` and `[password]` with your own.
driver.get('https://account.nicovideo.jp/login?site=seiga&next_url=%2F')
e = driver.find_element(By.ID, "input__mailtel")
e.send_keys('[mail address]')
e = driver.find_element(By.ID, "input__password")
e.send_keys('[password]')
e = driver.find_element(By.ID, 'login__submit')
e.click()
You can also log in by POSTing with requests, but in that case you need to extract `auth_id` from the login screen and post it as well; with selenium, none of that handling is necessary. Also, when a page is updated by JavaScript after it is displayed, requests makes that a lot of trouble, whereas selenium conveniently lets you process the page without worrying about it.
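For reference, the requests-based approach might look roughly like the sketch below. This is untested: the form field names (`mail_tel`, `password`) and the POST endpoint are assumptions, so inspect the actual login form before relying on them.

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# Fetch the login page and pull the hidden auth_id value out of the form.
res = session.get('https://account.nicovideo.jp/login?site=seiga&next_url=%2F')
soup = BeautifulSoup(res.text, 'html.parser')
auth_id = soup.find('input', {'name': 'auth_id'})['value']

# POST the credentials together with auth_id.
# Field names and endpoint are assumptions; check the real form to confirm.
session.post('https://account.nicovideo.jp/login/redirector',
             data={'mail_tel': '[mail address]',
                   'password': '[password]',
                   'auth_id': auth_id})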
Next we get the manga list from Nico Nico Seiga. While changing pages, collect the URL of each manga on the list page into a list. Considering the load, only pages 1 to 3 are fetched here.
url_root = 'https://seiga.nicovideo.jp'
desc_urls = []
for n in range(1, 4):
    target_url = urllib.parse.urljoin(url_root, 'manga/list?page=%d&sort=manga_updated' % n)
    try:
        driver.get(target_url)
        html = driver.page_source.encode('utf-8')
        soup = BeautifulSoup(html, 'html.parser')
        # Collect the link to each manga's detail page from the list.
        for desc in soup.select('.mg_description'):
            title = desc.select('.title')
            desc_urls.append(urllib.parse.urljoin(url_root, title[0].find('a').get('href')))
    except Exception as e:
        print(e)
        continue
The URLs of the manga are saved in the `desc_urls` list. The URL of each list page is set in `target_url`. The page is controlled by the number in the `page=` parameter of the query string, so set the number of the page you want to fetch there.
Each page is fetched with `driver.get`. Once fetched, the HTML is taken out with `driver.page_source.encode('utf-8')` and handed to `BeautifulSoup` for ease of handling. You could process it without `BeautifulSoup`, but I'm used to it, so I use it here. WebDriver can also use XPath, so handling the page directly would be fine too, as in the sketch below.
`BeautifulSoup`'s `select` takes a CSS selector, so we select each `.mg_description`, take the `.title` inside it, and then the `href` of the `a` tag set there. You now have a list of the URLs of the manga on the page.
Next, fetch each page whose URL is stored in `desc_urls`; the fetch is done with `driver.get(desc_url)`. Once fetched, take out the HTML in the same way and hand it to `BeautifulSoup`.
for desc_url in desc_urls:
    try:
        driver.get(desc_url)
        html = driver.page_source.encode('utf-8')
        soupdesc = BeautifulSoup(html, 'html.parser')
Get the `div` element whose id is `mg_main_column`. From the `.main_title` class element inside it, get the title and author. Try printing them to check that they are extracted properly.
maindesc = soupdesc.find('div', id='mg_main_column')
titlediv = maindesc.select('.main_title')[0]
title = titlediv.find('h1').text.strip()
author = titlediv.find('span').text.strip()
print(title)
print(author)
In the HTML structure, each episode is in an element whose class is `.episode_item`, so get the list of them with the CSS selector via `select`. Multiple elements come back, and from each we take the subtitle and the URL of its detail page.
for eps in soupdesc.select('.episode_item'):
    eps_ttl_div = eps.select('.title')
    eps_title = eps_ttl_div[0].find('a')
    eps_url = urllib.parse.urljoin(url_root, eps_title.get('href'))
    eps_t = eps_title.text
    print(eps_t)
    try:
        driver.get(eps_url)
        html = driver.page_source.encode('utf-8')
        soupeps = BeautifulSoup(html, 'html.parser')
The episode title is taken from the `.title` class, and its URL from the `href` of the `a` tag. The detail screen is fetched with `driver.get(eps_url)`; once obtained, it is handed to `BeautifulSoup`.
Get the `.comment_list` element by its class, then all the `.comment` elements inside it. The text of each is taken out with `c.text` and collected into the list `comments_text`.
The list is built with a list comprehension; Python's comprehension notation is apparently even Turing complete.
crlist = soupeps.select('.comment_list')
comments = crlist[0].select('.comment')
comments_text = [c.text for c in comments]
In the HTML of the comment section, it seems the comments can also be found via `comment_viewer`. Specify this part in whatever way suits you; one possibility is sketched below.
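For example, if `comment_viewer` is the id of the element wrapping the comment list (an assumption; it might be a class instead, so check the actual page), the same extraction could start from there:

# Assumption: comment_viewer is the id of the wrapper around the comments.
comments_text = [c.text for c in soupeps.select('#comment_viewer .comment')]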
The collected comment strings are now morphologically analyzed with MeCab. First, add the import statement.
import MeCab
Perform the morphological analysis with MeCab's `parse`.
m = MeCab.Tagger('')
parsed = m.parse('。'.join(comments_text))
The result of the morphological analysis is one morpheme per line in the following form: the surface form, a tab, then comma-separated features (part of speech, POS subcategories, conjugation type and form, base form, reading, pronunciation), with an `EOS` line at the end.
Since the output is separated line by line with `\n`, split it with `splitlines` and, from the part to the right of the `\t` on each line, take the base form of the morpheme, which is the 7th comma-separated field. In doing so, particles, auxiliary verbs, and pronouns are excluded, as well as a few strings such as する (suru) and てる (teru). If they were not excluded, they would dominate the word cloud and be drawn in huge letters.
# Note: MeCab's part-of-speech names and the comments themselves are Japanese,
# so the exclusion lists must be Japanese strings
# (助詞 = particle, 助動詞 = auxiliary verb, 代名詞 = pronoun).
words = ' '.join([x.split('\t')[1].split(',')[6] for x in parsed.splitlines()[:-1]
                  if x.split('\t')[1].split(',')[0] not in ['助詞', '助動詞']
                  and x.split('\t')[1].split(',')[1] not in ['代名詞']
                  and x.split('\t')[1].split(',')[6] not in ['する', 'てる', 'なる', 'さん', 'そう', 'これ', 'ある']])
Create the word cloud and write it out with WordCloud's `to_file`. `comic_titles`, `comic_subtitles`, `comic_images`, and `comic_urls` are lists declared beforehand and used later when creating the HTML; they hold the title, subtitle, image file name, and URL respectively (`comic_index`, a counter used for the image file names, is likewise assumed to be initialized to 0 beforehand). When constructing the `WordCloud`, the font, background color, and size are specified. The font used is "Ranobe POP", which apparently is often used on YouTube; set this part however you like. The image is written to a file with `wordcloud.to_file`. Don't forget the import: `from wordcloud import WordCloud`.
if len(words) > 0:
    try:
        comic_titles.append(title)
        comic_subtitles.append(eps_t)
        comic_images.append('%d.png' % comic_index)
        comic_urls.append(eps_url)
        wordcloud = WordCloud(font_path=r"C:\WINDOWS\Fonts\Ranobe POP.otf",
                              background_color="white", width=800, height=800).generate(words)
        wordcloud.to_file("[Path you want to save]/wordcloud/%d.png" % comic_index)
        comic_index += 1
    except Exception as e:
        print(e)
The output result is the word cloud shown at the beginning. Finally, create HTML from these lists and publish it on the site; a sketch of that step follows.
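The article doesn't show the HTML generation itself, so here is a minimal sketch of what it could look like; the page layout is purely illustrative, not the author's actual site.

# Turn the collected lists into a simple gallery page of clickable images.
items = ''.join(
    '<a href="%s"><img src="wordcloud/%s" alt="%s" title="%s %s"></a>\n'
        % (url, img, title, title, subtitle)
    for title, subtitle, img, url
    in zip(comic_titles, comic_subtitles, comic_images, comic_urls))

with open('[Path you want to save]/index.html', 'w', encoding='utf-8') as f:
    f.write('<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
            '<title>Manga word clouds</title></head>\n'
            '<body>\n%s</body></html>' % items)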
https://comic.g-at.net/
When you access the above URL, the following list of word clouds will be displayed. Click on the word cloud to open the manga.
Comments on services like Jump+ and Manga One are often quite harsh, but niconico has many gentle ones. Its users are simply used to commenting, I suppose.
It would be great if we could go beyond word clouds and analyze the comments in various other ways, opening the door to masterpieces that have been hard to encounter until now.