Using Beautiful Soup and MeCab, let's scrape reviews from the web, turn them into a word cloud, and visualize what people are writing. That's the content of this post.
For example, TripAdvisor's reviews can be turned into a word cloud (a visualization of frequently used words ≒ the topics that many reviewers bother to mention). It might be interesting to compare the visualizations for Tokyo Tower and Skytree, or for Tokyo Tower and Tsutenkaku... that's the idea.
The goal is to make something like this from the reviews in the red frame.
This was my first attempt at web scraping and word clouds, so let me first introduce the sites I referred to: "Scraping review sites to find out the number of words", "[For beginners] Try web scraping with Python", "Active engineers explain how to use MeCab in Python [for beginners]", and "Use wordcloud on Windows with Anaconda / Jupyter (Tips)".
First, install the required libraries. (If you already have them, skip this step.)
By the way, my environment is Windows 10 with Anaconda (Jupyter).
Beautiful Soup, requests, wordcloud
Launch Anaconda Prompt and install Beautiful Soup, requests, and wordcloud.
conda install beautifulsoup4
conda install requests
conda install -c conda-forge wordcloud
MeCab
Download and install the "Binary package for MS-Windows" from the official site. This package includes a dictionary from the start; once you get used to it, you can switch to another dictionary. You will be asked for a character code during installation — choose "UTF-8"! Everything else can be left as is.
After the installation is complete, set the environment variable:
・Search for "system details" (probably in the search box at the bottom left of the taskbar)
・Select "Environment Variables"
・Select the system environment variable "Path"
・Click "Edit", then "New"
・Enter "C:\Program Files (x86)\MeCab\bin"
・Click OK and close the dialogs
The article "Active engineers explain how to use MeCab in Python [for beginners]" also walks through this procedure, so please take a look there as well.
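To confirm the Path setting took effect, you can run a quick check from Python. This is a minimal sketch: `shutil.which` simply looks up an executable on PATH, so it returns the full path to `mecab.exe` if the environment variable is set correctly and `None` otherwise.

```python
import shutil

# Look up the mecab executable on PATH.
# Returns its full path if the environment variable is set correctly,
# or None if MeCab's bin directory is not on PATH yet.
mecab_path = shutil.which("mecab")
if mecab_path is None:
    print("mecab not found on PATH - check the environment variable")
else:
    print(f"mecab found at {mecab_path}")
```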
Now that the preparations are complete, let's write the code. First, the imports:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
Once you've done this, go to the site you want to scrape — in this example, TripAdvisor's Tokyo Tower page. Check two things: the URL, and where in the HTML the content you want (the reviews, this time) lives.
The URL you can read directly from the address bar, so I will omit the explanation. For the latter, press "F12" to launch the developer tools. A window like the one above will appear; from here you can see where and how the reviews are stored.
The check itself is simple: press "Shift + Ctrl + C", then click the review text on the page. The corresponding part is then highlighted in the developer-tools window, and you can see that the reviews are stored in `q` tags with the class "IRsGHoPm".
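To see what this selection looks like in code before hitting the real site, here is a minimal sketch of `find_all` on a hand-written HTML snippet. The class name mirrors the one found above; the review texts are made-up examples.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the page structure found with the developer tools.
html = """
<div>
  <q class="IRsGHoPm">Great view of the city!</q>
  <q class="IRsGHoPm">The observation deck was crowded.</q>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Select every q tag whose class is 'IRsGHoPm'
reviews = soup.find_all("q", class_="IRsGHoPm")
print([r.text for r in reviews])
# -> ['Great view of the city!', 'The observation deck was crowded.']
```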
After that, specify this URL and this class in the code and perform the scraping.
# Store each page's scraped reviews in df_list
df_list = []

# Scrape 20 pages
pages = range(0, 100, 5)

for page in pages:
    # The URL differs slightly between the first page and later pages, so branch with if
    if page == 0:
        urlName = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'
    else:
        urlName = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews-or' + str(page) + '-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'
    url = requests.get(urlName)
    soup = BeautifulSoup(url.content, "html.parser")
    # Select q tags with class 'IRsGHoPm' from the HTML
    review = soup.find_all('q', class_='IRsGHoPm')
    # Store the extracted reviews in order
    for i in range(len(review)):
        _df = pd.DataFrame({'Number': i + 1,
                            'review': [review[i].text]})
        df_list.append(_df)
At this point, df_list holds one small DataFrame per review, so concatenate them into a single DataFrame:
df_review = pd.concat(df_list).reset_index(drop=True)
print(df_review.shape)
df_review
Next, import MeCab, matplotlib, and WordCloud.
import MeCab
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Enter the code, referring to "Scraping review sites to find out the number of words".
# Prepare MeCab
tagger = MeCab.Tagger()
tagger.parse('')

# Concatenate all review text
all_text = ""
for s in df_review['review']:
    all_text += s

node = tagger.parseToNode(all_text)

# Extract nouns into a list
# (MeCab reports parts of speech in Japanese, so compare against '名詞' (noun))
word_list = []
while node:
    word_type = node.feature.split(',')[0]
    if word_type == '名詞':
        word_list.append(node.surface)
    node = node.next

# Join the list into a single space-separated string
word_chain = ' '.join(word_list)
All that's left is to run WordCloud.
# Create stop words (words to exclude) - empty for now
stopwords = ['']

# Create the word cloud (font_path must point to a Japanese-capable font)
W = WordCloud(width=500, height=300, background_color='lightblue',
              colormap='inferno', font_path=r'C:\Windows\Fonts\yumin.ttf',
              stopwords=set(stopwords)).generate(word_chain)

plt.figure(figsize=(15, 12))
plt.imshow(W)
plt.axis('off')
plt.show()
Then it will be created as follows.
However, "no" (の), "koto" (こと), and "tame" (ため) are unnecessary, so I will remove them. That's where the stopwords above come in.
# Create stop words (words to exclude)
# The cloud is built from Japanese text, so list the Japanese forms
stopwords = ['の', 'こと', 'ため']

# Create the word cloud
W = WordCloud(width=500, height=300, background_color='lightblue',
              colormap='inferno', font_path=r'C:\Windows\Fonts\yumin.ttf',
              stopwords=set(stopwords)).generate(word_chain)

plt.figure(figsize=(15, 12))
plt.imshow(W)
plt.axis('off')
plt.show()
Then it will be as follows.
Does this tell us something, or nothing at all? ... Whether many people are comparing it with Skytree, or many are simply saying "you can see Skytree from here", there is no doubt that "Skytree" is a topic of interest among Tokyo Tower visitors. So let's make a word cloud of Skytree's reviews as well, below.
Words such as "elevator" and "ticket", which were barely mentioned for Tokyo Tower (their letters were small), stand out here. Also, "Tokyo Tower" is not prominent. This seems to be where Tokyo Tower and Skytree differ.
end.
It may also be interesting to compare your own company with competitors on company review sites such as OpenWork. Comparing similar facilities seems to reveal things too — say, the reviews of the five major domes: Sapporo Dome, Tokyo Dome, Nagoya Dome, Osaka Dome, and Fukuoka Dome. By the way, scraping OpenWork requires setting request headers. See "[Python] What to do when scraping returns 403 Forbidden: You don't have permission to access on this server" for details.
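As a sketch of what "header settings" means here: some sites reject requests that lack a browser-like User-Agent, and passing a `headers` dict to `requests.get` is the usual fix. The User-Agent string below is only an illustrative example, not a requirement of any particular site, and the actual request is left commented out so nothing is fetched.

```python
import requests

# A browser-like User-Agent header; without one, some sites return
# 403 Forbidden. This particular string is only an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

# Pass the headers when fetching the page, e.g.:
# response = requests.get('https://example.com/some_page', headers=headers)
# soup = BeautifulSoup(response.content, 'html.parser')
print(headers['User-Agent'])
```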