Using Beautiful Soup and MeCab, let's scrape reviews from the web, turn them into a word cloud, and visualize what people are writing. That's the content of this post.
For example, TripAdvisor's reviews can be turned into a word cloud (a visualization of frequently used words ≒ the topics that many reviewers bother to mention). It might be interesting to compare the visualizations for Tokyo Tower and Skytree, or for Tokyo Tower and Tsutenkaku... that's the idea.
The goal is to make something like this from the reviews in the red frame.
This was my first attempt at web scraping and word clouds, so let me first introduce the sites I referred to: "Scraping review sites to find out the number of words", "[For beginners] Try web scraping with Python", "Active engineers explain how to use MeCab in Python [for beginners]", and "Use wordcloud on Windows with Anaconda / Jupyter (Tips)".
First, install the required libraries. (If you already have them, skip this step.)
By the way, my environment is Windows 10 with Anaconda (Jupyter).
Beautiful Soup, requests, wordcloud
Launch Anaconda Prompt and install Beautiful Soup, requests, and wordcloud.
conda install beautifulsoup4
conda install requests
conda install -c conda-forge wordcloud
MeCab
Download and install the "Binary package for MS-Windows" from the official site. This package includes a dictionary from the start; once you get used to it, you can switch to another dictionary. You will be asked for a character code during installation — choose "UTF-8"! Everything else can be left as is.
After the installation is complete, set the environment variable:
・Search for "system details" (probably in the search box at the bottom left of the taskbar)
・Select "Environment Variables"
・Select the system environment variable "Path"
・Click "Edit", then "New"
・Enter "C:\Program Files (x86)\MeCab\bin"
・Click OK and close the dialogs
The article "Active engineers explain how to use MeCab in Python [for beginners]" also walks through this procedure, so please take a look there as well.
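To confirm the Path setting took effect, you can run a quick check from Python. This is a minimal sketch: `shutil.which` simply looks up an executable on PATH, so it returns the full path to `mecab.exe` if the environment variable is set correctly and `None` otherwise.

```python
import shutil

# Look up the mecab executable on PATH.
# Returns its full path if the environment variable is set correctly,
# or None if MeCab's bin directory is not on PATH yet.
mecab_path = shutil.which("mecab")
if mecab_path is None:
    print("mecab not found on PATH - check the environment variable")
else:
    print(f"mecab found at {mecab_path}")
```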
Now that the preparations are complete, let's write the code. First, the imports:
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
Once you've done this, go to the site you want to scrape — in this example, TripAdvisor's Tokyo Tower page. Check two things: the URL, and where in the HTML the content you want (the reviews, this time) lives.
The URL you can read directly from the address bar, so I will omit the explanation. For the latter, press "F12" to launch the developer tools. A window like the one above will appear; from here you can see where and how the reviews are stored.
The check itself is simple: press "Shift + Ctrl + C", then click the review text on the page. The corresponding part is then highlighted in the developer-tools window, and you can see that the reviews are stored in `q` tags with the class "IRsGHoPm".
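To see what this selection looks like in code before hitting the real site, here is a minimal sketch of `find_all` on a hand-written HTML snippet. The class name mirrors the one found above; the review texts are made-up examples.

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the page structure found with the developer tools.
html = """
<div>
  <q class="IRsGHoPm">Great view of the city!</q>
  <q class="IRsGHoPm">The observation deck was crowded.</q>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Select every q tag whose class is 'IRsGHoPm'
reviews = soup.find_all("q", class_="IRsGHoPm")
print([r.text for r in reviews])
# -> ['Great view of the city!', 'The observation deck was crowded.']
```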
After that, specify this URL and this class in the code and perform the scraping.
# Store each page's scraped reviews in df_list
df_list = []

# Scrape 20 pages
pages = range(0, 100, 5)

for page in pages:
    # The URL differs slightly between the first page and later pages, so branch with if
    if page == 0:
        urlName = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'
    else:
        urlName = 'https://www.tripadvisor.jp/Attraction_Review-g14129730-d320047-Reviews-or' + str(page) + '-Tokyo_Tower-Shibakoen_Minato_Tokyo_Tokyo_Prefecture_Kanto.html'
    url = requests.get(urlName)
    soup = BeautifulSoup(url.content, "html.parser")
    # Select q tags with class 'IRsGHoPm' from the HTML
    review = soup.find_all('q', class_='IRsGHoPm')
    # Store the extracted reviews in order
    for i in range(len(review)):
        _df = pd.DataFrame({'Number': i + 1,
                            'review': [review[i].text]})
        df_list.append(_df)
At this point, df_list holds one small DataFrame per review, so concatenate them into a single DataFrame:
df_review = pd.concat(df_list).reset_index(drop=True)
print(df_review.shape)
df_review
Next, import MeCab, matplotlib, and WordCloud.
import MeCab
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Enter the code, referring to "Scraping review sites to find out the number of words".
# Prepare MeCab
tagger = MeCab.Tagger()
tagger.parse('')

# Concatenate all review text
all_text = ""
for s in df_review['review']:
    all_text += s

node = tagger.parseToNode(all_text)

# Extract nouns into a list
# (MeCab reports parts of speech in Japanese, so compare against '名詞' (noun))
word_list = []
while node:
    word_type = node.feature.split(',')[0]
    if word_type == '名詞':
        word_list.append(node.surface)
    node = node.next

# Join the list into a single space-separated string
word_chain = ' '.join(word_list)
All that's left is to run WordCloud.
# Create stop words (words to exclude) - empty for now
stopwords = ['']

# Create the word cloud (font_path must point to a Japanese-capable font)
W = WordCloud(width=500, height=300, background_color='lightblue',
              colormap='inferno', font_path=r'C:\Windows\Fonts\yumin.ttf',
              stopwords=set(stopwords)).generate(word_chain)

plt.figure(figsize=(15, 12))
plt.imshow(W)
plt.axis('off')
plt.show()
Then it will be created as follows.
However, "no" (の), "koto" (こと), and "tame" (ため) are unnecessary, so I will remove them. That's where the stopwords above come in.
# Create stop words (words to exclude)
# The cloud is built from Japanese text, so list the Japanese forms
stopwords = ['の', 'こと', 'ため']

# Create the word cloud
W = WordCloud(width=500, height=300, background_color='lightblue',
              colormap='inferno', font_path=r'C:\Windows\Fonts\yumin.ttf',
              stopwords=set(stopwords)).generate(word_chain)

plt.figure(figsize=(15, 12))
plt.imshow(W)
plt.axis('off')
plt.show()
Then it will be as follows.
Does this tell us something, or nothing at all? ... Whether many people are comparing it with Skytree, or many are simply saying "you can see Skytree from here", there is no doubt that "Skytree" is a topic of interest among Tokyo Tower visitors. So let's make a word cloud of Skytree's reviews as well, below.
Words such as "elevator" and "ticket", which were barely mentioned for Tokyo Tower (their letters were small), stand out here. Also, "Tokyo Tower" is not prominent. This seems to be where Tokyo Tower and Skytree differ.
end.
It may also be interesting to compare your own company with competitors on company review sites such as OpenWork. Comparing similar facilities seems to reveal things too — say, the reviews of the five major domes: Sapporo Dome, Tokyo Dome, Nagoya Dome, Osaka Dome, and Fukuoka Dome. By the way, scraping OpenWork requires setting request headers. See "[Python] What to do when scraping returns 403 Forbidden: You don't have permission to access on this server" for details.
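As a sketch of what "header settings" means here: some sites reject requests that lack a browser-like User-Agent, and passing a `headers` dict to `requests.get` is the usual fix. The User-Agent string below is only an illustrative example, not a requirement of any particular site, and the actual request is left commented out so nothing is fetched.

```python
import requests

# A browser-like User-Agent header; without one, some sites return
# 403 Forbidden. This particular string is only an example.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}

# Pass the headers when fetching the page, e.g.:
# response = requests.get('https://example.com/some_page', headers=headers)
# soup = BeautifulSoup(response.content, 'html.parser')
print(headers['User-Agent'])
```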