I am currently a data analysis intern at EXIDEA Co., Ltd., which develops an SEO writing tool. It has been four months since I started, but because of COVID-19 I have never met anyone at the company in person. Even so, through the regular online drinking parties and daily meetings I have finally come to understand what the company is like. Lately I also hear the word **"recruitment"** a lot at the monthly all-hands meetings, and I suspect many companies, not just ventures, are focusing on recruiting through Wantedly. **In this article, I use nlplot, a package that makes it easy to visualize natural language, on our Wantedly story articles to take a fresh look at the company's character and the message we want to convey to applicants.**
The source code is available on GitHub, so please feel free to take a look. https://github.com/yuuuusuke1997/Article_analysis
・macOS
・Python 3.7.6
・Jupyter Notebook
・Zsh
For this scraping, we navigate the web pages as shown below and retrieve only our own company's articles. The scraping was done with Wantedly's permission; thank you for your understanding.
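In addition to asking for permission, it does not hurt to confirm programmatically that the pages you plan to fetch are not disallowed. Below is a minimal sketch using Python's standard urllib.robotparser, assuming the feed URL that is used later in this article:
python
from urllib.robotparser import RobotFileParser

# Read Wantedly's robots.txt and check whether a generic crawler may fetch the feed page
rp = RobotFileParser('https://www.wantedly.com/robots.txt')
rp.read()

feed_url = 'https://www.wantedly.com/companies/exidea/feed'
print(rp.can_fetch('*', feed_url))  # True means the URL is not disallowed by robots.txt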
Wantedly's pages load the next batch of articles when you scroll to the bottom of the page, so Selenium, which automates browser operations, is used only where strictly necessary to collect the data. To drive the browser you need a **driver for your browser** and the **Selenium library**. I use Google Chrome, so I downloaded ChromeDriver from here and placed it in the following directory. Please replace the * under Users with your own user name.
python
$ cd /Users/*/documents/nlplot
$ ls
article_analysis.ipynb
chromedriver
post_articles.csv
user_dic.csv
Install the Selenium library with pip.
python
$ pip install selenium
If you want to know more about Selenium, from installation to basic usage, the article here is a good reference. Now that everything is in place, let's check the setup and then start scraping.
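Before running the full scrape, it is worth a quick smoke test that Selenium can actually drive Chrome. Below is a minimal sketch, assuming chromedriver sits in the working directory shown above and using the same Selenium 3-style constructor as the scraping code below:
python
from selenium import webdriver

# Launch Chrome through the chromedriver placed in the current directory,
# open the feed page once, print the page title, and close the browser
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.wantedly.com/companies/exidea/feed')
print(driver.title)
driver.quit()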
article_analysis.ipynb
import json
import re
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs4
from selenium import webdriver
base_url = 'https://www.wantedly.com'
def scrape_path(url):
"""
Get the URL of the space detail page from the story list page
Parameters
--------------
url: str
URL of the story list page
Returns
----------
path_list: list of str
A list containing the URL of the space detail page
"""
path_list = []
response = requests.get(url)
soup = bs4(response.text, 'lxml')
time.sleep(3)
# Get the contents of <script data-placeholder-key="wtd-ssr-placeholder">
# Strip the leading '//' from the JSON string with .string[3:]
feeds = soup.find('script', {'data-placeholder-key': 'wtd-ssr-placeholder'}).string[3:]
feed = json.loads(feeds)
# Get 'spaces' from feed['body']
feed_spaces = feed['body'][list(feed['body'].keys())[0]]['spaces']
for i in feed_spaces:
space_path = base_url + i['post_space_path']
path_list.append(space_path)
return path_list
path_list = scrape_path('https://www.wantedly.com/companies/exidea/feed')
def scrape_url(path_list):
"""
Get the URL of the story detail page from the space detail page
Parameters
--------------
path_list: list of str
A list containing the URL of the space detail page
Returns
----------
url_list: list of str
List containing URLs for story detail pages
"""
url_list = []
#Launch chrome(chromedriver is placed in the same directory as this file)
driver = webdriver.Chrome('chromedriver')
for feed_path in path_list:
driver.get(feed_path)
#Scroll to the bottom of the page and exit the program if you can no longer scroll
#Height before scrolling
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
#Scroll to the bottom of the page
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
#Force a wait because Selenium moves faster than the new page can load
time.sleep(3)
#Height after scrolling
new_height = driver.execute_script("return document.body.scrollHeight")
# Keep scrolling until new_height stops changing (i.e., matches last_height)
if new_height == last_height:
break
else:
last_height = new_height
continue
soup = bs4(driver.page_source, 'lxml')
time.sleep(3)
# <div class="post-space-item" >Get the element of
post_space = soup.find_all('div', class_='post-content')
for post in post_space:
# <"post-space-item">of<a>Get element
url = base_url + post.a.get('href')
url_list.append(url)
url_list = list(set(url_list))
#Close web page
driver.close()
return url_list
url_list = scrape_url(path_list)
def get_text(url_list, wrong_name, correct_name):
"""
Get text from story details page
Parameters
--------------
url_list: list of str
List containing URLs for story detail pages
wrong_name: str
Wrong company name
correct_name: str
Correct company name
Returns
----------
text_list: list of str
A list of stories
"""
text_list = []
for url in url_list:
response = requests.get(url)
soup = bs4(response.text, 'lxml')
time.sleep(3)
# <section class="article-description" data-post-id="○○○○○○">In<p>Get all elements
articles = soup.find('section', class_='article-description').find_all('p')
for article in articles:
#Split by delimiter
for text in re.split('[\n!?!?。]', article.text):
#Preprocessing
replaced_text = text.lower() #Lowercase conversion
replaced_text = re.sub(wrong_name, correct_name, replaced_text) #Convert company name to uppercase
replaced_text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', '', replaced_text) #Remove URL
replaced_text = re.sub('[0-9]', '', replaced_text) #Remove numbers
replaced_text = re.sub('[,:;-~%()]', '', replaced_text) #Remove half-width symbols
replaced_text = re.sub('[,:;·~%()※""【】(笑)]', '', replaced_text) #Remove full-width symbols and (笑)
replaced_text = re.sub(' ', '', replaced_text) #Remove full-width spaces (\u3000)
text_list.append(replaced_text)
text_list = [x for x in text_list if x != '']
return text_list
text_list = get_text(url_list, 'exidea', 'EXIDEA')
Save the retrieved text in a CSV file.
nlplot_articles.ipynb
df_text = pd.DataFrame(text_list, columns=['text'])
df_text.to_csv('post_articles.csv', index=False)
From here I install MeCab and do various other preparations. This part did not go nearly as smoothly as I expected and was rather disheartening, so I hope this write-up helps keep your motivation up.
Why go through such a tedious procedure in the first place? You might think a single $ brew install mecab would be enough, and it may well be. However, to get the morphological analysis results I wanted out of nlplot, I needed a UTF-8 user dictionary with company-specific division names and in-house terms registered as proper nouns. Because I had installed with brew for convenience, the dictionary's character code ended up as EUC-JP and I had to redo the work. So if you care about the output, try the procedure below; if you just want a quick trial, install with brew by referring to the following.
Preparing the environment for using MeCab on Mac
Download **MeCab itself** and the **IPA dictionary** from the MeCab official website with the curl command. This time we install them into the local (home) environment. First, install MeCab itself.
python
#Create the installation directory of mecab in the local environment
$ mkdir /Users/*/opt/mecab
$ cd /Users/*/opt/mecab
#Download into the current directory, specifying the file name with the -o option
$ curl -Lo mecab-0.996.tar.gz 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE'
#Unzip the source code file
$ tar zxfv mecab-0.996.tar.gz
$ cd mecab-0.996
#Run configure, specifying UTF-8 as the character code
$ ./configure --prefix=/Users/*/opt/mecab --with-charset=utf-8
#Build using the Makefile generated by configure
$ make
#Check if it works properly before installation
$ make check
#Install the binaries built by make into /Users/*/opt/mecab
$ make install
Done
If you are wondering what configure, make, and make install actually do, the article here may be helpful.
Now that MeCab is installed, let's add it to the PATH so that the mecab command can be run.
python
#Check shell type
$ echo $SHELL
/bin/zsh
# Add the path to .zshrc
$ echo 'export PATH=/Users/*/opt/mecab/bin:$PATH' >> ~/.zshrc
"""
Note: change the destination file (~/.zshrc) to match your login shell
Example) $ echo 'export PATH=/Users/*/opt/mecab/bin:$PATH' >> ~/.bash_profile
"""
#Reflects shell settings
$ source ~/.zshrc
#Check that the path is set correctly
$ which mecab
/Users/*/opt/mecab/bin/mecab
Done
Reference article: What is PATH?
Next, download and install the IPA dictionary in the same way.
python
#Move to the starting directory
$ cd /Users/*/opt/mecab
#Download into the current directory, specifying the file name with the -o option
$ curl -Lo mecab-ipadic-2.7.0-20070801.tar.gz 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM'
#Unzip the source code file
$ tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
#Run configure, specifying UTF-8 as the character code
$ ./configure --prefix=/Users/*/opt/mecab --with-charset=utf-8
#Build using the Makefile generated by configure
$ make
#Install the binaries built by make into /Users/*/opt/mecab
$ make install
Done
#Confirmation of character code
#If the character code is EUC-JP, change it to UTF-8
$ mecab -P | grep config-charset
config-charset: EUC-JP
#Search config file
$ find /Users -name dicrc
/Users/*/opt/mecab/mecab-ipadic-2.7.0-20070801/dicrc
$ vim /Users/*/opt/mecab/mecab-ipadic-2.7.0-20070801/dicrc
[Before change] config-charset = EUC-JP
[After change] config-charset = UTF-8
$ mecab
俺は人間をやめるぞ!ジョジョ ("I'm quitting being human! JoJo")
俺 (ore)          noun, pronoun, general
は (wa)           particle, binding particle
人間 (ningen)     noun, general
を (wo)           particle, case particle, general
やめる (yameru)   verb, independent, ichidan conjugation, base form
ぞ (zo)           particle, sentence-final particle
!                 symbol, general
ジョジョ (JoJo)   noun, proper noun, organization (no reading registered)
EOS
#IPA dictionary directory check
$ find /Users -name ipadic
/Users/*/opt/mecab/lib/mecab/dic/ipadic
Next, install mecab-ipadic-NEologd, a dictionary that is regularly updated with new words and proper nouns.
python
#Move to the working directory
$ cd /Users/*/opt/mecab
#Download the source code from github
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
#Enter "yes" on the screen to execute and check the result
$ ./bin/install-mecab-ipadic-neologd -n
Done
#Confirmation of character code
#If the character code is EUC-JP, change it to UTF-8
$ mecab -d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd -P | grep config-charset
config-charset: EUC-JP
#Search config file
$ find /Users -name dicrc
/Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd/dicrc
$ vim /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd/dicrc
[Before change] config-charset = EUC-JP
[After change] config-charset = UTF-8
#NEologd dictionary directory check
$ find /Users -name mecab-ipadic-neologd
/Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd
$ echo “俺は人間をやめるぞ!ジョジョ" | mecab -d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd
“                    symbol, open bracket
俺は人間をやめるぞ!   noun, proper noun, general (the whole phrase is a single NEologd entry)
ジョジョ              noun, general
EOS
Github Official: mecab-ipadic-neologd
python
#Finally, install the Python 3 bindings so that MeCab can be used from Python
$ pip install mecab-python3
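As a quick check that the bindings work, you can parse a sentence directly from Python. A minimal sketch, assuming the NEologd dictionary path built above (replace * with your user name); the test sentence is just an example:
python
import MeCab

# Point the tagger at the NEologd dictionary installed earlier and parse a test sentence
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd')
print(tagger.parse('自然言語処理を始めます'))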
A user dictionary lets you register words that the system dictionary cannot handle, with whatever meaning you want to give them.
First, create a csv file in the required format for the words you want to add. Visualize once, and if a word catches your attention, add it to the csv file.
python
"""
Format
Surface form, left context ID, right context ID, cost, part of speech, POS subcategory 1, POS subcategory 2, POS subcategory 3, conjugation type, conjugated form, base form, reading, pronunciation
"""
#csv file creation
$ echo 'Internship student,-1,-1,1,noun,General,*,*,*,*,*,*,*,Internship'"\n"'Core value,-1,-1,1,noun,General,*,*,*,*,*,*,*,Core value'"\n"'Meetup,-1,-1,1,noun,General,*,*,*,*,*,*,*,Meetup' > /Users/*/Documents/nlplot/user_dic.csv
#Check the character code of the csv file
$ file /Users/*/Documents/nlplot/user_dic.csv
/users/*/documents/nlplot/user_dic.csv: UTF-8 Unicode text
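If the echo one-liner feels error-prone, the same kind of row can also be appended from Python. This is only a sketch, assuming the 13-column format listed above; the example word, its reading, and the POS fields (which MeCab expects in Japanese, e.g. 名詞 for noun) are placeholders:
python
import csv

# Append one more noun to the user dictionary CSV
# (surface, left ID, right ID, cost, POS, sub1, sub2, sub3, conj. type, conj. form, base form, reading, pronunciation)
new_word = ['グロースハック', -1, -1, 1, '名詞', '固有名詞', '一般', '*',
            '*', '*', 'グロースハック', 'グロースハック', 'グロースハック']

with open('/Users/*/Documents/nlplot/user_dic.csv', 'a', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(new_word)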
Next, compile the created csv file into a user dictionary.
python
#Create a directory to save the user dictionary
$ mkdir /Users/*/opt/mecab/lib/mecab/dic/userdic
"""
-d Directory containing system dictionaries
-u Where to save the compiled user dictionary
-f Character code of the csv file
-t Character code of the user dictionary (followed by the path to the csv file)
"""
#Create the user dictionary
$ /Users/*/opt/mecab/libexec/mecab/mecab-dict-index \
-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd \
-u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic \
-f utf-8 -t utf-8 /Users/*/Documents/nlplot/user_dic.csv
# Confirm that userdic.dic has been created
$ find /Users -name userdic.dic
/Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic
Now that we have installed mecab and created a user dictionary, we will move on to morphological analysis.
Reference article: How to add words
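Before tokenizing the articles, it is worth confirming that MeCab actually picks up the user dictionary. A minimal sketch, assuming one of the words registered above was コアバリュー ("core value") and using the dictionary paths built earlier:
python
import MeCab

# Load the NEologd system dictionary together with the compiled user dictionary
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd '
                      '-u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic')

# A word registered in user_dic.csv should now come back as a single noun token
print(tagger.parse('コアバリュー'))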
First, load the csv file created during scraping.
nlplot_articles.ipynb
df = pd.read_csv('post_articles.csv')
df.head()
Since nlplot works on lists of words, we tokenize each sentence with MeCab and keep only the nouns.
article_analysis.ipynb
import MeCab
def download_slothlib():
"""
Load SlothLib and create a stopword
Returns
----------
slothlib_stopwords: list of str
List containing stop words
"""
slothlib_path = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
response = requests.get(slothlib_path)
soup = bs4(response.content, 'html.parser')
slothlib_stopwords = [line.strip() for line in soup]
slothlib_stopwords = slothlib_stopwords[0].split('\r\n')
slothlib_stopwords = [x for x in slothlib_stopwords if x != '']
return slothlib_stopwords
stopwords = download_slothlib()
def add_stopwords():
"""
Add extra words to the stopword list
Returns
----------
stopwords: list of str
List containing stop words
"""
add_words = ['See', 'Company', 'I'd love to', 'By all means', 'Story', '弊Company', 'Human', 'What', 'article', 'Other than', 'Hmm', 'of', 'Me', 'Sa', 'like this']
stopwords.extend(add_words)
return stopwords
stopwords = add_stopwords()
def tokenize_text(text):
"""
Extract only nouns by morphological analysis
Parameters
--------------
text: str
Text stored in dataframe
Returns
----------
nons_list: list of str
A list that contains only nouns after morphological analysis
"""
#Specify the directory where the user dictionary and neologd dictionary are saved
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd -u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic')
node = tagger.parseToNode(text)
nons_list = []
while node:
#POS tags from the IPA/NEologd dictionaries are in Japanese ('名詞' = noun)
if node.feature.split(',')[0] in ['名詞'] and node.surface not in stopwords:
nons_list.append(node.surface)
node = node.next
return nons_list
df['words'] = df['text'].apply(tokenize_text)
article_analysis.ipynb
df.head()
3-1. Installing nlplot
python
$ pip install nlplot
3-2. uni-gram
nlplot_articles.ipynb
import nlplot
#Specify the words column of df as the target
npt = nlplot.NLPlot(df, target_col='words')
#top_n treats the top 2 most frequent words as stopwords; min_freq also removes words at or below that frequency
#Top 2 words: ['company', 'jobs']
stopwords = npt.get_stopword(top_n=2, min_freq=0)
npt.bar_ngram(
title='uni-gram',
xaxis_label='word_count',
yaxis_label='word',
ngram=1,
top_n=50,
stopwords=stopwords,
save=True
)
3-3. bi-gram
nlplot_articles.ipynb
npt.bar_ngram(
title='bi-gram',
xaxis_label='word_count',
yaxis_label='word',
ngram=2,
top_n=50,
stopwords=stopwords,
save=True
)
3-4. tri-gram
nlplot_articles.ipynb
npt.bar_ngram(
title='tri-gram',
xaxis_label='word_count',
yaxis_label='word',
ngram=3,
top_n=50,
stopwords=stopwords,
save=True
)
3-5. tree map
nlplot_articles.ipynb
npt.treemap(
title='tree map',
ngram=1,
stopwords=stopwords,
width=1200,
height=800,
save=True
)
3-6. wordcloud
nlplot_articles.ipynb
npt.wordcloud(
stopwords=stopwords,
max_words=100,
max_font_size=100,
colormap='tab20_r',
save=True
)
3-7. co-occurrence network
nlplot_articles.ipynb
npt.build_graph(stopwords=stopwords, min_edge_frequency=13)
display(
npt.node_df, npt.node_df.shape,
npt.edge_df, npt.edge_df.shape
)
npt.co_network(
title='All sentiment Co-occurrence network',
color_palette='hls',
save=True
)
3-8. sunburst chart
nlplot_articles.ipynb
npt.sunburst(
title='All sentiment sunburst chart',
colorscale=True,
color_continuous_scale='Oryel',
width=800,
height=600,
save=True
)
Reference article: The library "nlplot" that can easily visualize and analyze natural language has been released
Visualizing the articles made me feel once again how much EXIDEA's cherished action guideline, "The share," is actually put into practice. In particular, The share's Happy, Sincere, and Altruistic values stand out in the articles, and I think that is why I have been able to meet colleagues with whom I can talk about the best possible working environment, what we want to achieve, and what worries us. There is still little I can contribute to the company in my day-to-day work, but I want to make the most of what I can do now, such as fully committing to the task in front of me and sharing the results outside the company.
Through this article I was reminded of how important preprocessing is. I started simply because I wanted to try nlplot, but when I visualized the text without any preprocessing, proper nouns were split into separate morphemes in the bi-gram and tri-gram plots and the results were a mess. Thanks to that detour, the biggest takeaway was the Linux-related knowledge I picked up while installing MeCab and building the user dictionary. Rather than leaving it as book knowledge, I want to keep up the basic habit of actually moving my hands and apply it to future learning.
This turned out to be a long article; thank you for reading this far. If you find any mistakes, I would be grateful if you could point them out in the comments.