I am currently a data analysis intern at EXIDEA Co., Ltd., which develops an SEO writing tool. It has been four months since I started, but because of COVID-19 I have never met anyone at the company in person. Even so, through the regular online drinking parties and daily meetings I have finally come to understand what the company is like. Lately I also hear the word **"recruitment"** a lot at the monthly all-hands meetings, and I suspect many companies, not just ventures, are focusing on recruiting through Wantedly. **In this article, I use nlplot, a package that makes it easy to visualize natural language, on our Wantedly story articles to take a fresh look at the company's character and the message we want to convey to applicants.**
The source code is available on GitHub, so please feel free to take a look. https://github.com/yuuuusuke1997/Article_analysis
・macOS
・Python 3.7.6
・Jupyter Notebook
・Zsh
For this scraping, we navigate the web pages as shown below and retrieve only our own company's articles. The scraping was done with Wantedly's permission; thank you for your understanding.
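In addition to asking for permission, it does not hurt to confirm programmatically that the pages you plan to fetch are not disallowed. Below is a minimal sketch using Python's standard urllib.robotparser, assuming the feed URL that is used later in this article:
python
from urllib.robotparser import RobotFileParser

# Read Wantedly's robots.txt and check whether a generic crawler may fetch the feed page
rp = RobotFileParser('https://www.wantedly.com/robots.txt')
rp.read()

feed_url = 'https://www.wantedly.com/companies/exidea/feed'
print(rp.can_fetch('*', feed_url))  # True means the URL is not disallowed by robots.txt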
Wantedly's pages load the next batch of articles when you scroll to the bottom of the page, so Selenium, which automates browser operations, is used only where strictly necessary to collect the data. To drive the browser you need a **driver for your browser** and the **Selenium library**. I use Google Chrome, so I downloaded ChromeDriver from here and placed it in the following directory. Please replace the * under Users with your own user name.
python
$ cd /Users/*/documents/nlplot
$ ls
article_analysis.ipynb
chromedriver
post_articles.csv
user_dic.csv
Install the Selenium library with pip.
python
$ pip install selenium
If you want to know more about Selenium, from installation to basic usage, the article here is a good reference. Now that everything is in place, let's check the setup and then start scraping.
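Before running the full scrape, it is worth a quick smoke test that Selenium can actually drive Chrome. Below is a minimal sketch, assuming chromedriver sits in the working directory shown above and using the same Selenium 3-style constructor as the scraping code below:
python
from selenium import webdriver

# Launch Chrome through the chromedriver placed in the current directory,
# open the feed page once, print the page title, and close the browser
driver = webdriver.Chrome('chromedriver')
driver.get('https://www.wantedly.com/companies/exidea/feed')
print(driver.title)
driver.quit()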
article_analysis.ipynb
import json
import re
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup as bs4
from selenium import webdriver
base_url = 'https://www.wantedly.com'
def scrape_path(url):
"""
Get the URL of the space detail page from the story list page
Parameters
--------------
url: str
URL of the story list page
Returns
----------
path_list: list of str
A list containing the URL of the space detail page
"""
path_list = []
response = requests.get(url)
soup = bs4(response.text, 'lxml')
time.sleep(3)
# Get the contents of <script data-placeholder-key="wtd-ssr-placeholder">
# Strip the leading '//' from the JSON string with .string[3:]
feeds = soup.find('script', {'data-placeholder-key': 'wtd-ssr-placeholder'}).string[3:]
feed = json.loads(feeds)
# Get 'spaces' from feed['body']
feed_spaces = feed['body'][list(feed['body'].keys())[0]]['spaces']
for i in feed_spaces:
space_path = base_url + i['post_space_path']
path_list.append(space_path)
return path_list
path_list = scrape_path('https://www.wantedly.com/companies/exidea/feed')
def scrape_url(path_list):
"""
Get the URL of the story detail page from the space detail page
Parameters
--------------
path_list: list of str
A list containing the URL of the space detail page
Returns
----------
url_list: list of str
List containing URLs for story detail pages
"""
url_list = []
#Launch chrome(chromedriver is placed in the same directory as this file)
driver = webdriver.Chrome('chromedriver')
for feed_path in path_list:
driver.get(feed_path)
#Scroll to the bottom of the page and exit the program if you can no longer scroll
#Height before scrolling
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
#Scroll to the bottom of the page
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
#Force a wait because Selenium moves faster than the new page can load
time.sleep(3)
#Height after scrolling
new_height = driver.execute_script("return document.body.scrollHeight")
# Keep scrolling until new_height stops changing (i.e., matches last_height)
if new_height == last_height:
break
else:
last_height = new_height
continue
soup = bs4(driver.page_source, 'lxml')
time.sleep(3)
# <div class="post-space-item" >Get the element of
post_space = soup.find_all('div', class_='post-content')
for post in post_space:
# <"post-space-item">of<a>Get element
url = base_url + post.a.get('href')
url_list.append(url)
url_list = list(set(url_list))
#Close web page
driver.close()
return url_list
url_list = scrape_url(path_list)
def get_text(url_list, wrong_name, correct_name):
"""
Get text from story details page
Parameters
--------------
url_list: list of str
List containing URLs for story detail pages
wrong_name: str
Wrong company name
correct_name: str
Correct company name
Returns
----------
text_list: list of str
A list of stories
"""
text_list = []
for url in url_list:
response = requests.get(url)
soup = bs4(response.text, 'lxml')
time.sleep(3)
# <section class="article-description" data-post-id="○○○○○○">In<p>Get all elements
articles = soup.find('section', class_='article-description').find_all('p')
for article in articles:
#Split by delimiter
for text in re.split('[\n!?!?。]', article.text):
#Preprocessing
replaced_text = text.lower() #Lowercase conversion
replaced_text = re.sub(wrong_name, correct_name, replaced_text) #Convert company name to uppercase
replaced_text = re.sub(r'https?://[\w/:%#\$&\?\(\)~\.=\+\-…]+', '', replaced_text) #Remove URL
replaced_text = re.sub('[0-9]', '', replaced_text) #Remove numbers
replaced_text = re.sub('[,:;-~%()]', '', replaced_text) #Remove half-width symbols
replaced_text = re.sub('[,:;·~%()※""【】(笑)]', '', replaced_text) #Remove full-width symbols and (笑)
replaced_text = re.sub(' ', '', replaced_text) #Remove full-width spaces (\u3000)
text_list.append(replaced_text)
text_list = [x for x in text_list if x != '']
return text_list
text_list = get_text(url_list, 'exidea', 'EXIDEA')
Save the retrieved text in a CSV file.
nlplot_articles.ipynb
df_text = pd.DataFrame(text_list, columns=['text'])
df_text.to_csv('post_articles.csv', index=False)
From here I install MeCab and do various other preparations. This part did not go nearly as smoothly as I expected and was rather disheartening, so I hope this write-up helps keep your motivation up.
Why go through such a tedious procedure in the first place? You might think a single $ brew install mecab would be enough, and it may well be. However, to get the morphological analysis results I wanted out of nlplot, I needed a UTF-8 user dictionary with company-specific division names and in-house terms registered as proper nouns. Because I had installed with brew for convenience, the dictionary's character code ended up as EUC-JP and I had to redo the work. So if you care about the output, try the procedure below; if you just want a quick trial, install with brew by referring to the following.
Preparing the environment for using MeCab on Mac
Download **MeCab itself** and the **IPA dictionary** from the MeCab official website with the curl command. This time we install them into the local (home) environment. First, install MeCab itself.
python
#Create the installation directory of mecab in the local environment
$ mkdir /Users/*/opt/mecab
$ cd /Users/*/opt/mecab
#Download into the current directory, specifying the file name with the -o option
$ curl -Lo mecab-0.996.tar.gz 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7cENtOXlicTFaRUE'
#Unzip the source code file
$ tar zxfv mecab-0.996.tar.gz
$ cd mecab-0.996
#Run configure, specifying UTF-8 as the character code
$ ./configure --prefix=/Users/*/opt/mecab --with-charset=utf-8
#Build using the Makefile generated by configure
$ make
#Check if it works properly before installation
$ make check
#Install the binaries built by make into /Users/*/opt/mecab
$ make install
Done
If you are wondering what configure, make, and make install actually do, the article here may be helpful.
Now that MeCab is installed, let's add it to the PATH so that the mecab command can be run.
python
#Check shell type
$ echo $SHELL
/bin/zsh
# Add the path to .zshrc
$ echo 'export PATH=/Users/*/opt/mecab/bin:$PATH' >> ~/.zshrc
"""
Note: change the destination file (~/.zshrc) to match your login shell
Example) $ echo 'export PATH=/Users/*/opt/mecab/bin:$PATH' >> ~/.bash_profile
"""
#Reflects shell settings
$ source ~/.zshrc
#Check that the path is set correctly
$ which mecab
/Users/*/opt/mecab/bin/mecab
Done
Reference article: What is PATH?
Next, download and install the IPA dictionary in the same way.
python
#Move to the starting directory
$ cd /Users/*/opt/mecab
#Download into the current directory, specifying the file name with the -o option
$ curl -Lo mecab-ipadic-2.7.0-20070801.tar.gz 'https://drive.google.com/uc?export=download&id=0B4y35FiV1wh7MWVlSDBCSXZMTXM'
#Unzip the source code file
$ tar zxfv mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
#Run configure, specifying UTF-8 as the character code
$ ./configure --prefix=/Users/*/opt/mecab --with-charset=utf-8
#Build using the Makefile generated by configure
$ make
#Install the binaries built by make into /Users/*/opt/mecab
$ make install
Done
#Confirmation of character code
#If the character code is EUC-JP, change it to UTF-8
$ mecab -P | grep config-charset
config-charset: EUC-JP
#Search config file
$ find /Users -name dicrc
/Users/*/opt/mecab/mecab-ipadic-2.7.0-20070801/dicrc
$ vim /Users/*/opt/mecab/mecab-ipadic-2.7.0-20070801/dicrc
[Before change] config-charset = EUC-JP
[After change] config-charset = UTF-8
$ mecab
俺は人間をやめるぞ!ジョジョ ("I'm quitting being human! JoJo")
俺 (ore)          noun, pronoun, general
は (wa)           particle, binding particle
人間 (ningen)     noun, general
を (wo)           particle, case particle, general
やめる (yameru)   verb, independent, ichidan conjugation, base form
ぞ (zo)           particle, sentence-final particle
!                 symbol, general
ジョジョ (JoJo)   noun, proper noun, organization (no reading registered)
EOS
#IPA dictionary directory check
$ find /Users -name ipadic
/Users/*/opt/mecab/lib/mecab/dic/ipadic
Next, install mecab-ipadic-NEologd, a dictionary that is regularly updated with new words and proper nouns.
python
#Move to the working directory
$ cd /Users/*/opt/mecab
#Download the source code from github
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
$ cd mecab-ipadic-neologd
#Enter "yes" on the screen to execute and check the result
$ ./bin/install-mecab-ipadic-neologd -n
Done
#Confirmation of character code
#If the character code is EUC-JP, change it to UTF-8
$ mecab -d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd -P | grep config-charset
config-charset: EUC-JP
#Search config file
$ find /Users -name dicrc
/Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd/dicrc
$ vim /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd/dicrc
[Before change] config-charset = EUC-JP
[After change] config-charset = UTF-8
#NEologd dictionary directory check
$ find /Users -name mecab-ipadic-neologd
/Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd
$ echo “俺は人間をやめるぞ!ジョジョ" | mecab -d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd
“                    symbol, open bracket
俺は人間をやめるぞ!   noun, proper noun, general (the whole phrase is a single NEologd entry)
ジョジョ              noun, general
EOS
Github Official: mecab-ipadic-neologd
python
#Finally, install the Python 3 bindings so that MeCab can be used from Python
$ pip install mecab-python3
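As a quick check that the bindings work, you can parse a sentence directly from Python. A minimal sketch, assuming the NEologd dictionary path built above (replace * with your user name); the test sentence is just an example:
python
import MeCab

# Point the tagger at the NEologd dictionary installed earlier and parse a test sentence
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd')
print(tagger.parse('自然言語処理を始めます'))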
A user dictionary lets you register words that the system dictionary cannot handle, with whatever meaning you want to give them.
First, create a csv file in the required format for the words you want to add. Visualize once, and if a word catches your attention, add it to the csv file.
python
"""
Format
Surface form, left context ID, right context ID, cost, part of speech, POS subcategory 1, POS subcategory 2, POS subcategory 3, conjugation type, conjugated form, base form, reading, pronunciation
"""
#csv file creation
$ echo 'Internship student,-1,-1,1,noun,General,*,*,*,*,*,*,*,Internship'"\n"'Core value,-1,-1,1,noun,General,*,*,*,*,*,*,*,Core value'"\n"'Meetup,-1,-1,1,noun,General,*,*,*,*,*,*,*,Meetup' > /Users/*/Documents/nlplot/user_dic.csv
#Check the character code of the csv file
$ file /Users/*/Documents/nlplot/user_dic.csv
/users/*/documents/nlplot/user_dic.csv: UTF-8 Unicode text
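If the echo one-liner feels error-prone, the same kind of row can also be appended from Python. This is only a sketch, assuming the 13-column format listed above; the example word, its reading, and the POS fields (which MeCab expects in Japanese, e.g. 名詞 for noun) are placeholders:
python
import csv

# Append one more noun to the user dictionary CSV
# (surface, left ID, right ID, cost, POS, sub1, sub2, sub3, conj. type, conj. form, base form, reading, pronunciation)
new_word = ['グロースハック', -1, -1, 1, '名詞', '固有名詞', '一般', '*',
            '*', '*', 'グロースハック', 'グロースハック', 'グロースハック']

with open('/Users/*/Documents/nlplot/user_dic.csv', 'a', encoding='utf-8', newline='') as f:
    csv.writer(f).writerow(new_word)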
Next, compile the created csv file into a user dictionary.
python
#Create a directory to save the user dictionary
$ mkdir /Users/*/opt/mecab/lib/mecab/dic/userdic
"""
-d Directory containing system dictionaries
-u Where to save the compiled user dictionary
-f Character code of the csv file
-t Character code of the user dictionary (followed by the path to the csv file)
"""
#Create the user dictionary
$ /Users/*/opt/mecab/libexec/mecab/mecab-dict-index \
-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd \
-u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic \
-f utf-8 -t utf-8 /Users/*/Documents/nlplot/user_dic.csv
# Confirm that userdic.dic has been created
$ find /Users -name userdic.dic
/Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic
Now that we have installed mecab and created a user dictionary, we will move on to morphological analysis.
Reference article: How to add words
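Before tokenizing the articles, it is worth confirming that MeCab actually picks up the user dictionary. A minimal sketch, assuming one of the words registered above was コアバリュー ("core value") and using the dictionary paths built earlier:
python
import MeCab

# Load the NEologd system dictionary together with the compiled user dictionary
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd '
                      '-u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic')

# A word registered in user_dic.csv should now come back as a single noun token
print(tagger.parse('コアバリュー'))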
First, load the csv file created during scraping.
nlplot_articles.ipynb
df = pd.read_csv('post_articles.csv')
df.head()
Since nlplot works on lists of words, we tokenize each sentence with MeCab and keep only the nouns.
article_analysis.ipynb
import MeCab
def download_slothlib():
"""
Load SlothLib and create a stopword
Returns
----------
slothlib_stopwords: list of str
List containing stop words
"""
slothlib_path = 'http://svn.sourceforge.jp/svnroot/slothlib/CSharp/Version1/SlothLib/NLP/Filter/StopWord/word/Japanese.txt'
response = requests.get(slothlib_path)
soup = bs4(response.content, 'html.parser')
slothlib_stopwords = [line.strip() for line in soup]
slothlib_stopwords = slothlib_stopwords[0].split('\r\n')
slothlib_stopwords = [x for x in slothlib_stopwords if x != '']
return slothlib_stopwords
stopwords = download_slothlib()
def add_stopwords():
"""
Add extra words to the stopword list
Returns
----------
stopwords: list of str
List containing stop words
"""
add_words = ['See', 'Company', 'I'd love to', 'By all means', 'Story', '弊Company', 'Human', 'What', 'article', 'Other than', 'Hmm', 'of', 'Me', 'Sa', 'like this']
stopwords.extend(add_words)
return stopwords
stopwords = add_stopwords()
def tokenize_text(text):
"""
Extract only nouns by morphological analysis
Parameters
--------------
text: str
Text stored in dataframe
Returns
----------
nons_list: list of str
A list that contains only nouns after morphological analysis
"""
#Specify the directory where the user dictionary and neologd dictionary are saved
tagger = MeCab.Tagger('-d /Users/*/opt/mecab/lib/mecab/dic/mecab-ipadic-neologd -u /Users/*/opt/mecab/lib/mecab/dic/userdic/userdic.dic')
node = tagger.parseToNode(text)
nons_list = []
while node:
#POS tags from the IPA/NEologd dictionaries are in Japanese ('名詞' = noun)
if node.feature.split(',')[0] in ['名詞'] and node.surface not in stopwords:
nons_list.append(node.surface)
node = node.next
return nons_list
df['words'] = df['text'].apply(tokenize_text)
article_analysis.ipynb
df.head()
3-1. Installing nlplot
python
$ pip install nlplot
3-2. uni-gram
nlplot_articles.ipynb
import nlplot
#Specify the words column of df as the target
npt = nlplot.NLPlot(df, target_col='words')
#top_n treats the top 2 most frequent words as stopwords; min_freq also removes words at or below that frequency
#Top 2 words: ['company', 'jobs']
stopwords = npt.get_stopword(top_n=2, min_freq=0)
npt.bar_ngram(
title='uni-gram',
xaxis_label='word_count',
yaxis_label='word',
ngram=1,
top_n=50,
stopwords=stopwords,
save=True
)
3-3. bi-gram
nlplot_articles.ipynb
npt.bar_ngram(
title='bi-gram',
xaxis_label='word_count',
yaxis_label='word',
ngram=2,
top_n=50,
stopwords=stopwords,
save=True
)
3-4. tri-gram
nlplot_articles.ipynb
npt.bar_ngram(
title='tri-gram',
xaxis_label='word_count',
yaxis_label='word',
ngram=3,
top_n=50,
stopwords=stopwords,
save=True
)
3-5. tree map
nlplot_articles.ipynb
npt.treemap(
title='tree map',
ngram=1,
stopwords=stopwords,
width=1200,
height=800,
save=True
)
3-6. wordcloud
nlplot_articles.ipynb
npt.wordcloud(
stopwords=stopwords,
max_words=100,
max_font_size=100,
colormap='tab20_r',
save=True
)
3-7. co-occurrence network
nlplot_articles.ipynb
npt.build_graph(stopwords=stopwords, min_edge_frequency=13)
display(
npt.node_df, npt.node_df.shape,
npt.edge_df, npt.edge_df.shape
)
npt.co_network(
title='All sentiment Co-occurrence network',
color_palette='hls',
save=True
)
3-8. sunburst chart
nlplot_articles.ipynb
npt.sunburst(
title='All sentiment sunburst chart',
colorscale=True,
color_continuous_scale='Oryel',
width=800,
height=600,
save=True
)
Reference article: The library "nlplot" that can easily visualize and analyze natural language has been released
Visualizing the articles made me feel once again how much EXIDEA's cherished action guideline, "The share," is actually put into practice. In particular, The share's Happy, Sincere, and Altruistic values stand out in the articles, and I think that is why I have been able to meet colleagues with whom I can talk about the best possible working environment, what we want to achieve, and what worries us. There is still little I can contribute to the company in my day-to-day work, but I want to make the most of what I can do now, such as fully committing to the task in front of me and sharing the results outside the company.
Through this article I was reminded of how important preprocessing is. I started simply because I wanted to try nlplot, but when I visualized the text without any preprocessing, proper nouns were split into separate morphemes in the bi-gram and tri-gram plots and the results were a mess. Thanks to that detour, the biggest takeaway was the Linux-related knowledge I picked up while installing MeCab and building the user dictionary. Rather than leaving it as book knowledge, I want to keep up the basic habit of actually moving my hands and apply it to future learning.
This turned out to be a long article; thank you for reading this far. If you find any mistakes, I would be grateful if you could point them out in the comments.