This is my article for day 11 of the Advent calendar.
I obtained distributed representations of the article titles bookmarked by everyone (4 people) and visualized them.
When someone bookmarks an article, IFTTT picks it up and posts it to Slack, so the data is processed from there.
- Environment: link to Dockerfile
- R-kun
  - Mostly gadgets and security articles
  - Number of bookmarks: 79
- Mr. Y
  - The widest range of the four
  - Actually, the only one of us who posts hand-picked articles for the purpose of sharing with everyone
  - Number of bookmarks: 864
- M-kun
  - Web and machine learning, etc.
  - Number of bookmarks: 240
- S (myself)
  - In addition to the Web, machine learning, and gadgets, I throw in things like "no saury are being caught this year"
  - Number of bookmarks: 896
The range and overlap of each user are intuitively close to what I expected.
Details are omitted, but the mechanism itself follows the flow shown in the figure below. Between the 4th and 5th steps you have to enter the URL of the RSS feed to subscribe to; since this time it is Hatena Bookmark, the URL is http://b.hatena.ne.jp/<username>/rss.
Like this
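For example, the feed URLs for the four accounts can be put together like this (a tiny sketch; the user IDs below are placeholders, not our real ones):

python
# hypothetical Hatena Bookmark IDs; each entry gives the RSS feed URL to register in IFTTT
usernames = ['user_r', 'user_y', 'user_m', 'user_s']
feed_urls = ['http://b.hatena.ne.jp/{}/rss'.format(name) for name in usernames]
print(feed_urls)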
That's not bad either (in fact, you can do both), but this way people can casually comment on the posts within the community.
Since the post format can be customized, it can also be used for experiments like this one; a drawback of the Slack command approach is that its posts take up a lot of space.
These two methods seem easy to do.
Either way, you need a token, so get one from here.
$ wget https://github.com/PyYoshi/slack-dump/releases/download/v1.1.3/slack-dump-v1.1.3-linux-386.tar.gz
$ tar -zxvf slack-dump-v1.1.3-linux-386.tar.gz
$ linux-386/slack-dump -t=<token> <channel>
The dump also pulls in DMs, which just get in the way, so move the channel you want somewhere else.
python
import zipfile, os

os.mkdir('dumps')
with zipfile.ZipFile('./zipfile_name') as z:
    for n in z.namelist():
        # extract only the files for the channel we care about
        if 'channel_name' in n:
            z.extract(n, './dumps')
Open the files and collect their contents; since the dump is split by date, combine everything into one list.
python
import json, glob

posts = []
# the extracted dump contains one JSON file per day
files = glob.glob('./dumps/<channel_name>/*.json')
for file in files:
    with open(file) as f:
        posts += json.loads(f.read())
Extract the messages and associate each article title with its user name (this part depends on how things are set up in IFTTT).
python
user_post_dic = {
    'Y': [],
    'S': [],
    'M': [],
    'R': [],
}

for p in posts:
    # only look at posts made by the IFTTT integration
    if "username" not in p or p["username"] != "IFTTT":
        continue
    for a in p["attachments"]:
        # crude way to skip malformed attachments
        try:
            user_post_dic[a["text"]].append(a["title"])
        except:
            pass

users = user_post_dic.keys()
print([[u, len(user_post_dic[u])] for u in users])
output
[['Y', 864], ['S', 896], ['M', 240], ['R', 79]]
The posted messages look like the following; the site name and URL parts are unnecessary, so remove them.
Use Neovim in your browser's text area<http://Developers.IO|Developers.IO>
Security measures for front-end engineers/ #frontkansai 2019 - Speaker Deck
Japanese with matplotlib
Reintroduction to Modern JavaScript/ Re-introduction to Modern JavaScript - Speaker Deck
I didn't know a good way to use re, so I just pushed through with a pile of regular expressions.
MeCab is also used for word segmentation; the environment includes SudachiPy and others, but the tool you already know is the fastest to work with.
python
import MeCab, re

m = MeCab.Tagger("-Owakati")

# patterns for stripping HTML tags, URLs, site names, etc. from the titles
_tag = re.compile(r'<.*?>')
_url = re.compile(r'(http|https)://([-\w]+\.)+[-\w]+(/[-\w./?%&=]*)?')
_title = re.compile(r'( - ).*$')
_par = re.compile(r'\(.*?\)')
_sla = re.compile(r'/.*$')
_qt = re.compile(r'"')
_sep = re.compile(r'\|.*$')
_twi = re.compile(r'(.*)on Twitter: ')
_lab = re.compile(r'(.*) ⇒ \(')
_last_par = re.compile(r'\)$')

def clean_text(text):
    # normalize full-width ASCII characters to half-width
    text = text.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))
    text = re.sub(_lab, '', text)
    text = re.sub(_tag, '', text)
    text = re.sub(_url, '', text)
    text = re.sub(_title, '', text)
    text = re.sub(_sla, '', text)
    text = re.sub(_qt, '', text)
    text = re.sub(_sep, '', text)
    text = re.sub(_twi, '', text)
    text = re.sub(_par, '', text)
    text = re.sub(_last_par, '', text)
    return text
p_all = []
m_all = []
for u in users:
    # clean the titles, then tokenize them with MeCab into word lists (Doc2Vec expects lists of tokens)
    user_post_dic[u] = list(map(clean_text, user_post_dic[u]))
    m_all += [m.parse(p).strip().split() for p in user_post_dic[u]]
    p_all += [u + '**' + p for p in user_post_dic[u]]
The reason the user name is prepended to each element of p_all is that some texts disappear entirely during preprocessing and the list indices shift, so user and title are tied together in this somewhat crude way.
(Incidentally, a bookmark whose title is just the URL ends up empty after cleaning.)
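As a small illustration of the tying and untying (using one of the titles above), the user and the title can be recovered later by splitting on the separator:

python
# a p_all entry looks like '<user>**<title>' and can be split back apart later
tag = 'M' + '**' + 'Japanese with matplotlib'
user, title = tag.split('**')  # -> 'M', 'Japanese with matplotlib'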
For now, the titles came out nice and clean:
Use Neovim in your browser's text area
Security measures for front-end engineers
Japanese with matplotlib
Reintroduction to Modern JavaScript
Doc2Vec
m_all is the tokenized text used as the material for learning the distributed representations, and p_all is just a list of names (labels).
I did not put much thought into the parameters.
python
from gensim import models

# Reference article: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
class LabeledListSentence(object):
    def __init__(self, words_list, labels):
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield models.doc2vec.TaggedDocument(words, ['%s' % self.labels[i]])

sentences = LabeledListSentence(m_all, p_all)
model = models.Doc2Vec(
    alpha=0.025,
    min_count=5,
    vector_size=100,
    epochs=20,
    workers=4
)

# build the vocabulary from the sentences we have
model.build_vocab(sentences)
model.train(
    sentences,
    total_examples=len(m_all),
    epochs=model.epochs
)

# get the tags back from the model, since their order may differ from the input order
tags = model.docvecs.offset2doctag
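Just as a sanity check on the learned vectors (a minimal sketch I'm adding here, not part of the original pipeline), you can ask the model for the document tags nearest to a given tag through the same gensim 3.x `docvecs` interface used above:

python
# pick an arbitrary learned tag and look at its nearest neighbours
sample_tag = tags[0]
for neighbour, score in model.docvecs.most_similar(positive=[sample_tag], topn=5):
    print(neighbour, score)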
This is my first time using scikit-learn's PCA; having once studied the algorithm in some depth, it feels amazing to be able to use it in two lines.
python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import japanize_matplotlib

# look up the learned vector for each tag
vecs = [model.docvecs[p] for p in tags]

# untie the user name and the title from each tag
tag_users = [p.split('**')[0] for p in tags]
tag_docs = [p.split('**')[1] for p in tags]

# it was hard to find 4 colors of similar strength
cols = ["#0072c2", "#fc6993", "#ffaa1c", "#8bd276"]

# 2D because we plot on a plane
pca = PCA(n_components=2)
coords = pca.fit_transform(vecs)

fig, ax = plt.subplots(figsize=(16, 12))
x = [v[0] for v in coords]
y = [v[1] for v in coords]
# loop per user so that each one gets its own color and legend entry
for i, u in enumerate(set(tag_users)):
    x_of_u = [v for j, v in enumerate(x) if tag_users[j] == u]
    y_of_u = [v for j, v in enumerate(y) if tag_users[j] == u]
    ax.scatter(
        x_of_u,
        y_of_u,
        label=u,
        c=cols[i],
        s=30,
        alpha=1,
        linewidth=0.2,
        edgecolors='#777777'
    )
plt.legend(
    loc='upper right',
    prop={'size': 18}
)
plt.show()
There are a lot of duplicate bookmarks to begin with, so unfortunately the clusters don't separate cleanly. If the data grows a bit more, I'd like to try inferring which user a new article belongs to and turn that into recommendations.
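As a rough sketch of one possible way to do that (entirely my own assumption, not something implemented in this article): infer a vector for a new title, compare it to the average vector of each user's titles, and recommend the article to the closest user.

python
import numpy as np

# hypothetical sketch: average each user's title vectors, then assign a new
# title to the nearest user by cosine similarity
user_centroids = {
    u: np.mean([model.docvecs[t] for t in tags if t.split('**')[0] == u], axis=0)
    for u in set(tag_users)
}

def guess_user(title):
    words = m.parse(clean_text(title)).strip().split()
    v = model.infer_vector(words)
    sims = {u: np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))
            for u, c in user_centroids.items()}
    return max(sims, key=sims.get)

print(guess_user('Japanese with matplotlib'))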
Sorry for being late (12/11 21:00).