This is my article for day 11 of the Advent calendar.
I obtained distributed representations of the article titles bookmarked by everyone (4 people) and visualized them.
When someone bookmarks an article, IFTTT picks it up and posts it to Slack, so the data is processed from there.
- Environment: link to Dockerfile
- R-kun
  - Mostly gadgets and security articles
  - Number of bookmarks: 79
- Mr. Y
  - The widest range of the four
  - Actually, the only one of us who posts hand-picked articles for the purpose of sharing with everyone
  - Number of bookmarks: 864
- M-kun
  - Web and machine learning, etc.
  - Number of bookmarks: 240
- S (myself)
  - In addition to the Web, machine learning, and gadgets, I throw in things like "no saury are being caught this year"
  - Number of bookmarks: 896
The range and overlap of each user are intuitively close to what I expected.
Details are omitted, but the mechanism itself follows the flow shown in the figure below. Between the 4th and 5th steps you have to enter the URL of the RSS feed to subscribe to; since this time it is Hatena Bookmark, the URL is http://b.hatena.ne.jp/<username>/rss.
Like this
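For example, the feed URLs for the four accounts can be put together like this (a tiny sketch; the user IDs below are placeholders, not our real ones):

python
# hypothetical Hatena Bookmark IDs; each entry gives the RSS feed URL to register in IFTTT
usernames = ['user_r', 'user_y', 'user_m', 'user_s']
feed_urls = ['http://b.hatena.ne.jp/{}/rss'.format(name) for name in usernames]
print(feed_urls)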
That's not bad either (in fact, you can do both), but this way people can casually comment on the posts within the community.
Since the post format can be customized, it can also be used for experiments like this one; a drawback of the Slack command approach is that its posts take up a lot of space.
These two methods seem easy to do.
Either way, you need a token, so get one from here.
$ wget https://github.com/PyYoshi/slack-dump/releases/download/v1.1.3/slack-dump-v1.1.3-linux-386.tar.gz
$ tar -zxvf slack-dump-v1.1.3-linux-386.tar.gz
$ linux-386/slack-dump -t=<token> <channel>
The dump also pulls in DMs, which just get in the way, so move the channel you want somewhere else.
python
import zipfile, os

os.mkdir('dumps')
with zipfile.ZipFile('./zipfile_name') as z:
    for n in z.namelist():
        # extract only the files for the channel we care about
        if 'channel_name' in n:
            z.extract(n, './dumps')
Open the files and collect their contents; since the dump is split by date, combine everything into one list.
python
import json, glob

posts = []
# the extracted dump contains one JSON file per day
files = glob.glob('./dumps/<channel_name>/*.json')
for file in files:
    with open(file) as f:
        posts += json.loads(f.read())
Extract the messages and associate each article title with its user name (this part depends on how things are set up in IFTTT).
python
user_post_dic = {
    'Y': [],
    'S': [],
    'M': [],
    'R': [],
}

for p in posts:
    # only look at posts made by the IFTTT integration
    if "username" not in p or p["username"] != "IFTTT":
        continue
    for a in p["attachments"]:
        # crude way to skip malformed attachments
        try:
            user_post_dic[a["text"]].append(a["title"])
        except:
            pass

users = user_post_dic.keys()
print([[u, len(user_post_dic[u])] for u in users])
output
[['Y', 864], ['S', 896], ['M', 240], ['R', 79]]
The posted messages look like the following; the site name and URL parts are unnecessary, so remove them.
Use Neovim in your browser's text area<http://Developers.IO|Developers.IO>
Security measures for front-end engineers/ #frontkansai 2019 - Speaker Deck
Japanese with matplotlib
Reintroduction to Modern JavaScript/ Re-introduction to Modern JavaScript - Speaker Deck
I didn't know a good way to use re, so I just pushed through with a pile of regular expressions.
MeCab is also used for word segmentation; the environment includes SudachiPy and others, but the tool you already know is the fastest to work with.
python
import MeCab, re

m = MeCab.Tagger("-Owakati")

# patterns for stripping HTML tags, URLs, site names, etc. from the titles
_tag = re.compile(r'<.*?>')
_url = re.compile(r'(http|https)://([-\w]+\.)+[-\w]+(/[-\w./?%&=]*)?')
_title = re.compile(r'( - ).*$')
_par = re.compile(r'\(.*?\)')
_sla = re.compile(r'/.*$')
_qt = re.compile(r'"')
_sep = re.compile(r'\|.*$')
_twi = re.compile(r'(.*)on Twitter: ')
_lab = re.compile(r'(.*) ⇒ \(')
_last_par = re.compile(r'\)$')

def clean_text(text):
    # normalize full-width ASCII characters to half-width
    text = text.translate(str.maketrans({chr(0xFF01 + i): chr(0x21 + i) for i in range(94)}))
    text = re.sub(_lab, '', text)
    text = re.sub(_tag, '', text)
    text = re.sub(_url, '', text)
    text = re.sub(_title, '', text)
    text = re.sub(_sla, '', text)
    text = re.sub(_qt, '', text)
    text = re.sub(_sep, '', text)
    text = re.sub(_twi, '', text)
    text = re.sub(_par, '', text)
    text = re.sub(_last_par, '', text)
    return text
p_all = []
m_all = []
for u in users:
    # clean the titles, then tokenize them with MeCab into word lists (Doc2Vec expects lists of tokens)
    user_post_dic[u] = list(map(clean_text, user_post_dic[u]))
    m_all += [m.parse(p).strip().split() for p in user_post_dic[u]]
    p_all += [u + '**' + p for p in user_post_dic[u]]
The reason the user name is prepended to each element of p_all is that some texts disappear entirely during preprocessing and the list indices shift, so user and title are tied together in this somewhat crude way.
(Incidentally, a bookmark whose title is just the URL ends up empty after cleaning.)
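As a small illustration of the tying and untying (using one of the titles above), the user and the title can be recovered later by splitting on the separator:

python
# a p_all entry looks like '<user>**<title>' and can be split back apart later
tag = 'M' + '**' + 'Japanese with matplotlib'
user, title = tag.split('**')  # -> 'M', 'Japanese with matplotlib'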
For now, the titles came out nice and clean:
Use Neovim in your browser's text area
Security measures for front-end engineers
Japanese with matplotlib
Reintroduction to Modern JavaScript
Doc2Vec
m_all is the tokenized text used as the material for learning the distributed representations, and p_all is just a list of names (labels).
I did not put much thought into the parameters.
python
from gensim import models

# Reference article: http://qiita.com/okappy/items/32a7ba7eddf8203c9fa1
class LabeledListSentence(object):
    def __init__(self, words_list, labels):
        self.words_list = words_list
        self.labels = labels

    def __iter__(self):
        for i, words in enumerate(self.words_list):
            yield models.doc2vec.TaggedDocument(words, ['%s' % self.labels[i]])

sentences = LabeledListSentence(m_all, p_all)
model = models.Doc2Vec(
    alpha=0.025,
    min_count=5,
    vector_size=100,
    epochs=20,
    workers=4
)

# build the vocabulary from the sentences we have
model.build_vocab(sentences)
model.train(
    sentences,
    total_examples=len(m_all),
    epochs=model.epochs
)

# get the tags back from the model, since their order may differ from the input order
tags = model.docvecs.offset2doctag
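Just as a sanity check on the learned vectors (a minimal sketch I'm adding here, not part of the original pipeline), you can ask the model for the document tags nearest to a given tag through the same gensim 3.x `docvecs` interface used above:

python
# pick an arbitrary learned tag and look at its nearest neighbours
sample_tag = tags[0]
for neighbour, score in model.docvecs.most_similar(positive=[sample_tag], topn=5):
    print(neighbour, score)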
This is my first time using scikit-learn's PCA; having once studied the algorithm in some depth, it feels amazing to be able to use it in two lines.
python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import japanize_matplotlib

# look up the learned vector for each tag
vecs = [model.docvecs[p] for p in tags]

# untie the user name and the title from each tag
tag_users = [p.split('**')[0] for p in tags]
tag_docs = [p.split('**')[1] for p in tags]

# it was hard to find 4 colors of similar strength
cols = ["#0072c2", "#fc6993", "#ffaa1c", "#8bd276"]

# 2D because we plot on a plane
pca = PCA(n_components=2)
coords = pca.fit_transform(vecs)

fig, ax = plt.subplots(figsize=(16, 12))
x = [v[0] for v in coords]
y = [v[1] for v in coords]
# loop per user so that each one gets its own color and legend entry
for i, u in enumerate(set(tag_users)):
    x_of_u = [v for j, v in enumerate(x) if tag_users[j] == u]
    y_of_u = [v for j, v in enumerate(y) if tag_users[j] == u]
    ax.scatter(
        x_of_u,
        y_of_u,
        label=u,
        c=cols[i],
        s=30,
        alpha=1,
        linewidth=0.2,
        edgecolors='#777777'
    )
plt.legend(
    loc='upper right',
    prop={'size': 18}
)
plt.show()
There are a lot of duplicate bookmarks to begin with, so unfortunately the clusters don't separate cleanly. If the data grows a bit more, I'd like to try inferring which user a new article belongs to and turn that into recommendations.
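As a rough sketch of one possible way to do that (entirely my own assumption, not something implemented in this article): infer a vector for a new title, compare it to the average vector of each user's titles, and recommend the article to the closest user.

python
import numpy as np

# hypothetical sketch: average each user's title vectors, then assign a new
# title to the nearest user by cosine similarity
user_centroids = {
    u: np.mean([model.docvecs[t] for t in tags if t.split('**')[0] == u], axis=0)
    for u in set(tag_users)
}

def guess_user(title):
    words = m.parse(clean_text(title)).strip().split()
    v = model.infer_vector(words)
    sims = {u: np.dot(v, c) / (np.linalg.norm(v) * np.linalg.norm(c))
            for u, c in user_centroids.items()}
    return max(sims, key=sims.get)

print(guess_user('Japanese with matplotlib'))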
Sorry for being late (12/11 21:00).