In this article, we will train word2vec in Colaboratory and visualize the result with TensorBoard.
- The output of TensorBoard will be **published on the Internet**, so please use only open data.
  - (If anyone knows how to run TensorBoard's PROJECTOR without publishing it, please let me know.)
- word2vec and TensorBoard themselves are not explained here, so please study them separately.
  - Word2Vec: The amazing power of the word vector that the inventor is surprised at
  - [Thorough introduction to TensorBoard to visualize all data](Thorough introduction to TensorBoard to visualize all data)
By training word2vec on the words of the novel, we will verify whether the computer can correctly **recognize the "I" of "I am a cat" as a "cat"**. (If it is recognized correctly, the word vector of "I" and the word vector of "cat" will be close.)
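"Close" here usually means a high cosine similarity between the two vectors. The following is a minimal sketch with made-up 3-dimensional vectors (the names and values are hypothetical; the vectors trained later have 300 dimensions).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors, not taken from the trained model
vec_i = [0.9, 0.1, 0.3]
vec_cat = [0.8, 0.2, 0.4]
vec_other = [-0.5, 0.9, -0.1]

print(cosine_similarity(vec_i, vec_cat))    # near 1: similar direction
print(cosine_similarity(vec_i, vec_other))  # negative: dissimilar direction
```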
From here, we will implement it using Google Colaboratory.
Install the necessary libraries on Colaboratory. We use the following two.
- MeCab (+ mecab-ipadic-neologd dictionary)
  - MeCab is a free morphological analyzer, used here to split sentences into words.
  - With the mecab-ipadic-neologd dictionary, proper nouns such as those found on Wikipedia are recognized correctly.
  - (Reference) [I examined the effect of "mecab-ipadic-NEologd" which is strong against new words and named entities]( and-expressions /)
- tensorboardX
Installation of MeCab (+ mecab-ipadic-neologd)
!apt-get -q -y install sudo file mecab libmecab-dev mecab-ipadic-utf8 git curl python-mecab > /dev/null
!git clone --depth 1 > /dev/null
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n > /dev/null 2>&1
!pip install mecab-python3 > /dev/null
Installation of tensorboardX
!pip install tensorboardX
%load_ext tensorboard
Import the installed and standard libraries.
Library import
import re
import MeCab
import torch
from gensim.models import word2vec
from tensorboardX import SummaryWriter
from itertools import chain
Download and unzip the zip file of "I am a cat".
This produces "wagahaiwa_nekodearu.txt", which we then read.
Data reading
# The Aozora Bunko text is Shift-JIS encoded; the with-block closes the file after reading
with open('./wagahaiwa_nekodearu.txt', 'r', encoding='shift-jis') as f:
    texts = [t.strip() for t in f.readlines()]
Let's print the contents of the file.
Output data
['I am a cat',
'Natsume Soseki',
'[About the symbols that appear in the text]',
'(Example) I "My Yes"',
'|: Symbol that identifies the beginning of a character string with ruby',
'(Example) Ichiban | Evil 《Doaku》',
'[#]: Input person note\u3000 Mainly explanation of external characters and designation of emphasis marks',
'(Numbers are JIS X 0213 area code points or Unicode, base page and number of lines)',
'(Example) * [# "Word + Making a mound", Level 4 2-88-74]',
'[]: Enclose the accent-decomposed European language',
'(Example) [Quid aliud est mulier nisi amicitiae& inimica〕',
'Please refer to the following URL for details on accent decomposition',
'[# 8 indentation] 1 [# "1" is the middle heading]',
'I am a cat. There is no name yet.',
'I have no idea where I was born. I remember only crying in a dim and damp place. I saw human beings for the first time here. Moreover, I heard later that it was the most evil race of human beings called Shosei. This student is a story that sometimes catches us, simmers them, and eats them. However, I didn't think anything at that time, so I didn't think it was particularly scary. However, when it was placed on his palm and lifted up, it just felt fluffy. It is probably the beginning of what is called a human being that calms down a little on the palm and sees the student's face. The feeling that I thought was strange at this time still remains. The face, which should be decorated with the first hair, is slippery and looks like a kettle. After that, I met a lot of cats, but I have never met such a one-wheeled cat. Not only that, the center of the face is too protruding. Then, from the inside of the hole, I sometimes blow smoke. Apparently my throat was so weak that I was really weak. It was around this time that I finally learned that this is a cigarette that humans drink.',
'I sat in a good mood for a while behind the palm of this student, but after a while I started driving at a very high speed. I don't know if the student will move or only I will move, but my eyes turn to the darkness. I feel sick. When I thought that it wouldn't help at all, I heard a loud noise and a fire broke out in my eyes. Until then, I remember it, but I don't know what to do or how much I try to come up with.',
'When I suddenly noticed, there was no student. There are many brothers, and I can't even see Piki. Even the mother of Kanjin, who is important, has disappeared. On top of that, it's bright and dark, unlike the places I've been up to now. I can't even open my eyes. If Hatena's Yoko is strange, it hurts very much when I take it out. I was suddenly abandoned into Sasahara from the top of the straw.',
・ ・ ・]
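Looking at the output, we can see that the text is not yet ready for training: it begins with an explanation of the notation rather than the novel itself, the body contains ruby and inputter notes, and each line holds several sentences.
We will handle these in the preprocessing step that follows.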
Here, we will prepare a function for preprocessing sentences.
Function for sentence preprocessing
def preprocessTexts(texts):
    # 1. Delete the description of the novel before and after the body text
    texts = texts[23:-17]
    # 2. Delete ruby / ruby delimiter / inputter notes / accent-decomposed European text / full-width spaces
    signs = re.compile(r'(《.*?》)|(｜)|(［＃.*?］)|(〔.*?〕)|(\u3000)')
    texts = [signs.sub('', t) for t in texts]
    # 3. Split each line into sentences at the full stop "。"
    texts = [t.split('。') for t in texts]
    texts = list(chain.from_iterable(texts))
    # Delete strings of one character or less (they are not sentences)
    texts = [t for t in texts if len(t) > 1]
    return texts

texts = preprocessTexts(texts)
print('Number of sentences:', len(texts))
Number of sentences: 9058
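To see what the preprocessing does to a single line, here is a small self-contained sketch. The two sample lines are hypothetical strings written in the same Aozora Bunko notation (ruby in 《》, the ruby start marker ｜, inputter notes in ［＃ ］); they are not read from the downloaded file.

```python
import re
from itertools import chain

# Ruby 《...》, ruby start marker ｜, inputter notes ［＃...］,
# accent-decomposed text 〔...〕, and the full-width space \u3000
signs = re.compile(r'(《.*?》)|(｜)|(［＃.*?］)|(〔.*?〕)|(\u3000)')

# Hypothetical sample lines
sample = [
    '\u3000吾輩《わがはい》は猫である。名前はまだ無い。',
    '一番｜獰悪《どうあく》な種族であったそうだ［＃「そうだ」に傍点］。',
]

cleaned = [signs.sub('', t) for t in sample]
# Split at the full stop and drop the empty strings left after the last "。"
sentences = list(chain.from_iterable(t.split('。') for t in cleaned))
sentences = [s for s in sentences if len(s) > 1]
print(sentences)
```

The markup is removed first and the split happens afterwards, so a note that sits right before a full stop does not end up as a sentence of its own.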
Preprocessing has resulted in a list of sentences that word2vec can learn.
It's finally time to train word2vec. Before training, each sentence must be segmented into words (word-separation). We use the installed MeCab to segment each sentence.
First, define a function that performs the segmentation.
Function for word-separation
def getWords(sentence, tokenizer, obj_pos=['all'], symbol=False):
    """
    Divide a sentence into words (word-separation)

    Parameters
    ----------
    sentence : str
        Sentence to be divided
    tokenizer : class
        MeCab tokenizer
    obj_pos : list of str, default ['all']
        Parts of speech to get
    symbol : bool, default False
        Whether to include symbols

    Returns
    -------
    words : list of str
        Sentence divided into words
    """
    node = tokenizer.parseToNode(sentence)
    words = []
    while node:
        results = node.feature.split(",")
        pos = results[0]   # Part of speech (IPADIC tags are Japanese, e.g. '名詞', '記号')
        word = results[6]  # Base (uninflected) form
        if pos != "BOS/EOS" and (pos in obj_pos or 'all' in obj_pos) and (pos != '記号' or symbol):
            if word == '*':
                # No base form registered: fall back to the surface form
                word = node.surface
            words.append(word)
        node = node.next
    return words
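The function relies on the comma-separated feature string that MeCab attaches to each token. The sketch below parses a few hand-written feature strings in the IPADIC layout without calling MeCab; the strings and the helper name `pick` are hypothetical examples, and real output depends on the dictionary in use.

```python
# Hypothetical feature strings in the IPADIC layout:
# POS, POS detail 1-3, conjugation type, conjugation form, base form, reading, pronunciation
features = {
    '猫': '名詞,一般,*,*,*,*,猫,ネコ,ネコ',
    '泣い': '動詞,自立,*,*,五段・カ行イ音便,連用タ接続,泣く,ナイ,ナイ',
    '。': '記号,句点,*,*,*,*,。,。,。',
}

def pick(surface, feature):
    """Return (part of speech, base form), falling back to the surface form."""
    fields = feature.split(',')
    pos, base = fields[0], fields[6]
    return pos, surface if base == '*' else base

parsed = {s: pick(s, f) for s, f in features.items()}
print(parsed)
```

Using the base form means that inflected forms such as 泣い are counted as the same word as 泣く during training.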
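Use the function to segment the sentences. We prepare two results: every sentence segmented into words with symbols included (used for training), and the set of nouns that appear (used later to narrow down the visualization).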
Word-separation of sentences
#Set Tokenizer
path = "-d /usr/lib/x86_64-linux-gnu/mecab/dic/mecab-ipadic-neologd"
tokenizer = MeCab.Tagger(path)
#Divide the text
words = [getWords(t, tokenizer, symbol=True) for t in texts]
# Get the nouns (IPADIC part-of-speech tags are Japanese: '名詞' means noun)
nouns = [getWords(t, tokenizer, obj_pos=['名詞']) for t in texts]
nouns = set(chain.from_iterable(nouns))  # The set of nouns that appear
Now we train word2vec. Set the parameters as follows.
| Parameter | Value | Description |
|---|---|---|
| size | 300 | Number of dimensions of the word vectors |
| sg | 1 | Algorithm to use (skip-gram: 1, C-BOW: 0) |
| min_count | 2 | Ignore words that appear fewer than min_count times |
| seed | 0 | Random seed |
We use skip-gram, which is generally said to be more accurate, and fix the seed for reproducibility. size and min_count were chosen by rule of thumb ~~(that is, somewhat arbitrarily)~~. All other parameters are left at their defaults.
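As a rough picture of what min_count does, the following sketch counts words in a few hypothetical tokenized sentences and keeps only those that appear at least min_count times. It is a simplified illustration with the standard library, not gensim's actual implementation.

```python
from collections import Counter

# Hypothetical tokenized sentences (already segmented into words)
toy_sentences = [
    ['吾輩', 'は', '猫', 'で', 'ある'],
    ['名前', 'は', 'まだ', '無い'],
    ['吾輩', 'は', '人間', 'を', '見る'],
]
min_count = 2

# Count every word, then keep those that reach the threshold
counts = Counter(w for s in toy_sentences for w in s)
vocab = sorted(w for w, c in counts.items() if c >= min_count)
print(vocab)
```

With min_count = 2, words that occur only once in the novel never receive a vector, which is why rare nouns cannot show up in the visualization.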
Parameter setting
size = 300
sg = 1
min_count = 2
seed = 0
Next is model training. To make the visualization easier to read, the learned vectors are L2-normalized (each vector is scaled to unit length).
Learning word2vec
model = word2vec.Word2Vec(words, size=size, min_count=min_count, sg=sg, seed=seed)
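The call that performs the normalization does not appear in the code above, so the following is only a sketch of the operation itself, written with the standard library: each vector is divided by its L2 norm so that its length becomes 1. The function name is hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length by dividing by its L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])
print(v)  # [0.6, 0.8]
```

After normalization, the dot product of two vectors equals their cosine similarity, so distances in the projector reflect direction rather than vector length.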
This completes the training of word2vec. To keep the visualization simple, we narrow the vocabulary down to the 500 most frequent nouns.
Storage of learning results
#Get a list of distributed expressions and words
word_vectors = model.wv.vectors
index2word = model.wv.index2word
#Get noun index
nouns_id = [i for i, n in enumerate(index2word) if n in nouns]
#Extract the top 500 words whose part of speech is a noun
word_vectors = word_vectors[nouns_id][:500]
index2word = [index2word[i] for i in nouns_id][:500]
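Taking the first 500 matches yields the most frequent nouns only because the vocabulary list is ordered from most to least frequent, which I understand to be gensim's default ordering. A toy version of the same filtering, with hypothetical words and dummy one-dimensional vectors:

```python
# Hypothetical vocabulary, ordered from most to least frequent
index2word_toy = ['の', '吾輩', 'は', '猫', '主人', 'ある', '人間']
# One dummy vector per word, in the same order
vectors_toy = [[float(i)] for i in range(len(index2word_toy))]
nouns_toy = {'吾輩', '猫', '主人', '人間'}
top_n = 3

# Indices of nouns in frequency order, cut to the first top_n
ids = [i for i, w in enumerate(index2word_toy) if w in nouns_toy][:top_n]
top_words = [index2word_toy[i] for i in ids]
top_vectors = [vectors_toy[i] for i in ids]
print(top_words, top_vectors)
```

Keeping the indices and using them for both lists guarantees that each label stays aligned with its vector.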
Finally, we visualize the trained word2vec with TensorBoard. First, write out the files needed for visualization; the tensorboardX library makes this easy.
Output file for running Tensorboard
writer = SummaryWriter('./runs')
writer.add_embedding(torch.FloatTensor(word_vectors), metadata=index2word)
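Conceptually, the projector needs a table of vectors and one label per row. The sketch below builds that pair as tab-separated text, assuming the plain TSV layout that the Embedding Projector can load (one vector per line, one label per line, in the same order). The vectors are hypothetical, and this is not a description of the exact files tensorboardX writes.

```python
import io

vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # hypothetical 3-d vectors
labels = ['吾輩', '猫']

vec_buf, meta_buf = io.StringIO(), io.StringIO()
for vec, label in zip(vectors, labels):
    # One row per word: tab-separated components, and the label on the matching line
    vec_buf.write('\t'.join(str(x) for x in vec) + '\n')
    meta_buf.write(label + '\n')

vec_tsv = vec_buf.getvalue()
meta_tsv = meta_buf.getvalue()
print(vec_tsv)
print(meta_tsv)
```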
Launch TensorBoard on the output files. By tunneling with ngrok, TensorBoard running inside Colaboratory can be opened from your browser.
Run TensorBoard
LOG_DIR = './runs'
'tensorboard --logdir={} --host --port 6006 &'
get_ipython().system_raw('./ngrok http 6006 &')
!curl -s http://localhost:4040/api/tunnels | python3 -c \
"import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"
When you execute the above code, the public URL of the ngrok tunnel is printed. Open that URL to see TensorBoard. That's it!
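The curl one-liner reads the JSON returned by ngrok's local API on port 4040 and takes the public_url of the first tunnel. The same extraction, sketched on a hypothetical response body (the field values and the URL are placeholders, not a real tunnel):

```python
import json

# Hypothetical response body; the URL is a placeholder
response = (
    '{"tunnels": [{"name": "command_line", "proto": "https", '
    '"public_url": "https://example.invalid"}]}'
)

# Same lookup as the one-liner: first tunnel, then its public URL
public_url = json.loads(response)['tunnels'][0]['public_url']
print(public_url)
```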
When you access it, it looks like the following. Wait for a while, or change "**INACTIVE**" in the upper right to "**PROJECTOR**".
Then, the result summarized in 3D by PCA will appear.
Switch from PCA to T-SNE in the panel on the left. The points are clustered further, and similar words are gathered together. Only a screenshot is shown here, but it is really interesting to watch the points move as the optimization proceeds.
Once T-SNE has converged, let's look at the similarity (distance) between "I" and "cat", which is the main question. Enter "I" in the search box on the right to find the point for "I".
"I" is in the lower left and "cat" is in the upper right. The distance is 0.317: the two are not extremely close, but they are reasonably similar. This is probably because "I" and "cat" rarely appear in the same context.
Looking at the words most similar to each, the words closest to "I" are personal pronouns such as "he" and "they", and the word closest to "cat" is "humans", so living creatures are grouped together. Judging from these similar words, the model seems to have learned well.
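The "closest words" list can be pictured as a ranking by distance. The sketch below assumes the distance is cosine distance (1 minus cosine similarity) and uses hypothetical 2-dimensional vectors chosen only to mimic the result described above; they are not values from the trained model.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical toy embedding
toy = {
    '吾輩': [1.0, 0.0],
    '彼': [0.9, 0.1],
    '猫': [0.6, 0.8],
    '人間': [0.5, 0.9],
}

def nearest(word, k=2):
    """Return the k words closest to `word`, nearest first."""
    others = [w for w in toy if w != word]
    return sorted(others, key=lambda w: cosine_distance(toy[word], toy[w]))[:k]

print(nearest('吾輩'))
print(nearest('猫'))
```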
This time, we trained word2vec on Colaboratory and visualized it with TensorBoard. It is very convenient to be able to do this easily without building an environment. It would be even more convenient if PROJECTOR could be viewed within Colaboratory without publishing the result on the Internet, and I hope that becomes possible.
I have omitted detailed explanations, so please check the reference articles for detailed explanations of terms and libraries.
- [Data visualization] Run TensorBoard Projector with Keras and Colaboratory
- [Until using Mecab-ipadic-Neologd with Google Colaboratory]( BD% BF% E3% 81% 86% E3% 81% BE% E3% 81% A7 /)
- Gensim word2vec option list
- ngrok is too convenient