I wanted to run some processing on Azure Databricks using a pre-trained word2vec model. I had used word2vec from Python in my local environment before and assumed it would work with copy and paste, but I got stuck, so I'm writing it down here.
To state the conclusion first:
- **Mount the Blob container you uploaded the trained model to on Databricks, and load the model from there**
- **Note that if you do not use `with open` when loading, you will get a "File not found" error**
What is word2vec?
As the name implies, word2vec converts words into vectors. It is a crucial technology, indispensable for natural language processing: it replaces a plain string of characters with a vector so that words can be handled mathematically.
"Rice" "Machine learning" "Deep learning" ↓ ↓ ↓
This makes it possible to compute the similarity between words mathematically, **as a distance in vector space**. That "machine learning" and "deep learning" have similar meanings can then be expressed as a mathematical statement.
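For instance, similarity between two word vectors is typically measured with cosine similarity. Below is a minimal sketch using made-up 3-dimensional vectors (real word2vec vectors usually have 100-300 dimensions):

```python
import numpy as np

# Made-up toy vectors; real word2vec vectors have hundreds of dimensions
machine_learning = np.array([0.8, 0.1, 0.3])
deep_learning = np.array([0.7, 0.2, 0.4])

# Cosine similarity: 1.0 means identical direction, 0 means unrelated
similarity = np.dot(machine_learning, deep_learning) / (
    np.linalg.norm(machine_learning) * np.linalg.norm(deep_learning)
)
print(similarity)  # close to 1.0 -> semantically similar
```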
The basic premise of word2vec is the idea that the meaning of a word is formed by the words that surround it. This is called the "distributional hypothesis".
To put it plainly, the meaning of a word can be inferred by looking at the words around it.
For example, suppose you have the following sentence:
**[Machine learning] technology is indispensable for realizing artificial intelligence.**
Even if the word "machine learning" is unknown to you, you can infer that it is probably some technology related to artificial intelligence.
Similarly, there may be a sentence like this:
**The technology called [deep learning] has dramatically accelerated research on artificial intelligence.**
By learning from a large number of such sentences, it becomes possible to infer the meaning of unknown words. You can also see that [machine learning] and [deep learning], which appear surrounded by similar words, are likely to be semantically similar.
However, this kind of training requires reading a huge amount of text, so the cost of training is high. The standard approach is therefore to start from a pre-trained model.
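For reference, training a model yourself with gensim looks like the sketch below (a toy corpus of pre-tokenized sentences; the parameter values are illustrative, and the keyword names follow gensim 4.x). In this article we use a pre-trained model instead.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a real corpus needs
# millions of sentences to produce meaningful vectors)
sentences = [
    ["machine", "learning", "is", "indispensable", "for", "artificial", "intelligence"],
    ["deep", "learning", "accelerated", "research", "on", "artificial", "intelligence"],
]

# vector_size / window / min_count are illustrative values
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv["learning"])  # the learned 100-dimensional vector
```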
Preparation
Create a storage account. There is nothing special to watch out for; you can create it from the Azure portal as usual.
Next, create a container. Again, nothing special to watch out for; the public access level can be left as Private.
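If you prefer scripting it, the container can also be created with the azure-storage-blob package as a rough sketch (the connection string and container name are placeholders):

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; omitting public_access keeps the container private
blob_service_client = BlobServiceClient.from_connection_string("(storage account connection string)")
blob_service_client.create_container("(container name)")
```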
Download a trained model, referring to the following article. (I used the fastText one, by the way.)
List of ready-to-use pre-trained word embedding vectors: https://qiita.com/Hironsan/items/8f7d35f0a36e0f99752c
Upload the downloaded model.vec file to the container you just created.
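Uploading through the portal or Azure Storage Explorer is fine; as a sketch, the same upload can also be scripted with the azure-storage-blob package (connection string and names are placeholders):

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: use your own connection string and container name
connect_str = "(storage account connection string)"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(
    container="(container name)", blob="model.vec"
)

# Stream the local model file up to the Blob container
with open("model.vec", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)
```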
Execution
From here on, everything is done in a Databricks notebook.
This article was very easy to understand.
Analyze the data in the Blob with a query! https://tech-blog.cloud-config.jp/2020-04-30-databricks-for-ml/
```python
mount_name = "(arbitrary mount directory name)"
storage_account_name = "(storage account name)"
container_name = "(container name)"
storage_account_access_key = "(storage account access key)"

mount_point = "/mnt/" + mount_name
source = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net"
conf_key = "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net"

# Mount the Blob container onto DBFS under /mnt/(mount_name)
mounted = dbutils.fs.mount(
    source=source,
    mount_point=mount_point,
    extra_configs={conf_key: storage_account_access_key}
)
```
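To check that the mount actually worked, listing the mount point is a quick sanity check (`display` is a Databricks notebook helper):

```python
# The uploaded model.vec should appear in this listing
display(dbutils.fs.ls(mount_point))
```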
```python
import gensim

# Load directly from the mount point -> this fails (see below)
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
    "/mnt/(mount_name)/model.vec", binary=False
)
```
When I run the above, I get an error for some reason, even though the mount was created properly. (Locally, the same code works fine.)
```
FileNotFoundError: [Errno 2] No such file or directory:
```
So, open the file with `with open` as shown below, and pass the resulting file object `f_read` to the loader. Note that plain Python file I/O goes through the local file API, so the path needs the `/dbfs` prefix.
```python
import gensim

# Open via the local file API path (note the /dbfs prefix) and pass
# the file object to gensim instead of a bare path
with open("/dbfs/mnt/(mount_name)/model.vec", "r") as f_read:
    word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(f_read, binary=False)
```
Databricks File System (DBFS) - Local file APIs: https://docs.microsoft.com/ja-jp/azure/databricks/data/databricks-file-system#local-file-apis
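In other words, the mount is visible to Spark utilities at `/mnt/...`, but ordinary Python file I/O sees DBFS through the local file API, which is exposed under `/dbfs`. A quick way to confirm (path placeholder as above):

```python
import os

# Both paths refer to the same mounted file, but only the /dbfs form
# exists on the driver's local filesystem
print(os.path.exists("/mnt/(mount_name)/model.vec"))       # False from plain Python
print(os.path.exists("/dbfs/mnt/(mount_name)/model.vec"))  # True: /dbfs prefix needed
```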
This time it loaded successfully.
Let's try it out: find the words closest to "Japanese".
```python
word2vec_model.most_similar(positive=['Japanese'])
```

```
Out[3]:
[('Chinese', 0.7151615619659424),
 ('Japanese', 0.5991291999816895),
 ('Foreign', 0.5666396617889404),
 ('Japanese', 0.5619238018989563),
 ('Korean', 0.5443094968795776),
 ('Overseas Chinese', 0.5377858877182007),
 ('Resident in Japan', 0.5263140201568604),
 ('Chinese', 0.5200497508049011),
 ('Residence', 0.5198684930801392),
 ('International Student', 0.5194666981697083)]
```
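Once loaded, the model supports the other standard KeyedVectors operations as well, for example pairwise similarity and word analogies (the tokens below are illustrative; the fastText model used in this article is Japanese):

```python
# Similarity between two specific words (higher = more similar)
print(word2vec_model.similarity('Japanese', 'Korean'))

# Classic analogy: 'king' - 'man' + 'woman' ~ 'queen'
print(word2vec_model.most_similar(positive=['king', 'woman'], negative=['man']))
```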