I wanted to run some processing on Azure Databricks using a pre-trained word2vec model. I had used word2vec from Python in my local environment before and assumed it would work with copy and paste, but I got stuck, so I'm writing it down here.
To state the conclusion first:
- **Mount the Blob container you uploaded the trained model to on Databricks, and load the model from there**
- **Note that if you do not use `with open` when loading, you will get a "File not found" error**
What is word2vec?
As the name implies, word2vec converts words into vectors. It is a crucial technology, indispensable for natural language processing: it replaces a plain string of characters with a vector so that words can be handled mathematically.
"Rice" "Machine learning" "Deep learning" ↓ ↓ ↓
This makes it possible to compute the similarity between words mathematically, **as a distance in vector space**. That "machine learning" and "deep learning" have similar meanings can then be expressed as a mathematical statement.
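For instance, similarity between two word vectors is typically measured with cosine similarity. Below is a minimal sketch using made-up 3-dimensional vectors (real word2vec vectors usually have 100-300 dimensions):

```python
import numpy as np

# Made-up toy vectors; real word2vec vectors have hundreds of dimensions
machine_learning = np.array([0.8, 0.1, 0.3])
deep_learning = np.array([0.7, 0.2, 0.4])

# Cosine similarity: 1.0 means identical direction, 0 means unrelated
similarity = np.dot(machine_learning, deep_learning) / (
    np.linalg.norm(machine_learning) * np.linalg.norm(deep_learning)
)
print(similarity)  # close to 1.0 -> semantically similar
```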
The basic premise of word2vec is the idea that the meaning of a word is formed by the words that surround it. This is called the "distributional hypothesis".
To put it plainly, the meaning of a word can be inferred by looking at the words around it.
For example, suppose you have the following sentence:
**[Machine learning] technology is indispensable for realizing artificial intelligence.**
Even if the word "machine learning" is unknown to you, you can infer that it is probably some technology related to artificial intelligence.
Similarly, there may be a sentence like this:
**The technology called [deep learning] has dramatically accelerated research on artificial intelligence.**
By learning from a large number of such sentences, it becomes possible to infer the meaning of unknown words. You can also see that [machine learning] and [deep learning], which appear surrounded by similar words, are likely to be semantically similar.
However, this kind of training requires reading a huge amount of text, so the cost of training is high. The standard approach is therefore to start from a pre-trained model.
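For reference, training a model yourself with gensim looks like the sketch below (a toy corpus of pre-tokenized sentences; the parameter values are illustrative, and the keyword names follow gensim 4.x). In this article we use a pre-trained model instead.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens (a real corpus needs
# millions of sentences to produce meaningful vectors)
sentences = [
    ["machine", "learning", "is", "indispensable", "for", "artificial", "intelligence"],
    ["deep", "learning", "accelerated", "research", "on", "artificial", "intelligence"],
]

# vector_size / window / min_count are illustrative values
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
print(model.wv["learning"])  # the learned 100-dimensional vector
```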
Preparation
Create a storage account. There is nothing special to watch out for; you can create it from the Azure portal as usual.
Next, create a container. Again, nothing special to watch out for; the public access level can be left as Private.
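If you prefer scripting it, the container can also be created with the azure-storage-blob package as a rough sketch (the connection string and container name are placeholders):

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string; omitting public_access keeps the container private
blob_service_client = BlobServiceClient.from_connection_string("(storage account connection string)")
blob_service_client.create_container("(container name)")
```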
Download a trained model, referring to the following article. (I used the fastText one, by the way.)
List of ready-to-use pre-trained word embedding vectors: https://qiita.com/Hironsan/items/8f7d35f0a36e0f99752c
Upload the downloaded model.vec file to the container you just created.
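Uploading through the portal or Azure Storage Explorer is fine; as a sketch, the same upload can also be scripted with the azure-storage-blob package (connection string and names are placeholders):

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: use your own connection string and container name
connect_str = "(storage account connection string)"
blob_service_client = BlobServiceClient.from_connection_string(connect_str)
blob_client = blob_service_client.get_blob_client(
    container="(container name)", blob="model.vec"
)

# Stream the local model file up to the Blob container
with open("model.vec", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)
```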
Execution
From here on, everything is done in a Databricks notebook.
This article was very easy to understand.
Analyze the data in the Blob with a query! https://tech-blog.cloud-config.jp/2020-04-30-databricks-for-ml/
```python
mount_name = "(arbitrary mount directory name)"
storage_account_name = "(storage account name)"
container_name = "(container name)"
storage_account_access_key = "(storage account access key)"

mount_point = "/mnt/" + mount_name
source = "wasbs://" + container_name + "@" + storage_account_name + ".blob.core.windows.net"
conf_key = "fs.azure.account.key." + storage_account_name + ".blob.core.windows.net"

# Mount the Blob container onto DBFS under /mnt/(mount_name)
mounted = dbutils.fs.mount(
    source=source,
    mount_point=mount_point,
    extra_configs={conf_key: storage_account_access_key}
)
```
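To check that the mount actually worked, listing the mount point is a quick sanity check (`display` is a Databricks notebook helper):

```python
# The uploaded model.vec should appear in this listing
display(dbutils.fs.ls(mount_point))
```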
```python
import gensim

# Load directly from the mount point -> this fails (see below)
word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(
    "/mnt/(mount_name)/model.vec", binary=False
)
```
When I run the above, I get an error for some reason, even though the mount was created properly. (Locally, the same code works fine.)
```
FileNotFoundError: [Errno 2] No such file or directory:
```
So, open the file with `with open` as shown below, and pass the resulting file object `f_read` to the loader. Note that plain Python file I/O goes through the local file API, so the path needs the `/dbfs` prefix.
```python
import gensim

# Open via the local file API path (note the /dbfs prefix) and pass
# the file object to gensim instead of a bare path
with open("/dbfs/mnt/(mount_name)/model.vec", "r") as f_read:
    word2vec_model = gensim.models.KeyedVectors.load_word2vec_format(f_read, binary=False)
```
Databricks File System (DBFS) - Local file APIs: https://docs.microsoft.com/ja-jp/azure/databricks/data/databricks-file-system#local-file-apis
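In other words, the mount is visible to Spark utilities at `/mnt/...`, but ordinary Python file I/O sees DBFS through the local file API, which is exposed under `/dbfs`. A quick way to confirm (path placeholder as above):

```python
import os

# Both paths refer to the same mounted file, but only the /dbfs form
# exists on the driver's local filesystem
print(os.path.exists("/mnt/(mount_name)/model.vec"))       # False from plain Python
print(os.path.exists("/dbfs/mnt/(mount_name)/model.vec"))  # True: /dbfs prefix needed
```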
This time it loaded successfully.
Let's try it out: find the words closest to "Japanese".
```python
word2vec_model.most_similar(positive=['Japanese'])
```

```
Out[3]:
[('Chinese', 0.7151615619659424),
 ('Japanese', 0.5991291999816895),
 ('Foreign', 0.5666396617889404),
 ('Japanese', 0.5619238018989563),
 ('Korean', 0.5443094968795776),
 ('Overseas Chinese', 0.5377858877182007),
 ('Resident in Japan', 0.5263140201568604),
 ('Chinese', 0.5200497508049011),
 ('Residence', 0.5198684930801392),
 ('International Student', 0.5194666981697083)]
```
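Once loaded, the model supports the other standard KeyedVectors operations as well, for example pairwise similarity and word analogies (the tokens below are illustrative; the fastText model used in this article is Japanese):

```python
# Similarity between two specific words (higher = more similar)
print(word2vec_model.similarity('Japanese', 'Korean'))

# Classic analogy: 'king' - 'man' + 'woman' ~ 'queen'
print(word2vec_model.most_similar(positive=['king', 'woman'], negative=['man']))
```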