I want to perform simple natural language processing (morphological analysis plus a little more) with MeCab as a pre-processing step for Azure Data Factory. It would be convenient to implement this as a function that can later be called from various services such as Logic Apps, so I considered two ways of implementing it.
Azure Functions seems sufficient for now, but with an eye to heavier processing such as machine learning in the future, I also tried Databricks, partly because I wanted to understand the Databricks service itself.
To state the conclusions first:
**・For beginners with Azure Databricks, the following Microsoft Learn path (free) is easy to follow**
Run Data Engineering with Azure Databricks https://docs.microsoft.com/ja-jp/learn/paths/data-engineering-with-databricks/
**・MeCab can be used by installing "mecab-python3" on the cluster from PyPI**
**・Everything can be done by accessing the Azure portal and Databricks from a browser; no local environment setup is required**
There is a lot I do not yet fully understand, so please point out any mistakes. I will correct and add to this article as needed.
Azure Databricks is an Apache Spark-based analytics platform. Compute resources can be scaled out for distributed processing as needed.
Some parts of the pricing are a little hard to understand, but roughly speaking you are charged for the following two things.
· Virtual machines (VMs) provisioned in the cluster
· Databricks units (DBUs), which depend on the selected VM instance
There are also small charges for managed disks, blob storage, and public IP addresses.
Azure Databricks pricing https://azure.microsoft.com/ja-jp/pricing/details/databricks/
Note that with the 14-day "trial version", DBU charges are waived, but VMs are still billed as usual.
With Databricks itself (not Azure), you can try the service free for 14 days, including compute resources. The interface is the same for Azure Databricks and Databricks, so this is a good way to try it out. https://databricks.com/try-databricks
You can choose Python, Scala, SQL, or R when you create a notebook. With Databricks magic commands, you can also mix multiple languages in one notebook. (For example, if you write %python at the beginning of a cell, that cell is executed as Python.)
If you search for Azure Databricks in the Azure portal and create it in the usual way, there is nothing particularly confusing.
You may wonder whether to set the pricing tier to Standard or Premium, but it appears that the tier can be changed later while keeping the notebooks, users, and cluster configuration, so there is no need to agonize over it. Premium adds enhanced access control, authentication, and audit logging.
Azure Databricks workspace upgrade or downgrade https://docs.microsoft.com/ja-jp/azure/databricks/administration-guide/account-settings/account#upgrade-or-downgrade-an-azure-databricks-workspace
Also, as mentioned above, if you select the trial version and leave it running, you will still be charged the full VM fee, so be careful. (Only the DBU charges are waived.)
After deploying Databricks, go to the resource and launch the workspace. Select Clusters on the Databricks screen and click Create Cluster.
Create the cluster by setting the type and number of VMs to be provisioned.
Libraries can be installed from the details screen of the created cluster.
From there, you can install packages from PyPI and other sources.
The installation succeeded.
Create a Python notebook from Workspace > Create > Notebook. After that, you can run morphological analysis simply by using import MeCab.
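As a minimal sketch of what such a notebook cell might look like, the code below wraps MeCab in a small function and converts its default text output into (surface, part of speech) pairs. The helper names `parse_mecab_output` and `tokenize` are made up for this illustration, and the sketch assumes that a dictionary is available on the cluster (recent versions of mecab-python3 may need a separate dictionary package such as unidic-lite) and that the first feature field is the part of speech, as in the IPA and UniDic dictionaries.

```python
from typing import List, Tuple


def parse_mecab_output(raw: str) -> List[Tuple[str, str]]:
    """Convert MeCab's default text output into (surface, part_of_speech) pairs.

    Each token line looks like "surface<TAB>feature1,feature2,...",
    and the sentence ends with a line containing only "EOS".
    """
    tokens = []
    for line in raw.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        # Assumption: the first comma-separated feature is the part of speech.
        tokens.append((surface, features.split(",")[0]))
    return tokens


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Run MeCab on text; requires mecab-python3 and a dictionary on the cluster."""
    import MeCab  # imported lazily so the parser above works without MeCab

    tagger = MeCab.Tagger()
    return parse_mecab_output(tagger.parse(text))
```

In a notebook you would call something like `tokenize("すもももももももものうち")`; the exact tokens and tags depend on the dictionary installed on the cluster.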
Compared to using Python with Azure Functions, setup was very easy because everything could be done on the web. It is also easier when several people share the work, because there is no need to keep local environments in sync.
The cost of the "DS3 v2" instance, which is selected by default, is listed on the pricing page below. You are charged for the time (in minutes) that the instance is up.
The cluster scales out under heavy load, so, for example, doubling the number of compute nodes (workers) doubles the bill. (Both the VM and DBU charges double.)
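The billing rule described above can be sketched as a small calculation. This is only a rough estimate under stated assumptions: the function name `estimate_cluster_cost` and all rate values are placeholders invented for this example (not real Azure prices), the driver node is ignored, and the small storage and IP address charges are left out. Look up the actual VM rate, DBU count, and DBU price on the pricing page.

```python
def estimate_cluster_cost(
    nodes: int,
    minutes: float,
    vm_rate_per_hour: float,
    dbu_per_node_hour: float,
    dbu_price: float,
) -> float:
    """Rough cluster cost: VM charge plus DBU charge, both proportional to node count.

    All rates are caller-supplied placeholders; take real values from the pricing page.
    """
    hours = minutes / 60
    vm_cost = nodes * vm_rate_per_hour * hours
    dbu_cost = nodes * dbu_per_node_hour * dbu_price * hours
    return vm_cost + dbu_cost


# Placeholder rates for illustration only (NOT actual Azure prices).
two_nodes = estimate_cluster_cost(2, 60, vm_rate_per_hour=1.0, dbu_per_node_hour=1.0, dbu_price=0.5)
four_nodes = estimate_cluster_cost(4, 60, vm_rate_per_hour=1.0, dbu_per_node_hour=1.0, dbu_price=0.5)
print(two_nodes, four_nodes)  # doubling the nodes doubles the estimate
```

Because both terms scale with `nodes`, the estimate grows linearly with the number of workers, which matches the observation that doubling the workers doubles both the VM and DBU charges.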
Azure Databricks pricing https://azure.microsoft.com/ja-jp/pricing/details/databricks/