Databricks is a service for building applications that process large amounts of data in parallel. It was developed by the creators of Apache Spark as a managed Spark service. I've been studying it for the past few days, so I'll write down the points that caught my attention.
In a nutshell, Apache Spark takes code that processes tabular data and automatically distributes and executes it in parallel. Developers can process huge datasets as if they were writing Pandas code in Jupyter. A machine learning library is also included, so everything from data preprocessing to analysis and prediction can be done with Spark.
Also, unlike its predecessor (?) Hadoop, the storage layer is separated. By specializing in data processing, Spark can be combined with various kinds of external storage.
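To get a feel for the "Pandas-like" style mentioned above, here is a minimal PySpark sketch (the CSV path and the column names category/amount are made up for illustration); the same DataFrame-style code is distributed across the cluster by Spark.
# Minimal PySpark sketch; the CSV path and columns ("category", "amount") are hypothetical
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

df = spark.read.csv("dbfs:/tmp/sales.csv", header=True, inferSchema=True)
summary = df.groupBy("category").agg(F.sum("amount").alias("total_amount"))
summary.show()  # the aggregation is executed in parallel on the cluster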
I'm recording the following points because they were hard to understand in Databricks.
CLI
Not that it's confusing, but it's quite tedious without the CLI, so installation is essential. Install it with pip3, then set the connection information with databricks configure and it's ready to use.
#Setting
$ pip3 install databricks-cli
$ databricks configure --token
Databricks Host (should begin with https://): https://hogehoge.cloud.databricks.com/
Token: (enter the token created in the GUI)
#Operation check
$ databricks fs ls
The databricks command has various subcommands, and databricks fs also comes with the shorthand dbfs available out of the box.
Secret
Passwords and other credentials are stored as Secrets, organized by scope.
Create scope
databricks secrets create-scope --scope astro_snowflake
List scopes
databricks secrets list-scopes
Add secrets to scope
databricks secrets put --scope hoge_scope --key User
databricks secrets put --scope hoge_scope --key Password
List secrets
databricks secrets list --scope astro_snowflake
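From a notebook, the registered secrets can then be read with dbutils.secrets (a minimal sketch; the scope and key names follow the examples above):
# dbutils is predefined in Databricks notebooks; scope/key names follow the CLI examples above
user = dbutils.secrets.get(scope="hoge_scope", key="User")
password = dbutils.secrets.get(scope="hoge_scope", key="Password")
# Printing these values in a notebook shows them redacted, so they are not leaked into the output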
Databricks uses two kinds of file storage:
Files on DBFS, handled with the databricks fs command.
Files in the Workspace, handled with the databricks workspace command.
You can't access the Workspace directly from program code!
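For reference, DBFS is also reachable from notebook code, for example via dbutils.fs (a minimal sketch; the path is just an example), whereas the Workspace is not:
# List the DBFS root from a notebook (dbutils is predefined in Databricks notebooks)
for entry in dbutils.fs.ls("dbfs:/"):
    print(entry.path)
# As noted above, there is no such direct access to Workspace files from program code;
# use the Web UI or the `databricks workspace` CLI command instead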
Databricks lets you access tables from Python or SQL. To use the same table from another language, you need to register it as a Table that is visible from SQL.
Tables come in two kinds: global tables, which can be accessed from anywhere, and local tables, which can only be accessed from the same notebook.
Register the Python DataFrame as a local table called "temp_table_name". You can now refer to it from SQL.
df.createOrReplaceTempView("temp_table_name")
Register the Python DataFrame as a global table called "global_table_name". You can now refer to it from SQL. The global table can be referenced from Data in the Web UI.
df.write.format("parquet").saveAsTable("global_table_name")
Read the table registered with the name "temp_table_name" as a Python DataFrame.
temp_table = spark.table("temp_table_name")
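Since the registered tables are visible from SQL, they can also be queried from Python through spark.sql (a minimal sketch using the table names from the examples above):
# Query the registered tables with SQL from Python
spark.sql("SELECT * FROM temp_table_name LIMIT 10").show()
spark.sql("SELECT COUNT(*) FROM global_table_name").show()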
Global tables are stored on DBFS. Where a table is saved can be checked from the location field shown by DESCRIBE DETAIL.
DESCRIBE DETAIL `global_table_name`
For example, dbfs:/user/hive/warehouse/global_table_name
https://docs.databricks.com/notebooks/notebooks-use.html
Magic command