I set up Airflow as part of my personal study, so here are the things I learned along the way [^1]. I hope it helps fewer people get stuck on the same problems.
- Use LocalExecutor
- MySQL container and Airflow container
- Access Redshift from Airflow
- Write the Dockerfile yourself
  - Including `entrypoint.sh`, by the way
- For learning purposes, avoid relying on puckel/docker-airflow
  - It is still insanely helpful as a reference
- Note that the `master` and `latest` tags of the official Docker image are version 2.0, which is still in development [^2]
AIRFLOW_EXTRAS
These are extras for extending Airflow, covering everything from DB backends such as MySQL to GCP access. The official Dockerfile has the following:
ARG AIRFLOW_EXTRAS="async,aws,azure,celery,dask,elasticsearch,gcp,kubernetes,mysql,postgres,redis,slack,ssh,statsd,virtualenv"
`crypto` is virtually required, as it is needed to generate the `FERNET_KEY`.
I use MySQL for the backend DB and psycopg2 to connect to Redshift, so I also need the extras related to those.
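Roughly speaking, the extras just select which optional dependencies get installed along with Airflow. A simplified sketch of how that looks in a Dockerfile (the version pin and the reduced set of extras here are only examples, not the official Dockerfile's exact command):

ARG AIRFLOW_VERSION=1.10.10
ARG AIRFLOW_EXTRAS="crypto,mysql,postgres"
# Install Airflow together with the selected extras
RUN pip install --no-cache-dir "apache-airflow[${AIRFLOW_EXTRAS}]==${AIRFLOW_VERSION}"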
entrypoint.sh
As described in the docs (https://airflow.apache.org/docs/stable/howto/secure-connections.html) and in puckel, generate a `FERNET_KEY` so that connection credentials are encrypted. This is safer than hard-coding the key in `airflow.cfg`.
: "${AIRFLOW__CORE__FERNET_KEY:=${FERNET_KEY:=$(python -c "from cryptography.fernet import Fernet; FERNET_KEY = Fernet.generate_key().decode(); print(FERNET_KEY)")}}"
Even if you specify `depends_on: mysql` in `docker-compose.yml`, it only waits for the container to start and does not confirm that the database is ready. puckel uses the `nc` command in `entrypoint.sh` to check whether a connection to the DB can actually be established. Details like this are very helpful.
# TRY_LOOP (the number of retries) is defined earlier in entrypoint.sh
wait_for_port() {
  local name="$1" host="$2" port="$3"
  local j=0
  # Retry until nc can open a TCP connection to host:port
  while ! nc -z "$host" "$port" >/dev/null 2>&1 < /dev/null; do
    j=$((j+1))
    if [ $j -ge $TRY_LOOP ]; then
      echo >&2 "$(date) - $host:$port still not reachable, giving up"
      exit 1
    fi
    echo "$(date) - waiting for $name... $j/$TRY_LOOP"
    sleep 5
  done
}
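# Example call for this article's setup (MYSQL_HOST and MYSQL_PORT come from docker-compose.yml)
wait_for_port "MySQL" "$MYSQL_HOST" "$MYSQL_PORT"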
Set the MySQL environment variables (`MYSQL_ROOT_PASSWORD`, `MYSQL_USER`, `MYSQL_PASSWORD`, `MYSQL_DATABASE`) as shown in the `docker-compose.yml` below (Reference). These are settings on the DB side, but Airflow accesses the database with the same user and password.
my.cnf
As described in Airflow's database backend documentation (https://airflow.apache.org/docs/stable/howto/initialize-database.html), setting `explicit_defaults_for_timestamp = 1` is necessary when using MySQL. In addition, add settings for handling multi-byte characters.
[mysqld]
character-set-server=utf8mb4
explicit_defaults_for_timestamp=1
[client]
default-character-set=utf8mb4
AIRFLOW__CORE__SQL_ALCHEMY_CONN
The default is SQLite, but change it according to your DB and driver. For the format, refer to the SQLAlchemy documentation (https://docs.sqlalchemy.org/en/13/core/engines.html).
mysql+mysqldb://user:password@host:port/db
postgresql+psycopg2://user:password@host:port/db
The host is the name specified by `container_name` in `docker-compose.yml`, and the port is normally 3306 for MySQL and 5432 for PostgreSQL. The user name and DB name are the ones set above.
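If you want to sanity-check the connection string outside of Airflow itself, here is a minimal sketch with SQLAlchemy (it assumes the `airflow` user and database from the `docker-compose.yml` shown later, and has to run where the `mysql` extra has installed `mysqlclient`, e.g. inside the Airflow container):

from sqlalchemy import create_engine

# Same URI as AIRFLOW__CORE__SQL_ALCHEMY_CONN below
engine = create_engine("mysql+mysqldb://airflow:airflow@mysql:3306/airflow")
with engine.connect() as conn:
    print(conn.execute("SELECT 1").scalar())  # prints 1 if the DB is reachable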
As described in the documentation, settings can be written in several places, and environment variables take precedence (in the documentation's list, the higher an item appears, the higher its priority). For example, the environment variable `AIRFLOW__CORE__SQL_ALCHEMY_CONN` overrides `sql_alchemy_conn` in the `[core]` section of `airflow.cfg`. I want to use environment variables for AWS credentials; for static settings such as DB access, `airflow.cfg` is also fine. At the production level, the team should agree on proper rules.
There are also several places where the environment variables themselves can be set. I use them roughly as follows:

- `Dockerfile`: for things that rarely change, such as defaults
- `docker-compose.yml`: can be adjusted per container; more flexible
- `.env` file: if specified via `env_file` in `docker-compose.yml`, it is read when the container starts. Write credentials you do not want to leave in Git here.

docker-compose.yml
Since the MySQL container and the Airflow container share the same settings, they are easier to manage and maintain when written in the same place.
version: "3.7"
services:
mysql:
image: mysql:5.7
container_name: mysql
environment:
- MYSQL_ROOT_PASSWORD=password
- MYSQL_USER=airflow
- MYSQL_PASSWORD=airflow
- MYSQL_DATABASE=airflow
volumes:
- ./mysql.cnf:/etc/mysql/conf.d/mysql.cnf:ro
ports:
- "3306:3306"
airflow:
build: .
container_name: airflow
depends_on:
- mysql
environment:
- AIRFLOW_HOME=/opt/airflow
- AIRFLOW__CORE__LOAD_EXAMPLES=False
- AIRFLOW__CORE__EXECUTOR=LocalExecutor
- AIRFLOW__CORE__SQL_ALCHEMY_CONN=mysql+mysqldb://airflow:airflow@mysql:3306/airflow
- MYSQL_PORT=3306
- MYSQL_HOST=mysql
#Abbreviation
.env file
Redshift aside, AWS access keys and secret keys are highly confidential, so I do not want to write them in `docker-compose.yml` or `entrypoint.sh`. `airflow.cfg` is worth considering, but in practice this is something to discuss with the development team.
**In any case, typing them into the GUI by hand is not exactly modern.**
When writing them, refer to the documentation and use the following format:
Conn Id | Conn Type | Login | Password | Host | Port | Schema | Environment Variable |
---|---|---|---|---|---|---|---|
redshift_conn_id | postgres | awsuser | password | your-cluster-host | 5439 | dev | AIRFLOW_CONN_REDSHIFT_CONN_ID=postgres://awsuser:password@your-cluster-host:5439/dev |
aws_conn_id | aws | your-access-key | your-secret-key | | | | AIRFLOW_CONN_AWS_CONN_ID=aws://your-access-key:your-secret-key@ |
Even if the connection ID is lowercase, the environment variable name must be uppercase.
For the AWS key, you need to add `@` at the end even though there is no host; otherwise an error occurs. Also, if the secret key contains a colon or slash it will not be parsed correctly, so it is easier to simply regenerate the key.
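Concretely, the two environment variables from the table can go straight into the `.env` file (the values are the same placeholders as in the table):

AIRFLOW_CONN_REDSHIFT_CONN_ID=postgres://awsuser:password@your-cluster-host:5439/dev
AIRFLOW_CONN_AWS_CONN_ID=aws://your-access-key:your-secret-key@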
Conversely, if you want to know the URI format of a connection that was entered in the GUI or elsewhere, you can print it as follows:
from airflow.hooks.base_hook import BaseHook
conn = BaseHook.get_connection('postgres_conn_id')
print(f"AIRFLOW_CONN_{conn.conn_id.upper()}='{conn.get_uri()}'")
As with connections, you can set key-value Variables, either from Python code or with environment variables. In code:
from airflow.models import Variable
Variable.set(key="foo", value="bar")
With environment variables:
Key | Value | Environment Variable |
---|---|---|
foo | bar | AIRFLOW_VAR_FOO=bar |
Even if the key is lowercase, the environment variable name must be uppercase.
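Reading the value back in a DAG works the same way regardless of how it was set (the `default_var` fallback here is just for illustration):

from airflow.models import Variable

# Returns "bar" if the variable was set as above, otherwise the fallback
foo = Variable.get("foo", default_var="not set")
print(foo)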
Finally, here is the repository for reference.
[^1]: Udacity's Data Engineer Nanodegree. I worked with Cassandra, Redshift, Spark, and Airflow. It was supposed to take 5 months but I finished in 3, so a monthly plan seems the better deal. They also run 50%-off offers regularly, so it is worth signing up during one of those. ~~Otherwise it is way too expensive~~
[^2]: When I tried `apache/airflow:1.10.10` while writing this article, things went relatively smoothly. If you run `docker run -it --name test -p 8080 -d apache/airflow:1.10.10 " "`, the container starts with bash kept open, so you can operate it flexibly with commands such as `docker exec test airflow initdb`.