Add quick start for Airflow on Docker #13660

Merged (17 commits, Jan 26, 2021)
2 changes: 1 addition & 1 deletion .gitignore
@@ -17,7 +17,7 @@ airflow/www/static/dist
airflow/www_rbac/static/coverage/
airflow/www_rbac/static/dist/

-logs/
+/logs/
airflow-webserver.pid

# Byte-compiled / optimized / DLL files
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -522,7 +522,7 @@ repos:
- https://raw.githubusercontent.com/compose-spec/compose-spec/master/schema/compose-spec.json
language: python
pass_filenames: true
-files: scripts/ci/docker-compose/.+.yml
+files: ^scripts/ci/docker-compose/.+\.ya?ml$|docker-compose.ya?ml$
require_serial: true
additional_dependencies: ['jsonschema==3.2.0', 'PyYAML==5.3.1', 'requests==2.25.0']
- id: json-schema
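The broadened pattern also covers the quick-start compose file added in this PR. A minimal sketch of which paths the new ``files`` regex picks up (assuming ``grep -E`` approximates the regex search pre-commit performs on each candidate path; the first and third file names are illustrative):

    # Only the first two paths should be printed; the last one matches neither alternative.
    printf '%s\n' \
        scripts/ci/docker-compose/base.yml \
        docs/apache-airflow/start/docker-compose.yaml \
        scripts/ci/libraries/other.yml |
        grep -E '^scripts/ci/docker-compose/.+\.ya?ml$|docker-compose.ya?ml$'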
1 change: 1 addition & 0 deletions airflow/cli/commands/db_command.py
@@ -46,6 +46,7 @@ def upgradedb(args):
"""Upgrades the metadata database"""
print("DB: " + repr(settings.engine.url))
db.upgradedb()
print("Upgrades done")


def check_migrations(args):
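This code path runs when the metadata database is migrated from the CLI, and, in the quick start below, when the ``airflow-init`` service starts with ``_AIRFLOW_DB_UPGRADE: 'true'``. A minimal sketch, assuming the standard Airflow 2 CLI:

    # Apply metadata database migrations; the run now finishes with the new "Upgrades done" line.
    airflow db upgrade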
45 changes: 39 additions & 6 deletions docs/apache-airflow/concepts.rst
@@ -23,6 +23,38 @@ Concepts
The Airflow platform is a tool for describing, executing, and monitoring
workflows.

.. _architecture:

Basic Airflow architecture
''''''''''''''''''''''''''

Primarily intended for development use, the basic Airflow architecture with the Local and Sequential executors is an
excellent starting point for understanding the architecture of Apache Airflow.

.. image:: img/arch-diag-basic.png


There are a few components to note:

* **Metadata Database**: Airflow uses a SQL database to store metadata about the data pipelines being run. In the
diagram above, this is represented as Postgres, which is extremely popular with Airflow. Other databases supported
by Airflow include MySQL.

* **Web Server** and **Scheduler**: The Airflow web server and Scheduler are separate processes that run (in this case)
on the local machine and interact with the database mentioned above.

* The **Executor** is shown separately above, since it is commonly discussed within Airflow and in the documentation, but
in reality it is NOT a separate process; it runs within the Scheduler.

* The **Worker(s)** are separate processes which also interact with the other components of the Airflow architecture and
the metadata repository.

[Review thread on the "Worker(s)" bullet]

Member: Are workers required in case of every executor?

Member Author: We support only the Celery executor in this guide.

Member: I may have missed it, but I could not find where we say that. In fact, we say something totally different from CeleryExecutor:

    Primarily intended for development use, the basic Airflow architecture with the Local and Sequential executors is an
    excellent starting point for understanding the architecture of Apache Airflow.

Member Author: Ahh. And that's not my text. I only moved this paragraph from another place, but I will check it more carefully.

Member Author: At the very beginning of this section it says this is the architecture for LocalExecutor and SequentialExecutor, and both of these executors use a separate process to run the DAG code, so everything is correct.

    Primarily intended for development use, the basic Airflow architecture with the Local and Sequential executors is an
    excellent starting point for understanding the architecture of Apache Airflow.


* ``airflow.cfg`` is the Airflow configuration file which is accessed by the Web Server, Scheduler, and Workers.

* **DAGs** refers to the DAG files containing Python code, representing the data pipelines to be run by Airflow. The
location of these files is specified in the Airflow configuration file, but they need to be accessible by the
Web Server, Scheduler, and Workers.
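
The DAG location mentioned in the last bullet is the ``dags_folder`` option in ``airflow.cfg``. A minimal sketch of checking it, assuming a generated config file and the ``/opt/airflow`` home used by the official image:

    # Show where this Airflow instance reads DAG files from (the path is an assumption for the official image).
    grep '^dags_folder' "${AIRFLOW_HOME:-/opt/airflow}/airflow.cfg"
    # typically prints: dags_folder = /opt/airflow/dags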

Core Ideas
''''''''''

@@ -194,10 +226,10 @@ Example DAG with decorator:

.. _concepts:executor_config:

-executor_config
-===============
+``executor_config``
+===================

-The executor_config is an argument placed into operators that allow airflow users to override tasks
+The ``executor_config`` is an argument placed into operators that allows Airflow users to override tasks
before launch. Currently this is primarily used by the :class:`KubernetesExecutor`, but will soon be available
for other overrides.

@@ -1545,7 +1577,8 @@ This example illustrates some possibilities


Packaged DAGs
-'''''''''''''
+=============

While often you will specify DAGs in a single ``.py`` file, it might sometimes
be required to combine a DAG and its dependencies. For example, you might want
to combine several DAGs together to version them together or you might want
@@ -1594,8 +1627,8 @@ do the same, but then it is more suitable to use a virtualenv and pip.
pure Python modules can be packaged.


-.airflowignore
-''''''''''''''
+``.airflowignore``
+==================

A ``.airflowignore`` file specifies the directories or files in ``DAG_FOLDER``
or ``PLUGINS_FOLDER`` that Airflow should intentionally ignore.
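A minimal hypothetical ``.airflowignore`` (assuming, as in this Airflow version, that each non-empty line is a regular expression matched against paths under ``DAG_FOLDER``; ``project_a`` and ``tenant_[\d]`` are placeholder patterns):

    project_a
    tenant_[\d]

With such a file in place, files like ``project_a_dag_1.py`` or ``tenant_1.py`` would be skipped during DAG discovery.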
2 changes: 1 addition & 1 deletion docs/apache-airflow/howto/index.rst
@@ -20,7 +20,7 @@
How-to Guides
=============

-Setting up the sandbox in the :doc:`../start` section was easy;
+Setting up the sandbox in the :doc:`/start/index` section was easy;
building a production-grade environment requires a bit more work!

These how-to guides will step you through common tasks in using and
2 changes: 1 addition & 1 deletion docs/apache-airflow/index.rst
@@ -76,7 +76,7 @@ unit of work and continuity.
Home <self>
project
license
-start
+start/index
installation
upgrading-to-2
upgrade-check
3 changes: 3 additions & 0 deletions docs/apache-airflow/redirects.txt
@@ -37,6 +37,9 @@ howto/write-logs.rst logging-monitoring/logging-tasks.rst
metrics.rst logging-monitoring/metrics.rst
howto/tracking-user-activity.rst logging-monitoring/tracking-user-activity.rst

# Quick start
start.rst start/index.rst

# References
cli-ref.rst cli-and-env-variables-ref.rst
_api/index.rst python-api-ref.rst
4 changes: 4 additions & 0 deletions docs/apache-airflow/start/.gitignore
@@ -0,0 +1,4 @@
/dags
/logs
/plugins
/.env
28 changes: 28 additions & 0 deletions docs/apache-airflow/start/airflow.sh
@@ -0,0 +1,28 @@
#!/usr/bin/env bash
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

#
# Run airflow command in container
#

PROJECT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

set -euo pipefail

export COMPOSE_FILE=${PROJECT_DIR}/docker-compose.yaml
exec docker-compose run airflow-worker "${@}"
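The script simply forwards its arguments into a one-off ``airflow-worker`` container, so any Airflow CLI command can be run without installing Airflow locally. A hedged usage sketch (run next to ``docker-compose.yaml``; you may need ``chmod +x airflow.sh`` first):

    ./airflow.sh version     # roughly: docker-compose run airflow-worker version
    ./airflow.sh dags list   # list the DAGs known to this environment

Because it uses ``docker-compose run`` rather than ``exec``, each invocation gets a fresh container instead of relying on a long-running worker.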
134 changes: 134 additions & 0 deletions docs/apache-airflow/start/docker-compose.yaml
@@ -0,0 +1,134 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#

# Basic Airflow cluster configuration for CeleryExecutor with Redis and PostgreSQL.
#
# WARNING: This configuration is for local development. Do not use it in a production deployment.
#
# This configuration supports basic configuration using environment variables or an .env file
# The following variables are supported:
#
# AIRFLOW_IMAGE_NAME - Docker image name used to run Airflow.
# Default: apache/airflow:master-python3.8
# AIRFLOW_UID - User ID in Airflow containers
# Default: 50000
# AIRFLOW_GID - Group ID in Airflow containers
# Default: 50000
# _AIRFLOW_WWW_USER_USERNAME - Username for the administrator account.
# Default: airflow
# _AIRFLOW_WWW_USER_PASSWORD - Password for the administrator account.
# Default: airflow
#
# Feel free to modify this file to suit your needs.
---
version: '3'
x-airflow-common:
  &airflow-common
  image: ${AIRFLOW_IMAGE_NAME:-apache/airflow:master-python3.8}
  environment:
    &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: CeleryExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__RESULT_BACKEND: db+postgresql://airflow:airflow@postgres/airflow
    AIRFLOW__CELERY__BROKER_URL: redis://:@redis:6379/0
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__DAGS_ARE_PAUSED_AT_CREATION: 'true'
    AIRFLOW__CORE__LOAD_EXAMPLES: 'true'
  volumes:
    - ./dags:/opt/airflow/dags
    - ./logs:/opt/airflow/logs
    - ./plugins:/opt/airflow/plugins
  user: "${AIRFLOW_UID:-50000}:${AIRFLOW_GID:-50000}"
  depends_on:
    redis:
      condition: service_healthy
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    volumes:
      - postgres-db-volume:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 5s
      retries: 5
    restart: always

  redis:
    image: redis:latest
    ports:
      - 6379:6379
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 30s
      retries: 50
    restart: always

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - 8080:8080
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/health"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    restart: always

  airflow-worker:
    <<: *airflow-common
    command: celery worker
    restart: always

  airflow-init:
    <<: *airflow-common
    command: version
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_UPGRADE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}

  flower:
    <<: *airflow-common
    command: celery flower
    ports:
      - 5555:5555
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:5555/"]
      interval: 10s
      timeout: 10s
      retries: 5
    restart: always

volumes:
  postgres-db-volume:
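
Putting it together, a typical first run with this file might look like the sketch below (a hedged outline rather than the official steps; the directory names match the bind mounts above, and the ``airflow``/``airflow`` web login comes from the defaults documented in the header comments):

    # Create the folders that are bind-mounted into the containers, and record the host
    # user ID so files written to ./logs and ./dags are not owned by root.
    mkdir -p ./dags ./logs ./plugins
    echo "AIRFLOW_UID=$(id -u)" > .env    # AIRFLOW_GID falls back to the 50000 default

    # One-off initialization: the airflow-init service applies DB migrations and creates
    # the admin account (see _AIRFLOW_DB_UPGRADE and _AIRFLOW_WWW_USER_* above).
    docker-compose up airflow-init

    # Start everything; the web UI should become available at http://localhost:8080
    # (login airflow / airflow), and Flower at http://localhost:5555.
    docker-compose up

The ``x-airflow-common`` block is a YAML anchor that every Airflow service merges in via ``<<: *airflow-common``, so the image, environment, and volume settings are defined once.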